(cherry picked from commit 43761473c254b45883a64441dd0bc85a42f3645c)
There is a double fetch problem in audit_log_single_execve_arg()
where we first check the execve(2) argumnets for any "bad" characters
which would require hex encoding and then re-fetch the arguments for
logging in the audit record[1]. Of course this leaves a window of
opportunity for an unsavory application to munge with the data.
This patch reworks things by only fetching the argument data once[2]
into a buffer where it is scanned and logged into the audit
records(s). In addition to fixing the double fetch, this patch
improves on the original code in a few other ways: better handling
of large arguments which require encoding, stricter record length
checking, and some performance improvements (completely unverified,
but we got rid of some strlen() calls, that's got to be a good
thing).
As part of the development of this patch, I've also created a basic
regression test for the audit-testsuite, the test can be tracked on
GitHub at the following link:
* https://github.com/linux-audit/audit-testsuite/issues/25
[1] If you pay careful attention, there is actually a triple fetch
problem due to a strnlen_user() call at the top of the function.
[2] This is a tiny white lie, we do make a call to strnlen_user()
prior to fetching the argument data. I don't like it, but due to the
way the audit record is structured we really have no choice unless we
copy the entire argument at once (which would require a rather
wasteful allocation). The good news is that with this patch the
kernel no longer relies on this strnlen_user() value for anything
beyond recording it in the log, we also update it with a trustworthy
value whenever possible.
Reported-by: Pengfei Wang <wpengfeinudt@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Change-Id: I10e979e94605e3cf8d461e3e521f8f9837228aa5
Bug: 30956807
The SchedTune tasks accounting is used to identify how many tasks are in
a boostgroup and thus to bias the selection of an OPP based on the
maximum boost value of the active boostgroups.
The current implementation however update the accounting after CPU
capacity has been update. This has two effects:
a) when we enqueue a boosted task, we do not immediately boost its CPU
b) when we dequeue a boosted task, we can keep a CPU boosted even if not
required
This patch change the order of the SchedTune accounting and SchedFreq
updated to ensure to have always an updated representation of which
boosted tasks are runnable on a CPU before updating its capacity.
Reported-by: Leo Yan <leo.yan@linaro.org>
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
The previous patch:
e7ce26f - FIXUP: sched/tune: fix accounting for runnable tasks
squashed together patches of a series to fix SchedTune's accounting
issues. However, in the consolidation and cleanup of the series to merge
in the Android Common Kernel, we somehow missed a couple of important
changes:
1) the schedtune_exit function is not more required, because e7ce26f
fixes accounting of exiting tasks in a different way
2) the schedtune_initialized flag was not set at the end of
scheddtune_init_cgroup() thus failing to enabled SchedTune at boot.
This patch thus is to be considered an integration of e7ce26f.
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
[jstultz: Cherry-picked from android-3.18. It should be noted that
some of this patch was already applied in the 4.4 patches (schedtune_exit
doesn't exist for example), but this patch just ensures things are totally
synced up]
Signed-off-by: John Stultz <john.stultz@linaro.org>
Use do_div() instead of "/" operator to fix undefined references to
"__aeabi_uldivmod" build error for ARCH=arm.
Also in TP_fast_assign(), along with do_div() usage, replace "," with
";" which would have resulted in a syntax error (!), because
'#define TP_fast_assign(args...) args' would have stripped off the ","
and left white space between these two assignments after CPP phase.
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
[jstultz: Cherry-picked from common/android-3.18]
Signed-off-by: John Stultz <john.stultz@linaro.org>
Include clocksource/arm_arch_timer.h to fix implicit function
declaration of ‘arch_timer_read_counter’ build error for ARCH=arm.
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
[jstultz: Cherry-picked from common/android-3.18]
Signed-off-by: John Stultz <john.stultz@linaro.org>
commit 444969223c81c7d0a95136b7b4cfdcfbc96ac5bd upstream.
The following commit:
9642d18eee ("nohz: Affine unpinned timers to housekeepers")'
intended to affine unpinned timers to housekeepers:
unpinned timers(full dynaticks, idle) => nearest busy housekeepers(otherwise, fallback to any housekeepers)
unpinned timers(full dynaticks, busy) => nearest busy housekeepers(otherwise, fallback to any housekeepers)
unpinned timers(houserkeepers, idle) => nearest busy housekeepers(otherwise, fallback to itself)
However, the !idle_cpu(i) && is_housekeeping_cpu(cpu) check modified the
intention to:
unpinned timers(full dynaticks, idle) => any housekeepers(no mattter cpu topology)
unpinned timers(full dynaticks, busy) => any housekeepers(no mattter cpu topology)
unpinned timers(housekeepers, idle) => any busy cpus(otherwise, fallback to any housekeepers)
This patch fixes it by checking if there are busy housekeepers nearby,
otherwise falls to any housekeepers/itself. After the patch:
unpinned timers(full dynaticks, idle) => nearest busy housekeepers(otherwise, fallback to any housekeepers)
unpinned timers(full dynaticks, busy) => nearest busy housekeepers(otherwise, fallback to any housekeepers)
unpinned timers(housekeepers, idle) => nearest busy housekeepers(otherwise, fallback to itself)
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
[ Fixed the changelog. ]
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Fixes: 'commit 9642d18eee ("nohz: Affine unpinned timers to housekeepers")'
Link: http://lkml.kernel.org/r/1462344334-8303-1-git-send-email-wanpeng.li@hotmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 173be9a14f7b2e901cf77c18b1aafd4d672e9d9e upstream.
Mike reports:
Roughly 10% of the time, ltp testcase getrusage04 fails:
getrusage04 0 TINFO : Expected timers granularity is 4000 us
getrusage04 0 TINFO : Using 1 as multiply factor for max [us]time increment (1000+4000us)!
getrusage04 0 TINFO : utime: 0us; stime: 179us
getrusage04 0 TINFO : utime: 3751us; stime: 0us
getrusage04 1 TFAIL : getrusage04.c:133: stime increased > 5000us:
And tracked it down to the case where the task simply doesn't get
_any_ [us]time ticks.
Update the code to assume all rtime is utime when we lack information,
thus ensuring a task that elides the tick gets time accounted.
Reported-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Tested-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Fredrik Markstrom <fredrik.markstrom@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Radim <rkrcmar@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Fixes: 9d7fb04276 ("sched/cputime: Guarantee stime + utime == rtime")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit f3b0946d629c8bfbd3e5f038e30cb9c711a35f10 upstream.
Bharat Kumar Gogada reported issues with the generic MSI code, where the
end-point ended up with garbage in its MSI configuration (both for the vector
and the message).
It turns out that the two MSI paths in the kernel are doing slightly different
things:
generic MSI: disable MSI -> allocate MSI -> enable MSI -> setup EP
PCI MSI: disable MSI -> allocate MSI -> setup EP -> enable MSI
And it turns out that end-points are allowed to latch the content of the MSI
configuration registers as soon as MSIs are enabled. In Bharat's case, the
end-point ends up using whatever was there already, which is not what you
want.
In order to make things converge, we introduce a new MSI domain flag
(MSI_FLAG_ACTIVATE_EARLY) that is unconditionally set for PCI/MSI. When set,
this flag forces the programming of the end-point as soon as the MSIs are
allocated.
A consequence of this is that we have an extra activate in irq_startup, but
that should be without much consequence.
tglx:
- Several people reported a VMWare regression with PCI/MSI-X passthrough. It
turns out that the patch also cures that issue.
- We need to have a look at the MSI disable interrupt path, where we write
the msg to all zeros without disabling MSI in the PCI device. Is that
correct?
Fixes: 52f518a3a7 "x86/MSI: Use hierarchical irqdomains to manage MSI interrupts"
Reported-and-tested-by: Bharat Kumar Gogada <bharat.kumar.gogada@xilinx.com>
Reported-and-tested-by: Foster Snowhill <forst@forstwoof.ru>
Reported-by: Matthias Prager <linux@matthiasprager.de>
Reported-by: Jason Taylor <jason.taylor@simplivity.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Cc: linux-pci@vger.kernel.org
Link: http://lkml.kernel.org/r/1468426713-31431-1-git-send-email-marc.zyngier@arm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This deliberately changes the behavior of the per-cpuset
cpus file to not be effected by hotplug. When a cpu is offlined,
it will be removed from the cpuset/cpus file. When a cpu is onlined,
if the cpuset originally requested that that cpu was part of the cpuset,
that cpu will be restored to the cpuset. The cpus files still
have to be hierachical, but the ranges no longer have to be out of
the currently online cpus, just the physically present cpus.
Change-Id: I22cdf33e7d312117bcefba1aeb0125e1ada289a9
Signed-off-by: Dmitry Shmidt <dimitrysh@google.com>
x86_64:allmodconfig fails to build with the following error.
ERROR: "rcu_sync_lockdep_assert" [kernel/locking/locktorture.ko] undefined!
Introduced by commit 3228c5eb7a ("RFC: FROMLIST: locking/percpu-rwsem:
Optimize readers and reduce global impact"). The applied upstream version
exports the missing symbol, so let's do the same.
Change-Id: If4e516715c3415fe8c82090f287174857561550d
Fixes: 3228c5eb7a ("RFC: FROMLIST: locking/percpu-rwsem: Optimize ...")
Signed-off-by: Guenter Roeck <groeck@chromium.org>
cgroup_threadgroup_rwsem is acquired in read mode during process exit
and fork. It is also grabbed in write mode during
__cgroups_proc_write(). I've recently run into a scenario with lots
of memory pressure and OOM and I am beginning to see
systemd
__switch_to+0x1f8/0x350
__schedule+0x30c/0x990
schedule+0x48/0xc0
percpu_down_write+0x114/0x170
__cgroup_procs_write.isra.12+0xb8/0x3c0
cgroup_file_write+0x74/0x1a0
kernfs_fop_write+0x188/0x200
__vfs_write+0x6c/0xe0
vfs_write+0xc0/0x230
SyS_write+0x6c/0x110
system_call+0x38/0xb4
This thread is waiting on the reader of cgroup_threadgroup_rwsem to
exit. The reader itself is under memory pressure and has gone into
reclaim after fork. There are times the reader also ends up waiting on
oom_lock as well.
__switch_to+0x1f8/0x350
__schedule+0x30c/0x990
schedule+0x48/0xc0
jbd2_log_wait_commit+0xd4/0x180
ext4_evict_inode+0x88/0x5c0
evict+0xf8/0x2a0
dispose_list+0x50/0x80
prune_icache_sb+0x6c/0x90
super_cache_scan+0x190/0x210
shrink_slab.part.15+0x22c/0x4c0
shrink_zone+0x288/0x3c0
do_try_to_free_pages+0x1dc/0x590
try_to_free_pages+0xdc/0x260
__alloc_pages_nodemask+0x72c/0xc90
alloc_pages_current+0xb4/0x1a0
page_table_alloc+0xc0/0x170
__pte_alloc+0x58/0x1f0
copy_page_range+0x4ec/0x950
copy_process.isra.5+0x15a0/0x1870
_do_fork+0xa8/0x4b0
ppc_clone+0x8/0xc
In the meanwhile, all processes exiting/forking are blocked almost
stalling the system.
This patch moves the threadgroup_change_begin from before
cgroup_fork() to just before cgroup_canfork(). There is no nee to
worry about threadgroup changes till the task is actually added to the
threadgroup. This avoids having to call reclaim with
cgroup_threadgroup_rwsem held.
tj: Subject and description edits.
Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Acked-by: Zefan Li <lizefan@huawei.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: stable@vger.kernel.org # v4.2+
Signed-off-by: Tejun Heo <tj@kernel.org>
[jstultz: Cherry-picked from:
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git 568ac888215c7f]
Change-Id: Ie8ece84fb613cf6a7b08cea1468473a8df2b9661
Signed-off-by: John Stultz <john.stultz@linaro.org>
The current percpu-rwsem read side is entirely free of serializing insns
at the cost of having a synchronize_sched() in the write path.
The latency of the synchronize_sched() is too high for cgroups. The
commit 1ed1328792 talks about the write path being a fairly cold path
but this is not the case for Android which moves task to the foreground
cgroup and back around binder IPC calls from foreground processes to
background processes, so it is significantly hotter than human initiated
operations.
Switch cgroup_threadgroup_rwsem into the slow mode for now to avoid the
problem, hopefully it should not be that slow after another commit
80127a39681b ("locking/percpu-rwsem: Optimize readers and reduce global
impact").
We could just add rcu_sync_enter() into cgroup_init() but we do not want
another synchronize_sched() at boot time, so this patch adds the new helper
which doesn't block but currently can only be called before the first use.
Cc: Tejun Heo <tj@kernel.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Reported-by: John Stultz <john.stultz@linaro.org>
Reported-by: Dmitry Shmidt <dimitrysh@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
[jstultz: backported to 4.4]
Change-Id: I34aa9c394d3052779b56976693e96d861bd255f2
Mailing-list-URL: https://lkml.org/lkml/2016/8/11/557
Signed-off-by: John Stultz <john.stultz@linaro.org>
Currently the percpu-rwsem switches to (global) atomic ops while a
writer is waiting; which could be quite a while and slows down
releasing the readers.
This patch cures this problem by ordering the reader-state vs
reader-count (see the comments in __percpu_down_read() and
percpu_down_write()). This changes a global atomic op into a full
memory barrier, which doesn't have the global cacheline contention.
This also enables using the percpu-rwsem with rcu_sync disabled in order
to bias the implementation differently, reducing the writer latency by
adding some cost to readers.
Mailing-list-URL: https://lkml.org/lkml/2016/8/9/181
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
[jstultz: Backported to 4.4]
Change-Id: I8ea04b4dca2ec36f1c2469eccafde1423490572f
Signed-off-by: John Stultz <john.stultz@linaro.org>
commit bca014caaa6130e57f69b5bf527967aa8ee70fdd upstream.
Signing a module should only make it trusted by the specific kernel it
was built for, not anything else. Loading a signed module meant for a
kernel with a different ABI could have interesting effects.
Therefore, treat all signatures as invalid when a module is
force-loaded.
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 43761473c254b45883a64441dd0bc85a42f3645c upstream.
There is a double fetch problem in audit_log_single_execve_arg()
where we first check the execve(2) argumnets for any "bad" characters
which would require hex encoding and then re-fetch the arguments for
logging in the audit record[1]. Of course this leaves a window of
opportunity for an unsavory application to munge with the data.
This patch reworks things by only fetching the argument data once[2]
into a buffer where it is scanned and logged into the audit
records(s). In addition to fixing the double fetch, this patch
improves on the original code in a few other ways: better handling
of large arguments which require encoding, stricter record length
checking, and some performance improvements (completely unverified,
but we got rid of some strlen() calls, that's got to be a good
thing).
As part of the development of this patch, I've also created a basic
regression test for the audit-testsuite, the test can be tracked on
GitHub at the following link:
* https://github.com/linux-audit/audit-testsuite/issues/25
[1] If you pay careful attention, there is actually a triple fetch
problem due to a strnlen_user() call at the top of the function.
[2] This is a tiny white lie, we do make a call to strnlen_user()
prior to fetching the argument data. I don't like it, but due to the
way the audit record is structured we really have no choice unless we
copy the entire argument at once (which would require a rather
wasteful allocation). The good news is that with this patch the
kernel no longer relies on this strnlen_user() value for anything
beyond recording it in the log, we also update it with a trustworthy
value whenever possible.
Reported-by: Pengfei Wang <wpengfeinudt@gmail.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Part of the responsibility of the update_sg_lb_stats() function is to
update the idle_cpus statistical counter in struct sg_lb_stats. This
check is done by calling idle_cpu(). The idle_cpu() function, in
turn, checks a number of fields within the run queue structure such
as rq->curr and rq->nr_running.
With the current layout of the run queue structure, rq->curr and
rq->nr_running are in separate cachelines. The rq->curr variable is
checked first followed by nr_running. As nr_running is also accessed
by update_sg_lb_stats() earlier, it makes no sense to load another
cacheline when nr_running is not 0 as idle_cpu() will always return
false in this case.
This patch eliminates this redundant cacheline load by checking the
cached nr_running before calling idle_cpu().
Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Douglas Hatch <doug.hatch@hpe.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott J Norton <scott.norton@hpe.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1448478580-26467-2-git-send-email-Waiman.Long@hpe.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit a426f99c91d1036767a7819aaaba6bd3191b7f06)
Signed-off-by: Javi Merino <javi.merino@arm.com>
Two fixups that have been reported on LKML. The next version of
scheduler-driver cpu frequency selection patch set should include
these fixes and we can drop this patch then.
Signed-off-by: Ricky Liang <jcliang@chromium.org>
Change-Id: Ia2f8b5c0dd5dac06580256eeb4b259929688af68
This may be useful for detecting and debugging RT throttling issues.
Change-Id: I5807a897d11997d76421c1fcaa2918aad988c6c9
Signed-off-by: Matt Wagantall <mattw@codeaurora.org>
[rameezmustafa@codeaurora.org]: Port to msm-3.18]
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
[jstultz: forwardported to 4.4]
Signed-off-by: John Stultz <john.stultz@linaro.org>
Existing debug prints do not provide any clues about which tasks
may have triggered RT throttling. Print the names and PIDs of
all tasks on the throttled rt_rq to help narrow down the source
of the problem.
Change-Id: I180534c8a647254ed38e89d0c981a8f8bccd741c
Signed-off-by: Matt Wagantall <mattw@codeaurora.org>
[rameezmustafa@codeaurora.org]: Port to msm-3.18]
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
Because sched_setscheduler() checks p->flags & PF_NO_SETAFFINITY
without locks, a caller might observe an old value and race with the
set_cpus_allowed_ptr() call from __kthread_bind() and effectively undo
it:
__kthread_bind()
do_set_cpus_allowed()
<SYSCALL>
sched_setaffinity()
if (p->flags & PF_NO_SETAFFINITIY)
set_cpus_allowed_ptr()
p->flags |= PF_NO_SETAFFINITY
Fix the bug by putting everything under the regular scheduler locks.
This also closes a hole in the serialization of task_struct::{nr_,}cpus_allowed.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dedekind1@gmail.com
Cc: juri.lelli@arm.com
Cc: mgorman@suse.de
Cc: riel@redhat.com
Cc: rostedt@goodmis.org
Link: http://lkml.kernel.org/r/20150515154833.545640346@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 25834c73f9)
Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
BUG=chrome-os-partner:44828
TEST=Boot kernel on Oak.
TEST=smaug-release and strago-release trybots.
Change-Id: Id3c898c5ee1a22ed704e83f2ecf5f78199280d38
Reviewed-on: https://chromium-review.googlesource.com/321264
Commit-Ready: Ricky Liang <jcliang@chromium.org>
Tested-by: Ricky Liang <jcliang@chromium.org>
Reviewed-by: Ricky Liang <jcliang@chromium.org>
Conflicts:
kernel/sched/core.c
This CL separates the notion of boost and prefer_idle schedtune
attributes in cpu selection. Today only top-app
tasks are boosted. The CPU selection is slightly tweaked such that
higher order cpus are preferred only for boosted tasks (top-app) and the
rest would be skewed towards lower order cpus.
This avoids starvation issues for fg tasks when interacting with high
priority top-app tasks (a problem often seen in the case of system_server).
bug: 30245369
bug: 30292998
Change-Id: I0377e00893b9f6586eec55632a265518fd2fa8a1
Conflicts:
kernel/sched/fair.c
Currently the vmstat updater is not deferrable as a result of commit
ba4877b9ca ("vmstat: do not use deferrable delayed work for
vmstat_update"). This in turn can cause multiple interruptions of the
applications because the vmstat updater may run at
Make vmstate_update deferrable again and provide a function that folds
the differentials when the processor is going to idle mode thus
addressing the issue of the above commit in a clean way.
Note that the shepherd thread will continue scanning the differentials
from another processor and will reenable the vmstat workers if it
detects any changes.
Change-Id: Idf256cfacb40b4dc8dbb6795cf06b34e8fec7a06
Fixes: ba4877b9ca ("vmstat: do not use deferrable delayed work for vmstat_update")
Signed-off-by: Christoph Lameter <cl@linux.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Git-commit: 0eb77e9880321915322d42913c3b53241739c8aa
[shashim@codeaurora.org: resolve minor merge conflicts]
Signed-off-by: Shiraz Hashim <shashim@codeaurora.org>
[jstultz: fwdport to 4.4]
Signed-off-by: John Stultz <john.stultz@linaro.org>
When a task leaves a rq because it is migrated away it carries its
utilization with him. In this case and OPP update on the src rq might be
needed. The corresponding update at dst rq will happen at enqueue time.
Change-Id: I22754a43760fc8d22a488fe15044af93787ea7a8
sched/fair: Fix uninitialised variable in idle_balance
compiler warned, looks legit.
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
cpufreq_sched_limits (called when CPUFREQ_GOV_LIMITS event happens)
bails out if policy->rwsem is already locked. However, that rwsem is
always guaranteed to be locked when we get here after a thermal
throttling event happens:
th_throttling ->
cpufreq_update_policy()
...
down_write(&policy->rwsem);
...
cpufreq_set_policy() ->
...
__cpufreq_governor(policy, CPUFREQ_GOV_LIMITS); ->
cpufreq_sched_limits()
...
if (!down_write_trylock(&policy->rwsem))
return; <-- BAIL OUT!
So, we don't currently react immediately to thermal capping event (even
if reaction is still quick in practice, ~1ms, as lots of events are likely
to trigger a frequency selection on a high loaded system).
Fix this bug by removing the bail out condition.
While we are at it we also slightly change handling of the new limits by
clamping the last requested_freq between policy's max and min. Doing so
gives us the oppurtunity to correctly restore the last requested
frequency as soon as a thermal unthrottling event happens.
bug: 30481949
Change-Id: I3c13e818f238c1ffa66b34e419e8b87314b57427
Suggested-by: Javi Merino <javi.merino@arm.com>
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Srinath Sridharan <srinathsr@google.com>
[jstultz: fwdported to 4.4]
Signed-off-by: John Stultz <john.stultz@linaro.org>
When idle cpus cannot be found for Top-app/FG tasks, the cpu selection
algorithm picks a cpu with lowest OPP amongst the busy cpus as a second
choice.
Mitigates the "runnable" time for ui and render threads.
bug: 30481949
bug: 30342017
bug: 30508678
Change-Id: I5a97e31d33284895c0fa6f6942102713ee576d77
SchedTune needs to walk the scheduling domains to compute the energy
normalization constants used for PE space filtering. To build such
constants we need the energy model data for each CPU in the system.
However, by walking the SDs as a late initcall stage, the userspace has
been already initialized and it could happen that some CPUs are
hotplugged out.
For example, this could happen if a user-space thermal manager daemon
detects that CPUs are to much hot during the boot process.
To avoid such a race condition we can anticipate the SchedTune
initialization code to be a postcore_initicall. This allows to keep the
SchedTune initialization code as simple as an initcall while still safely
relaying on SDs provided data.
Such calls are executed before user-space is initialized and thus, apart
from the case of unlucky early-init kernel space generated hotplugs,
this solution should be safe enough to get all the data we need.
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
[jstultz: fwdported to 4.4]
Signed-off-by: John Stultz <john.stultz@linaro.org>
Hint to enable biasing of tasks towards idle cpus, even when a given
task is negatively boosted. The mechanism allows upto 20% reduction in
camera power without hurting performance.
bug: 28312446
Change-Id: I97ea5671aa1e6bcb165408b41e17bc82e41c2c9e
If cpus are busy, the cpu selection algorithm was favoring
cpus with lower capacity. This can result in uneven packing
since there will be a bias toward the same cpu until there
is a capacity change. Instead use the utilization so there
is immediate feedback as tasks are assigned
BUG: 30115868
Change-Id: I0ac7ae3ab5d8f2f5a5838c29bb6da2c3e8ef44e8
Bug: 29000863
Signed-off-by: albert.zl_huang <albert.zl_huang@htc.com>
Change-Id: I2b5a28b0a9edb31bdaa1ca2310397dd2f36f6c23
Updated to use arch_timer_read_counter() as arch_counter_get_cntvct
doesn't exist in this kernel.
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
The current kernel allows to use either PELT or WALT to track CPUs utilizations.
One of the main differences between the two approaches is that PELT
tracks only utilization of SCHED_OTHER classes while WALT tracks all tasks
with a single signal.
The current sched_freq_tick does not make this distinction and, when WALT
is in use, we end up adding multiple time the contribution related to
the RT and DL classes. This patch fixes this issue by:
1. providing two different code paths for PELT and WALT, thus granting that
when we switch to PELT we get the original behaviour based on the assumption
that class aggregations is done underneath by SchedFreq.
2. avoiding the double accounting of DL and RT workloads, when WALT is in use,
by just adding a margin to the original WALT signal when we need to check
if the CFS capacity has to be increased.
Change-Id: I7326fd50e868e97fb5e12351917e9d2969bfdae7
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
During scheduler tick handling, the frequency was being set to
max-freq if the current frequency is less than the current
utilization. Change to just request "right" frequency instead
of max.
BUG: 29871410
Change-Id: I6fe65b14413da44b1520ba116f72320083eb92f8
The CPU utilization reported when WALT is in use already tracks the
contributions due to RT and DL workloads. However, SchedFreq exposes
different capacity update functions, one for each class, and does classes
utilization internally at update_cpu_capacity_request() call time.
This patch ensures that when WALT is in use, the:
cpu_sched_capacity_reqs::cfs
value is tracking just the load generated by SCHED_OTHER tasks.
Change-Id: Ibd9c9a10874a1d91f62477034548f7664e57cd6a
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
use a window based view of time in order to track task
demand and CPU utilization in the scheduler.
Window Assisted Load Tracking (WALT) implementation credits:
Srivatsa Vaddagiri, Steve Muckle, Syed Rameez Mustafa, Joonwoo Park,
Pavan Kumar Kondeti, Olav Haugan
2016-03-06: Integration with EAS/refactoring by Vikram Mulukutla
and Todd Kjos
Change-Id: I21408236836625d4e7d7de1843d20ed5ff36c708
Includes fixes for issues:
eas/walt: Use walt_ktime_clock() instead of ktime_get_ns() to avoid a
race resulting in watchdog resets
BUG: 29353986
Change-Id: Ic1820e22a136f7c7ebd6f42e15f14d470f6bbbdb
Handle walt accounting anomoly during resume
During resume, there is a corner case where on wakeup, a task's
prev_runnable_sum can go negative. This is a workaround that
fixes the condition and warns (instead of crashing).
BUG: 29464099
Change-Id: I173e7874324b31a3584435530281708145773508
Signed-off-by: Todd Kjos <tkjos@google.com>
Signed-off-by: Srinath Sridharan <srinathsr@google.com>
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
[jstultz: fwdported to 4.4]
Signed-off-by: John Stultz <john.stultz@linaro.org>
The current definition of the Performance Boost (PB) and Performance Constraint
(PC) regions is has two main issues:
1) in the computation of the boost index we overflow the thresholds_gains
table for boost=100
2) the two cuts had _NOT_ the same ratio
The last point means that when boost=0 we do _not_ have a "standard" EAS
behaviour, i.e. accepting all candidate which decrease energy regardless
of their impact on performances. Instead, we accept only schedule candidate
which are in the Optimal region, i.e. decrease energy while increasing
performances.
This behaviour can have a negative impact also on CPU selection policies
which tries to spread tasks to reduce latencies. Indeed, for example
we could end up rejecting a schedule candidate which want to move a task
from a congested CPU to an idle one while, specifically in the case where
the target CPU will be running on a lower OPP.
This patch fixes these two issues by properly clamping the boost value
in the appropriate range to compute the threshold indexes as well as
by using the same threshold index for both cuts.
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Srinath Sridharan <srinathsr@google.com>
sched/tune: fix update of threshold index for boost groups
When SchedTune is configured to work with CGroup mode, each time we update
the boost value of a group we do not update the threshed indexes for the
definition of the Performance Boost (PC) and Performance Constraint (PC)
region. This means that while the OPP boosting and CPU biasing selection
is working as expected, the __schedtune_accept_deltas function is always
using the initial values for these cuts.
This patch ensure that each time a new boost value is configured for a
boost group, the cuts for the PB and PC region are properly updated too.
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Srinath Sridharan <srinathsr@google.com>
sched/tune: update PC and PB cuts definition
The current definition of Performance Boost (PB) and Performance
Constraint (PC) cuts defines two "dead regions":
- up to 20% boost: we are in energy-reduction only mode, i.e.
accept all candidate which reduce energy
- over 70% boost: we are in performance-increase only mode, i.e.
accept only sched candidate which do not reduce performances
This patch uses a more fine grained configuration where these two "dead
regions" are reduced to: up to 10% and over 90%.
This should allow to have some boosting benefits starting from 10% boost
values as well as not being to much permissive starting from boost values
of 80%.
Suggested-by: Leo Yan <leo.yan@linaro.org>
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Srinath Sridharan <srinathsr@google.com>
bug: 28312446
Change-Id: Ia326c66521e38c98e7a7eddbbb7c437875efa1ba
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
find_best_target CPU selection is biased towards lower CPU IDs. Bias
towards higher CPUs for boosted tasks. For boosted tasks unconditionally
use the idle CPU returned by find_best_target.
BUG: 29512132
Change-Id: I3d650051752163fcf3dc7909751d1fde3f9d17c0
Conflicts:
kernel/sched/fair.c
Contains:
sched/tune: fix accounting for runnable tasks (1/5)
The accounting for tasks into boost groups of different CPUs is currently
broken mainly because:
a) we do not properly track the change of boost group of a RUNNABLE task
b) there are race conditions between migration code and accounting code
This patch provides a fixes to ensure enqueue/dequeue
accounting also for throttled tasks.
Without this patch is can happen that a task is enqueued into a throttled
RQ thus not being accounted for the boosting of the corresponding RQ.
We could argue that a throttled task should not boost a CPU, however:
a) properly implementing CPU boosting considering throttled tasks will
increase a lot the complexity of the solution
b) it's not easy to quantify the benefits introduced by such a more
complex solution
Since task throttling requires the usage of the CFS bandwidth controller,
which is not widely used on mobile systems (at least not by Android kernels
so far), for the time being we go for the simple solution and boost also
for throttled RQs.
sched/tune: fix accounting for runnable tasks (2/5)
This patch provides the code required to enforce proper locking.
A per boost group spinlock has been added to grant atomic
accounting of tasks as well as to serialise enqueue/dequeue operations,
triggered by tasks migrations, with cgroups's attach/detach operations.
sched/tune: fix accounting for runnable tasks (3/5)
This patch adds cgroups {allow,can,cancel}_attach callbacks.
Since a task can be migrated between boost groups while it's running,
the CGroups's attach callbacks have been added to properly migrate
boost contributions of RUNNABLE tasks.
The RQ's lock is used to serialise enqueue/dequeue operations, triggered
by tasks migrations, with cgroups's attach/detach operations. While the
SchedTune's CPU lock is used to grant atrocity of the accounting within
the CPU.
NOTE: the current implementation does not allows a concurrent CPU migration
and CGroups change.
sched/tune: fix accounting for runnable tasks (4/5)
This fixes accounting for exiting tasks by adding a dedicated call early
in the do_exit() syscall, which disables SchedTune accounting as soon as a
task is flagged PF_EXITING.
This flag is set before the multiple dequeue/enqueue dance triggered
by cgroup_exit() which is useful only to inject useless tasks movements
thus increasing possibilities for race conditions with the migration code.
The schedtune_exit_task() call does the last dequeue of a task from its
current boost group. This is a solution more aligned with what happens in
mainline kernels (>v4.4) where the exit_cgroup does not move anymore a dying
task to the root control group.
sched/tune: fix accounting for runnable tasks (5/5)
To avoid accounting issues at startup, this patch disable the SchedTune
accounting until the required data structures have been properly
initialized.
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
[jstultz: fwdported to 4.4]
Signed-off-by: John Stultz <john.stultz@linaro.org>
With the introduction of initialization function required to compute the
energy normalization constants from DTB at boot time, we have now a
late_initcall which is already used by SchedTune.
This patch consolidate within that function the other initialization
bits which was previously deferred to the first CGroup creation.
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
[jstultz: fwdported to 4.4]
Signed-off-by: John Stultz <john.stultz@linaro.org>
The usage of conditional compiled code is discouraged in fair.c.
This patch clean up a bit fair.c by moving schedtune_{cpu.task}_boost
definitions into tune.h.
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>