commit 6bd58f09e1d8cc6c50a824c00bf0d617919986a1 upstream.
The timekeeping code does not currently provide a way to translate
externally provided clocksource cycles to system time. The cycle count
is always provided by the result of the clocksource read() method internal to
the timekeeping code. The added function timekeeping_cycles_to_ns()
calculates a nanosecond value from a cycle count; that value can be added to
the tk_read_base.base value, yielding the current system time. This allows
clocksource cycle values external to the timekeeping code to provide a
cycle count that can be transformed to system time.
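As a rough illustration (not the exact kernel code), the conversion follows the
usual mult/shift scheme of the timekeeping core; the sketch below uses the
tk_read_base fields directly and simplifies the types:

/*
 * Illustrative sketch only: convert an externally captured cycle count
 * to nanoseconds using the mult/shift factors kept in tk_read_base, so
 * the result can be added to tk_read_base.base.
 */
static u64 cycles_to_ns_sketch(const struct tk_read_base *tkr, u64 cycles)
{
	u64 delta = (cycles - tkr->cycle_last) & tkr->mask;

	return (delta * tkr->mult + tkr->xtime_nsec) >> tkr->shift;
}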
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: kevin.b.stanton@intel.com
Cc: kevin.j.clarke@intel.com
Cc: hpa@zytor.com
Cc: jeffrey.t.kirsher@intel.com
Cc: netdev@vger.kernel.org
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Christopher S. Hall <christopher.s.hall@intel.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The C-state aware scheduler takes note of the wakeup latency of each c-state
level to determine whether to pack tasks or wake up an LPM CPU. But it does
not distinguish between small and large deltas, as it would be inefficient
for the scheduler to do so on its critical path.
Disregard wakeup latency deltas of less than 64 us between different c-state
levels. This reduces unnecessary task packing.
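A minimal sketch of the idea, with hypothetical names (the actual HMP
scheduler code differs):

/*
 * Sketch only, hypothetical names: treat two c-state levels as equivalent
 * when their wakeup latencies differ by less than 64 us, so the scheduler
 * does not pack tasks just to avoid a marginally deeper c-state.
 */
#define CSTATE_LATENCY_GRANULARITY_US	64

static inline bool cstate_meaningfully_deeper(u32 wakeup_latency_us,
					      u32 ref_latency_us)
{
	return wakeup_latency_us >= ref_latency_us + CSTATE_LATENCY_GRANULARITY_US;
}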
CRs-fixed: 1074879
Change-Id: Ib0cadbd390d1a0b6da3e39c98010cedb43e5bf60
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
At present, the C-state aware scheduler uses the raw c-state index number as
its determinant and avoids task placement on CPUs in deeper c-states at the
cost of latency. However, there are CPUs offering comparable wake-up
latency at different c-state levels, and the wake-up latency of each
c-state level is already being fed to the scheduler.
Hence use wakeup_latency as the c-state determinant instead of the raw
c-state index to avoid unnecessary task packing where it is doable.
CRs-fixed: 1074879
Change-Id: If927f84f6c8ba719716d99669e5d1f1b19aaacbe
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
The check for NULL css is redundant as upper layers are already
making sure that css cannot be NULL. Remove this check. It helps
to silence static analysis errors as well.
Change-Id: I64585ff8cceb307904e20ff788e52eb05c000e1f
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
Unfortunately we record PIDs in audit records using a variety of
methods despite the correct way being the use of task_tgid_nr().
This patch converts all of these callers, except for the case of
AUDIT_SET in audit_receive_msg() (see the comment in the code).
Reported-by: Jeff Vander Stoep <jeffv@google.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Bug: 28952093
(cherry picked from commit fa2bea2f5cca5b8d4a3e5520d2e8c0ede67ac108)
Signed-off-by: Jeff Vander Stoep <jeffv@google.com>
Change-Id: If6645f9de8bc58ed9755f28dc6af5fbf08d72a00
The ENERGY_AWARE sched feature flag cannot be set unless
CONFIG_SCHED_DEBUG is enabled.
So this patch allows the flag to default to true at build time
if the corresponding config option is set.
Change-Id: I8835a571fdb7a8f8ee6a54af1e11a69f3b5ce8e6
Signed-off-by: John Stultz <john.stultz@linaro.org>
Calling printk() while a schedule is in progress can cause a deadlock and an
endless while(1) loop. The blocked state looks like this:
cpu0 (holds the console sem):
printk->console_unlock->up_sem->spin_lock(&sem->lock)->wake_up_process(cpu1)
->try_to_wake_up(cpu1)->while(p->on_cpu).
cpu1 (requests the console sem):
console_lock->down_sem->schedule->idle_balance->update_cpu_capacity->
printk->console_trylock->spin_lock(&sem->lock).
p->on_cpu stays 1 forever because the task is still running on cpu1, so cpu0
is blocked in while(p->on_cpu); cpu1 in turn cannot take
spin_lock(&sem->lock), so it is blocked as well, which means the task keeps
running on cpu1 forever.
Signed-off-by: Caesar Wang <wxt@rock-chips.com>
On at least one platform, the timer providing the wallclock can occasionally
be reset or go backwards for some time after wakeup.
Accept that this might happen and warn the first time, but otherwise just
carry on.
Change-Id: Id3164477ba79049561af7f0889cbeebc199ead4e
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
This is required to allow tasks to freely move between cgroups associated
with the tune controller.
Change-Id: I1f39b957462034586edc2fdc0a35488b314e9c8c
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
The schedtune controller will mimic the cpusets controller configuration
for now. For that we need to create 4 groups in addition to the root
group that is present by default.
Change-Id: I082f1e4e4ebf863e623cf66ee127eac70a3e2716
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
To support task performance boosting, the usage of a single knob has the
advantage of being a simple solution, both from the implementation and the
usability standpoint. However, on a real system it can be difficult to
identify a single value for the knob which fits the needs of multiple
different tasks. For example, some kernel threads and/or user-space
background services are better managed the "standard" way, while we
still want to be able to boost the performance of specific workloads.
In order to improve the flexibility of the task boosting mechanism, this
patch is the first of a small series which extends the previous
implementation to introduce "per task group" support.
This first patch introduces just the basic CGroups support: a new
"schedtune" CGroups controller is added which allows configuring a
different boost value for different groups of tasks.
To keep the implementation simple but still effective for a boosting
strategy, the new controller:
1. allows only a two layer hierarchy
2. supports only a limited number of boost groups
A two layer hierarchy allows each task to be placed either:
a) in the root control group
thus being subject to a system-wide boosting value
b) in a child of the root group
thus being subject to the specific boost value defined by that
"boost group"
The limited number of supported "boost groups" is mainly motivated by
the observation that in a real system it could be useful to have only a
few classes of tasks which deserve different treatment.
For example, background vs foreground or interactive vs low-priority.
As an additional benefit, a limited number of boost groups also allows
for a simpler implementation, especially for the code required to
compute the boost value for CPUs which have runnable tasks belonging to
different boost groups.
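A sketch of the controller's core bookkeeping as described above (simplified;
the in-tree schedtune code may differ in detail):

/*
 * Sketch of the per-group boost state: one instance per boost group,
 * embedded in the cgroup subsystem state.
 */
struct schedtune {
	struct cgroup_subsys_state css;	/* SchedTune CGroup subsystem state */
	int idx;			/* allocated boost group ID */
	int boost;			/* boost value for tasks in this group */
};

static inline struct schedtune *css_st(struct cgroup_subsys_state *css)
{
	return css ? container_of(css, struct schedtune, css) : NULL;
}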
Change-Id: I1304e33a8440bfdad9c8bcf8129ff390216f2e32
cc: Tejun Heo <tj@kernel.org>
cc: Li Zefan <lizefan@huawei.com>
cc: Johannes Weiner <hannes@cmpxchg.org>
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Git-commit: 13001f47c9
Git-repo: https://android.googlesource.com/kernel/common
Signed-off-by: Vikram Mulukutla <markivx@codeaurora.org>
commit 735f2770a770156100f534646158cb58cb8b2939 upstream.
Commit fec1d01152 ("[PATCH] Disable CLONE_CHILD_CLEARTID for abnormal
exit") has caused a subtle regression in nscd which uses
CLONE_CHILD_CLEARTID to clear the nscd_certainly_running flag in the
shared databases, so that the clients are notified when nscd is
restarted. Now, when nscd uses a non-persistent database, clients that
have it mapped keep thinking the database is being updated by nscd, when
in fact nscd has created a new (anonymous) one (for non-persistent
databases it uses an unlinked file as backend).
The original proposal for the CLONE_CHILD_CLEARTID change claimed
(https://lkml.org/lkml/2006/10/25/233):
: The NPTL library uses the CLONE_CHILD_CLEARTID flag on clone() syscalls
: on behalf of pthread_create() library calls. This feature is used to
: request that the kernel clear the thread-id in user space (at an address
: provided in the syscall) when the thread disassociates itself from the
: address space, which is done in mm_release().
:
: Unfortunately, when a multi-threaded process incurs a core dump (such as
: from a SIGSEGV), the core-dumping thread sends SIGKILL signals to all of
: the other threads, which then proceed to clear their user-space tids
: before synchronizing in exit_mm() with the start of core dumping. This
: misrepresents the state of process's address space at the time of the
: SIGSEGV and makes it more difficult for someone to debug NPTL and glibc
: problems (misleading him/her to conclude that the threads had gone away
: before the fault).
:
: The fix below is to simply avoid the CLONE_CHILD_CLEARTID action if a
: core dump has been initiated.
The resulting patch from Roland (https://lkml.org/lkml/2006/10/26/269)
seems to have a larger scope than the original patch asked for. It
seems that limiting the scope of the check to core dumping should work
for the SIGSEGV issue described above.
[Changelog partly based on Andreas' description]
Fixes: fec1d01152 ("[PATCH] Disable CLONE_CHILD_CLEARTID for abnormal exit")
Link: http://lkml.kernel.org/r/1471968749-26173-1-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Tested-by: William Preston <wpreston@suse.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@hack.frob.com>
Cc: Andreas Schwab <schwab@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit e7d316a02f683864a12389f8808570e37fb90aa3 upstream.
We have scripts which write to certain fields on 3.18 kernels but this
seems to be failing on 4.4 kernels. An entry which we write to here is
xfrm_aevent_rseqth which is u32.
echo 4294967295 > /proc/sys/net/core/xfrm_aevent_rseqth
Commit 230633d109 ("kernel/sysctl.c: detect overflows when converting
to int") prevented writing to sysctl entries when integer overflow
occurs. However, this does not apply to unsigned integers.
Heinrich suggested that we introduce a new option to handle 64 bit
limits and set min as 0 and max as UINT_MAX. This might not work as it
leads to issues similar to __do_proc_doulongvec_minmax. Alternatively,
we would need to change the datatype of the entry to 64 bit.
static int __do_proc_doulongvec_minmax(void *data, struct ctl_table *table, ...)
{
	i = (unsigned long *) data; // this cast reads beyond the size of data (u32)
	vleft = table->maxlen / sizeof(unsigned long); // vleft is 0 because maxlen is sizeof(u32), which is smaller than sizeof(unsigned long) on x86_64
Introduce a new proc handler proc_douintvec. Individual proc entries
will need to be updated to use the new handler.
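For example, an affected u32 entry would then be declared roughly as below
(illustrative only; the variable name and table are assumptions, the real
xfrm entry uses the per-netns data pointer):

/* Illustrative ctl_table entry: a u32 sysctl switched to proc_douintvec. */
static u32 example_aevent_rseqth;

static struct ctl_table example_table[] = {
	{
		.procname	= "xfrm_aevent_rseqth",
		.data		= &example_aevent_rseqth,
		.maxlen		= sizeof(u32),
		.mode		= 0644,
		.proc_handler	= proc_douintvec,	/* accepts values up to UINT_MAX */
	},
	{ }
};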
[akpm@linux-foundation.org: coding-style fixes]
Fixes: 230633d109 ("kernel/sysctl.c:detect overflows when converting to int")
Link: http://lkml.kernel.org/r/1471479806-5252-1-git-send-email-subashab@codeaurora.org
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit ae6c33ba6e37eea3012fe2640b22400ef3f2d0f3 upstream.
Commit bbeddf52ad ("printk: move braille console support into separate
braille.[ch] files") moved the parsing of braille-related options into
_braille_console_setup(), changing the type of variable str from char*
to char**. In this commit, memcmp(str, "brl,", 4) was correctly updated
to memcmp(*str, "brl,", 4) but not memcmp(str, "brl=", 4).
Update the code to make the "brl=" option work again, and replace memcmp()
with strncmp() so that the compiler can detect such an issue.
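After the change, the option parsing in _braille_console_setup() looks roughly
like the sketch below (simplified, not a verbatim diff):

	/* Sketch: both prefixes are now checked against *str with strncmp() */
	if (!strncmp(*str, "brl,", 4)) {
		*brl_options = "";
		*str += 4;
	} else if (!strncmp(*str, "brl=", 4)) {
		*brl_options = *str + 4;
		*str = strchr(*brl_options, ',');
	}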
Fixes: bbeddf52ad ("printk: move braille console support into separate braille.[ch] files")
Link: http://lkml.kernel.org/r/20160823165700.28952-1-nicolas.iooss_linux@m4x.org
Signed-off-by: Nicolas Iooss <nicolas.iooss_linux@m4x.org>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 2c81a6477081966fe80b8c6daa68459bca896774 upstream.
The following commit:
66eb579e66 ("perf: allow for PMU-specific event filtering")
added the pmu::filter_match() callback. This was intended to
avoid HW constraints on events from resulting in extremely
pessimistic scheduling.
However, pmu::filter_match() is only called for the leader of each event
group. When the leader is a SW event, we do not filter the groups, and
may fail at pmu::add() time, and when this happens we'll give up on
scheduling any event groups later in the list until they are rotated
ahead of the failing group.
This can result in extremely sub-optimal event scheduling behaviour,
e.g. if running the following on a big.LITTLE platform:
$ taskset -c 0 ./perf stat \
-e 'a57{context-switches,armv8_cortex_a57/config=0x11/}' \
-e 'a53{context-switches,armv8_cortex_a53/config=0x11/}' \
ls
   <not counted>      context-switches                               (0.00%)
   <not counted>      armv8_cortex_a57/config=0x11/                  (0.00%)
              24      context-switches                              (37.36%)
        57589154      armv8_cortex_a53/config=0x11/                 (37.36%)
Here the 'a53' event group was always eligible to be scheduled, but
the 'a57' group was never eligible, as the task was always
affine to a Cortex-A53 CPU. The SW (group leader) event in the 'a57'
group was eligible, but the HW event failed at pmu::add() time,
resulting in ctx_flexible_sched_in() giving up on scheduling further
groups with HW events.
One way of avoiding this is to check pmu::filter_match() on siblings
as well as the group leader. If any of these fail their
pmu::filter_match() call, we must skip the entire group before
attempting to add any events.
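The resulting group-wide check can be sketched roughly as below (simplified,
not a verbatim diff): the filter is applied to the leader and then to every
sibling in the group.

/*
 * Sketch: a group is only considered schedulable if the leader and all
 * of its siblings pass pmu::filter_match(); events whose PMU provides no
 * callback always match.
 */
static int __pmu_filter_match(struct perf_event *event)
{
	struct pmu *pmu = event->pmu;

	return pmu->filter_match ? pmu->filter_match(event) : 1;
}

static int pmu_filter_match(struct perf_event *event)
{
	struct perf_event *sibling;

	if (!__pmu_filter_match(event))
		return 0;

	list_for_each_entry(sibling, &event->sibling_list, group_entry) {
		if (!__pmu_filter_match(sibling))
			return 0;
	}

	return 1;
}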
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Fixes: 66eb579e66 ("perf: allow for PMU-specific event filtering")
Link: http://lkml.kernel.org/r/1465917041-15339-1-git-send-email-mark.rutland@arm.com
[ Small readability edits. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 28b89b9e6f7b6c8fef7b3af39828722bca20cfee upstream.
A discrepancy between cpu_online_mask and cpuset's effective_cpus
mask is inevitable during hotplug since cpuset defers updating of
effective_cpus mask using a workqueue, during which time nothing
prevents the system from more hotplug operations. For that reason
guarantee_online_cpus() walks up the cpuset hierarchy until it finds
an intersection under the assumption that top cpuset's effective_cpus
mask intersects with cpu_online_mask even with such a race occurring.
However a sequence of CPU hotplugs can open a time window, during which
none of the effective CPUs in the top cpuset intersect with
cpu_online_mask.
For example when there are 4 possible CPUs 0-3 and only CPU0 is online:
========================     ===========================
cpu_online_mask              top_cpuset.effective_cpus
========================     ===========================
echo 1 > cpu2/online.
CPU hotplug notifier woke up hotplug work but not yet scheduled.
[0,2]                        [0]

echo 0 > cpu0/online.
The workqueue is still runnable.
[2]                          [0]
========================     ===========================
Now there is no intersection between cpu_online_mask and
top_cpuset.effective_cpus. Thus invoking sys_sched_setaffinity() at
this moment can cause the following:
Unable to handle kernel NULL pointer dereference at virtual address 000000d0
------------[ cut here ]------------
Kernel BUG at ffffffc0001389b0 [verbose debug info unavailable]
Internal error: Oops - BUG: 96000005 [#1] PREEMPT SMP
Modules linked in:
CPU: 2 PID: 1420 Comm: taskset Tainted: G W 4.4.8+ #98
task: ffffffc06a5c4880 ti: ffffffc06e124000 task.ti: ffffffc06e124000
PC is at guarantee_online_cpus+0x2c/0x58
LR is at cpuset_cpus_allowed+0x4c/0x6c
<snip>
Process taskset (pid: 1420, stack limit = 0xffffffc06e124020)
Call trace:
[<ffffffc0001389b0>] guarantee_online_cpus+0x2c/0x58
[<ffffffc00013b208>] cpuset_cpus_allowed+0x4c/0x6c
[<ffffffc0000d61f0>] sched_setaffinity+0xc0/0x1ac
[<ffffffc0000d6374>] SyS_sched_setaffinity+0x98/0xac
[<ffffffc000085cb0>] el0_svc_naked+0x24/0x28
The top cpuset's effective_cpus are guaranteed to be identical to
cpu_online_mask eventually. Hence fall back to cpu_online_mask when
there is no intersection between top cpuset's effective_cpus and
cpu_online_mask.
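The resulting logic is roughly the following sketch (simplified from the fix):
keep walking up the hierarchy, and if the walk runs out of parents, fall back
to cpu_online_mask.

/*
 * Sketch of the fallback: if no ancestor cpuset has effective CPUs that
 * intersect cpu_online_mask, just use cpu_online_mask itself.
 */
static void guarantee_online_cpus(struct cpuset *cs, struct cpumask *pmask)
{
	while (!cpumask_intersects(cs->effective_cpus, cpu_online_mask)) {
		cs = parent_cs(cs);
		if (unlikely(!cs)) {
			/* The top cpuset has no online CPU at the moment */
			cpumask_copy(pmask, cpu_online_mask);
			return;
		}
	}
	cpumask_and(pmask, cs->effective_cpus, cpu_online_mask);
}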
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: cgroups@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The current (CFS) scheduler implementation does not allow "boosting"
task performance by running tasks at a higher OPP than the
minimum required to meet their workload demands.
To support task performance boosting, the scheduler should provide a
"knob" which allows tuning how much the system is going to be optimised
for energy efficiency vs performance.
This patch is the first of a series which provides a simple interface to
define a tuning knob. One system-wide "boost" tunable is exposed via:
/proc/sys/kernel/sched_cfs_boost
which can be configured in the range [0..100], to define a percentage
where:
- 0% boost requires operating in "standard" mode, by scheduling
  tasks at the minimum capacities required by the workload demand
- 100% boost requires pushing task performance to the maximum,
  "regardless" of the incurred energy consumption
A boost value in between these two boundaries is used to bias the
power/performance trade-off; the higher the boost value, the more the
scheduler is biased toward performance boosting instead of energy
efficiency.
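A minimal sketch of how such a knob can be wired up as a sysctl (illustrative
only; the bounds variables and the choice of handler are assumptions, not the
exact patch):

/* Illustrative sysctl wiring for a [0..100] boost knob. */
static int zero;
static int one_hundred = 100;
int sysctl_sched_cfs_boost;

static struct ctl_table sched_boost_table[] = {
	{
		.procname	= "sched_cfs_boost",
		.data		= &sysctl_sched_cfs_boost,
		.maxlen		= sizeof(sysctl_sched_cfs_boost),
		.mode		= 0644,
		.proc_handler	= proc_dointvec_minmax,
		.extra1		= &zero,		/* 0%: "standard", energy-efficient mode */
		.extra2		= &one_hundred,		/* 100%: maximum performance boost */
	},
	{ }
};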
Change-Id: I59a41725e2d8f9238a61dfb0c909071b53560fc0
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Git-commit: 63c8fad2b06805ef88f1220551289f0a3c3529f1
Git-repo: https://source.codeaurora.org/quic/la/kernel/msm-4.4
Signed-off-by: Vikram Mulukutla <markivx@codeaurora.org>
This ensures that the load balancer always works correctly even
without compiler optimizations.
Change-Id: I36408ae65833b624401e60edfb50c19cc061d7bf
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
The CPU controller hasn't kept up with the various changes in the whole
cgroup initialization / destruction sequence, and commit:
2e91fa7f6d ("cgroup: keep zombies associated with their original cgroups")
caused it to explode.
The reason for this is that zombies do not inhibit css_offline() from
being called, but do stall css_released(). Now we tear down the cfs_rq
structures on css_offline() but zombies can run after that, leading to
use-after-free issues.
The solution is to move the tear-down to css_released(), which
guarantees nobody (including no zombies) is still using our cgroup.
Furthermore, a few simple cleanups are possible too. There doesn't
appear to be any point in us using css_online() (anymore?), so fold that
into css_alloc().
And since cgroup code guarantees an RCU grace period between
css_released() and css_free() we can forgo using call_rcu() and free the
stuff immediately.
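The resulting callback layout looks roughly like the sketch below (simplified
from the fix):

/*
 * Sketch: tear down in css_released(), after which nobody (including
 * zombies) can still use the group; free in css_free(), relying on the
 * RCU grace period the cgroup core guarantees between the two.
 */
static void cpu_cgroup_css_released(struct cgroup_subsys_state *css)
{
	struct task_group *tg = css_tg(css);

	sched_offline_group(tg);
}

static void cpu_cgroup_css_free(struct cgroup_subsys_state *css)
{
	struct task_group *tg = css_tg(css);

	/* Relies on the RCU grace period between css_released() and this. */
	sched_free_group(tg);
}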
Change-Id: I51af3d4f0e5dd1c9df6375cce4bb933f67f1022e
Suggested-by: Tejun Heo <tj@kernel.org>
Reported-by: Kazuki Yamaguchi <k@rhe.jp>
Reported-by: Niklas Cassel <niklas.cassel@axis.com>
Tested-by: Niklas Cassel <niklas.cassel@axis.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 2e91fa7f6d ("cgroup: keep zombies associated with their original cgroups")
Link: http://lkml.kernel.org/r/20160316152245.GY6344@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Git-commit: 2f5177f0fd7e531b26d54633be62d1d4cb94621c
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Signed-off-by: Satya Durga Srinivasu Prabhala <satyap@codeaurora.org>
When a cgroup's CPU runqueue is destroyed, it should remove its
remaining load accounting from its parent cgroup.
The current site for doing so is unsuited because it is far too late and
unordered against other cgroup removal (->css_free() will be, but we're also
in an RCU callback).
Put it in the ->css_offline() callback, which is the start of cgroup
destruction, right after the group has been made unavailable to
userspace. The ->css_offline() callbacks are called in hierarchical order
after the following v4.4 commit:
aa226ff4a1ce ("cgroup: make sure a parent css isn't offlined before its children")
Change-Id: Ice7cbd71d9e545da84d61686aa46c7213607bb9d
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20160121212416.GL6357@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Git-commit: 6fe1f348b3dd1f700f9630562b7d38afd6949568
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Signed-off-by: Satya Durga Srinivasu Prabhala <satyap@codeaurora.org>
"int" type is used to hold the time difference between the successive
updates to nr_run in sched_update_nr_prod(). This can result in
overflow, if the function is called ~2.15 sec after it was called
before. The most probable scenarios are when CPU is idle and
hotplugged. But as we update the last_time of all possible CPUs in
sched_get_nr_running_avg() periodically from a deferrable timer context
(core_ctl module), this overflow is observed only when the system is
completely idle for long time. When this overflow happens we hit
a BUG_ON() in sched_get_nr_running_avg().
Use "u64" type instead of "int" for holding the time difference and
add additional BUG_ON() to catch the instances where sched_clock()
returns a backward value.
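A sketch of the intended change, with hypothetical names (not the actual
sched_update_nr_prod() code):

/* Sketch only, hypothetical helper: compute a non-negative time delta in ns. */
static u64 nr_prod_time_delta(u64 last_time)
{
	u64 curr_time = sched_clock();

	BUG_ON(curr_time < last_time);	/* catch sched_clock() going backwards */
	return curr_time - last_time;	/* u64: no overflow after ~2.15 s */
}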
Change-Id: I284abb5889ceb8cf9cc689c79ed69422a0e74986
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
The HMP scheduler has two types of task placement boost policies.
(1) The boost-on-big policy makes use of all big CPUs up to their full capacity
before using the little CPUs. This improves performance on true b.L systems
where the big CPUs have higher efficiency compared to the little CPUs.
(2) The boost-on-all policy places tasks on the CPU having the highest
spare capacity. This policy is optimal for SMP-like systems.
The scheduler sets the boost policy to boost-on-big on systems which have
CPUs of different efficiencies. However, it is possible for CPUs of the
same microarchitecture to have slight differences in efficiency due to
other factors like cache size. Selecting the boost-on-big policy based
on the relative difference in efficiency is not optimal on such systems.
The boost-policy device tree property is introduced to specify the
required boost type and it overrides the default selection of boost
type in the scheduler. The possible values for this property are
"boost-on-big" and "boost-on-all".
Change-Id: Iac19183fa7d4bfd9e5746b02a02b2b19cf64b78d
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
Add a stub function for init_cluster() and remove the SCHED_HMP
#ifdeffery in sched_init().
Change-Id: I6745485152d735436d8398818f7fb5e70ce5ee65
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
The current policy prefers to select an idle CPU in the waker
cluster over the waker CPU running only 1 task. By selecting
an idle CPU, it eliminates the chance of the waker migrating to a
different CPU after the wakee preempts it. This policy is also not
susceptible to incorrect "sync" usage, i.e. when the waker does not
go to sleep after waking up the wakee.
However, the LPM exit latency associated with an idle CPU outweighs the
above benefits on some targets. So add a knob to prefer the waker
CPU having only 1 runnable task over idle CPUs in the waker cluster.
Change-Id: Id974748c07625c1b19112235f426a5d204dfdb33
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
When cycle_counter is used to estimate the frequency, calling
update_task_ravg() twice on the same task without refreshing
the wallclock results in a division by zero bug. Add a safety
check in update_task_ravg() to prevent this.
The above bug is hit from __schedule() when next == prev. There
is no need to call update_task_ravg() twice for PUT_PREV_TASK
and PICK_NEXT_TASK events for the same task. Calling
update_task_ravg() with TASK_UPDATE event is sufficient.
Change-Id: Ib3af9004f2462618c535b8195377bedb584d0261
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
After the introduction of "33c24b sched: add cpu isolation support",
select_fallback_rq() might sometimes be unable to find any CPU to place
a task on. This happens when all online CPUs are isolated and
the allow-isolated flag is set to false. In such cases, we have
little choice but to use an isolated CPU and wait for core control
to eventually un-isolate one or more online CPUs.
Change-Id: Id8738bd8493c11731c5491efcc99eb90f051233e
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
A timer might be running when we are trying to move it to another
CPU, so ensure that we wait for the timer to finish before migrating it.
Change-Id: I4c9ee39c715baebfbdb8a50476a475e38b092f70
Signed-off-by: Olav Haugan <ohaugan@codeaurora.org>
commit 924d8696751c4b9e58263bc82efdafcf875596a6 upstream.
rtree_next_node() walks the linked list of leaf nodes to find the next
block of pages in the struct memory_bitmap. If it walks off the end of
the list of nodes, it walks the list of memory zones to find the next
region of memory. If it walks off the end of the list of zones, it
returns false.
This leaves the struct bm_position's node and zone pointers pointing
at their respective struct list_heads in struct mem_zone_bm_rtree.
memory_bm_find_bit() uses struct bm_position's node and zone pointers
to avoid walking lists and trees if the next bit appears in the same
node/zone. It handles these values being stale.
Swap rtree_next_node()'s 'step then test' to 'test-next then step';
this means that if we reach the end of memory we return false and leave
the node and zone pointers as they were.
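After the change, rtree_next_node() looks roughly like the sketch below
(simplified): each candidate step is tested before the position is advanced,
so reaching the end leaves bm->cur untouched.

/* Sketch of the 'test-next then step' order (simplified). */
static bool rtree_next_node(struct memory_bitmap *bm)
{
	if (!list_is_last(&bm->cur.node->list, &bm->cur.zone->leaves)) {
		bm->cur.node = list_entry(bm->cur.node->list.next,
					  struct rtree_node, list);
		bm->cur.node_pfn += BM_BITS_PER_BLOCK;
		bm->cur.node_bit = 0;
		return true;
	}

	/* No more nodes, go to the next zone */
	if (!list_is_last(&bm->cur.zone->list, &bm->zones)) {
		bm->cur.zone = list_entry(bm->cur.zone->list.next,
					  struct mem_zone_bm_rtree, list);
		bm->cur.node = list_entry(bm->cur.zone->leaves.next,
					  struct rtree_node, list);
		bm->cur.node_pfn = 0;
		bm->cur.node_bit = 0;
		return true;
	}

	/* No more zones: leave bm->cur.node and bm->cur.zone as they were */
	return false;
}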
This fixes a panic on resume using AMD Seattle with 64K pages:
[ 6.868732] Freezing user space processes ... (elapsed 0.000 seconds) done.
[ 6.875753] Double checking all user space processes after OOM killer disable... (elapsed 0.000 seconds)
[ 6.896453] PM: Using 3 thread(s) for decompression.
[ 6.896453] PM: Loading and decompressing image data (5339 pages)...
[ 7.318890] PM: Image loading progress: 0%
[ 7.323395] Unable to handle kernel paging request at virtual address 00800040
[ 7.330611] pgd = ffff000008df0000
[ 7.334003] [00800040] *pgd=00000083fffe0003, *pud=00000083fffe0003, *pmd=00000083fffd0003, *pte=0000000000000000
[ 7.344266] Internal error: Oops: 96000005 [#1] PREEMPT SMP
[ 7.349825] Modules linked in:
[ 7.352871] CPU: 2 PID: 1 Comm: swapper/0 Tainted: G W I 4.8.0-rc1 #4737
[ 7.360512] Hardware name: AMD Overdrive/Supercharger/Default string, BIOS ROD1002C 04/08/2016
[ 7.369109] task: ffff8003c0220000 task.stack: ffff8003c0280000
[ 7.375020] PC is at set_bit+0x18/0x30
[ 7.378758] LR is at memory_bm_set_bit+0x24/0x30
[ 7.383362] pc : [<ffff00000835bbc8>] lr : [<ffff0000080faf18>] pstate: 60000045
[ 7.390743] sp : ffff8003c0283b00
[ 7.473551]
[ 7.475031] Process swapper/0 (pid: 1, stack limit = 0xffff8003c0280020)
[ 7.481718] Stack: (0xffff8003c0283b00 to 0xffff8003c0284000)
[ 7.800075] Call trace:
[ 7.887097] [<ffff00000835bbc8>] set_bit+0x18/0x30
[ 7.891876] [<ffff0000080fb038>] duplicate_memory_bitmap.constprop.38+0x54/0x70
[ 7.899172] [<ffff0000080fcc40>] snapshot_write_next+0x22c/0x47c
[ 7.905166] [<ffff0000080fe1b4>] load_image_lzo+0x754/0xa88
[ 7.910725] [<ffff0000080ff0a8>] swsusp_read+0x144/0x230
[ 7.916025] [<ffff0000080fa338>] load_image_and_restore+0x58/0x90
[ 7.922105] [<ffff0000080fa660>] software_resume+0x2f0/0x338
[ 7.927752] [<ffff000008083350>] do_one_initcall+0x38/0x11c
[ 7.933314] [<ffff000008b40cc0>] kernel_init_freeable+0x14c/0x1ec
[ 7.939395] [<ffff0000087ce564>] kernel_init+0x10/0xfc
[ 7.944520] [<ffff000008082e90>] ret_from_fork+0x10/0x40
[ 7.949820] Code: d2800022 8b400c21 f9800031 9ac32043 (c85f7c22)
[ 7.955909] ---[ end trace 0024a5986e6ff323 ]---
[ 7.960529] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
Here struct mem_zone_bm_rtree's start_pfn has been returned instead of
struct rtree_node's addr as the node/zone pointers are corrupt after
we walked off the end of the lists during mark_unsafe_pages().
This behaviour was exposed by commit 6dbecfd345a6 ("PM / hibernate:
Simplify mark_unsafe_pages()"), which caused mark_unsafe_pages() to call
duplicate_memory_bitmap(), which uses memory_bm_find_bit() after walking
off the end of the memory bitmap.
Fixes: 3a20cb1779 (PM / Hibernate: Implement position keeping in radix tree)
Signed-off-by: James Morse <james.morse@arm.com>
[ rjw: Subject ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 62822e2ec4ad091ba31f823f577ef80db52e3c2c upstream.
Restore the processor state before calling any other functions to
ensure per-CPU variables can be used with KASLR memory randomization.
Tracing functions use per-CPU variables (GS-based on x86) and one was
called just before the processor state was fully restored. This resulted
in a double fault when both the tracing and the exception handler
functions tried to use a per-CPU variable.
Fixes: bb3632c610 (PM / sleep: trace events for suspend/resume)
Reported-and-tested-by: Borislav Petkov <bp@suse.de>
Reported-by: Jiri Kosina <jikos@kernel.org>
Tested-by: Rafael J. Wysocki <rafael@kernel.org>
Tested-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Thomas Garnier <thgarnie@google.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 1245800c0f96eb6ebb368593e251d66c01e61022 upstream.
The iter->seq can be reset outside the protection of the mutex. So can
reading of user data. Move the mutex up to the beginning of the function.
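The shape of the change is roughly the following sketch (simplified): take
iter->mutex before touching iter->seq or copying data to user space.

	/*
	 * Sketch (simplified): acquire the mutex first, so that iter->seq
	 * cannot be reset and user data cannot be read concurrently.
	 */
	mutex_lock(&iter->mutex);

	/* return any leftover data */
	sret = trace_seq_to_user(&iter->seq, ubuf, cnt);
	if (sret != -EBUSY)
		goto out;	/* the "out" label drops iter->mutex before returning */

	trace_seq_init(&iter->seq);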
Fixes: d7350c3f45 ("tracing/core: make the read callbacks reentrants")
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>