Commit graph

22582 commits

Author SHA1 Message Date
Jann Horn
74b23f79f1 fs/coredump: prevent fsuid=0 dumps into user-controlled directories
commit 378c6520e7d29280f400ef2ceaf155c86f05a71a upstream.

This commit fixes the following security hole affecting systems where
all of the following conditions are fulfilled:

 - The fs.suid_dumpable sysctl is set to 2.
 - The kernel.core_pattern sysctl's value starts with "/". (Systems
   where kernel.core_pattern starts with "|/" are not affected.)
 - Unprivileged user namespace creation is permitted. (This is
   true on Linux >=3.8, but some distributions disallow it by
   default using a distro patch.)

Under these conditions, if a program executes under secure exec rules,
causing it to run with the SUID_DUMP_ROOT flag, then unshares its user
namespace, changes its root directory and crashes, the coredump will be
written using fsuid=0 and a path derived from kernel.core_pattern - but
this path is interpreted relative to the root directory of the process,
allowing the attacker to control where a coredump will be written with
root privileges.

To fix the security issue, always interpret core_pattern for dumps that
are written under SUID_DUMP_ROOT relative to the root directory of init.
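
For illustration only, a minimal userspace sketch that checks the first two
preconditions listed above; the procfs paths are the standard sysctl files,
while the helper program itself is hypothetical and not part of the patch:

/* check_coredump_cfg.c - hypothetical helper that reports whether the two
 * sysctl preconditions described above are present on this system. */
#include <stdio.h>
#include <string.h>

static void read_line(const char *path, char *buf, int len)
{
	FILE *f = fopen(path, "r");

	buf[0] = '\0';
	if (!f)
		return;
	if (fgets(buf, len, f))
		buf[strcspn(buf, "\n")] = '\0';	/* strip trailing newline */
	fclose(f);
}

int main(void)
{
	char dumpable[16], pattern[256];

	read_line("/proc/sys/fs/suid_dumpable", dumpable, sizeof(dumpable));
	read_line("/proc/sys/kernel/core_pattern", pattern, sizeof(pattern));

	printf("fs.suid_dumpable    = %s\n", dumpable);
	printf("kernel.core_pattern = %s\n", pattern);

	/* affected only when suid_dumpable is 2 and core_pattern is an
	 * absolute path rather than a "|/..." pipe helper */
	if (!strcmp(dumpable, "2") && pattern[0] == '/')
		printf("configuration matches the affected setup described above\n");
	return 0;
}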

Signed-off-by: Jann Horn <jann@thejh.net>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-12 09:08:58 -07:00
Tejun Heo
36591ef19a cgroup: ignore css_sets associated with dead cgroups during migration
commit 2b021cbf3cb6208f0d40fd2f1869f237934340ed upstream.

Before 2e91fa7f6d ("cgroup: keep zombies associated with their
original cgroups"), all dead tasks were associated with init_css_set.
If a zombie task is requested for migration, while migration prep
operations would still be performed on init_css_set, the actual
migration would ignore zombie tasks.  As init_css_set is always valid,
this worked fine.

However, after 2e91fa7f6d, zombie tasks stay with the css_set they were
associated with at the time of death.  Let's say a task T is associated
with cgroup A on hierarchy H-1 and cgroup B on hierarchy H-2.  After T
becomes a zombie, it would still remain associated with A and B.  If A
only contains zombie tasks, it can be removed.  On removal, A gets
marked offline but stays pinned until all zombies are drained.  At
this point, if migration is initiated on T to a cgroup C on hierarchy
H-2, migration path would try to prepare T's css_set for migration and
trigger the following.

 WARNING: CPU: 0 PID: 1576 at kernel/cgroup.c:474 cgroup_get+0x121/0x160()
 CPU: 0 PID: 1576 Comm: bash Not tainted 4.4.0-work+ #289
 ...
 Call Trace:
  [<ffffffff8127e63c>] dump_stack+0x4e/0x82
  [<ffffffff810445e8>] warn_slowpath_common+0x78/0xb0
  [<ffffffff810446d5>] warn_slowpath_null+0x15/0x20
  [<ffffffff810c33e1>] cgroup_get+0x121/0x160
  [<ffffffff810c349b>] link_css_set+0x7b/0x90
  [<ffffffff810c4fbc>] find_css_set+0x3bc/0x5e0
  [<ffffffff810c5269>] cgroup_migrate_prepare_dst+0x89/0x1f0
  [<ffffffff810c7547>] cgroup_attach_task+0x157/0x230
  [<ffffffff810c7a17>] __cgroup_procs_write+0x2b7/0x470
  [<ffffffff810c7bdc>] cgroup_tasks_write+0xc/0x10
  [<ffffffff810c4790>] cgroup_file_write+0x30/0x1b0
  [<ffffffff811c68fc>] kernfs_fop_write+0x13c/0x180
  [<ffffffff81151673>] __vfs_write+0x23/0xe0
  [<ffffffff81152494>] vfs_write+0xa4/0x1a0
  [<ffffffff811532d4>] SyS_write+0x44/0xa0
  [<ffffffff814af2d7>] entry_SYSCALL_64_fastpath+0x12/0x6f

It doesn't make sense to prepare migration for css_sets pointing to
dead cgroups as they are guaranteed to contain only zombies which are
ignored later during migration.  This patch makes cgroup destruction
path mark all affected css_sets as dead and updates the migration path
to ignore them during preparation.

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 2e91fa7f6d ("cgroup: keep zombies associated with their original cgroups")
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-12 09:08:54 -07:00
Joshua Hunt
5f4a82d5e3 watchdog: don't run proc_watchdog_update if new value is same as old
commit a1ee1932aa6bea0bb074f5e3ced112664e4637ed upstream.

While working on a script to restore all sysctl params before a series of
tests I found that writing any value into the
/proc/sys/kernel/{nmi_watchdog,soft_watchdog,watchdog,watchdog_thresh}
files causes them to call proc_watchdog_update().

  NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
  NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
  NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
  NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.

There doesn't appear to be a reason for doing this work every time a write
occurs, so only do it when the values change.
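
A minimal sketch of the idea, with hypothetical names standing in for the
real proc handler: remember the old value and skip the expensive update when
nothing changed.

/* Hypothetical model of the fix: only trigger the (expensive) watchdog
 * reconfiguration when the written value actually differs from the old one. */
static int watchdog_param;		/* stands in for e.g. watchdog_enabled */

static void proc_watchdog_update(void)
{
	/* in the real kernel this restarts/reconfigures the watchdog threads */
}

static int watchdog_param_write(int new_value)
{
	int old = watchdog_param;

	watchdog_param = new_value;
	if (new_value != old)		/* same value written -> nothing to do */
		proc_watchdog_update();
	return 0;
}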

Signed-off-by: Josh Hunt <johunt@akamai.com>
Acked-by: Don Zickus <dzickus@redhat.com>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Cc: Ulrich Obergfell <uobergfe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-12 09:08:54 -07:00
Chris Friesen
af080e5802 sched/cputime: Fix steal_account_process_tick() to always return jiffies
commit f9c904b7613b8b4c85b10cd6b33ad41b2843fa9d upstream.

The callers of steal_account_process_tick() expect it to return
whether a jiffy should be considered stolen or not.

Currently the return value of steal_account_process_tick() is in
units of cputime, which vary between either jiffies or nsecs
depending on CONFIG_VIRT_CPU_ACCOUNTING_GEN.

If cputime has nsecs granularity and there is a tiny amount of
stolen time (a few nsecs, say) then we will consider the entire
tick stolen and will not account the tick on user/system/idle,
causing /proc/stat to show invalid data.

The fix is to change steal_account_process_tick() to accumulate
the stolen time and only account it once it's worth a jiffy.
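
A self-contained sketch of the accumulation idea (the names and the HZ value
are assumptions): only whole jiffies of stolen time are accounted, and the
sub-jiffy remainder is carried forward.

/* Model of the fix: accumulate stolen nanoseconds and report a tick as
 * stolen only once a full jiffy's worth has built up. */
#define NSEC_PER_JIFFY 10000000ULL		/* assumes HZ = 100 for the example */

static unsigned long long accounted_steal_ns;	/* stolen time already accounted */

unsigned long steal_to_account(unsigned long long steal_clock_ns)
{
	unsigned long long delta = steal_clock_ns - accounted_steal_ns;
	unsigned long whole_jiffies = (unsigned long)(delta / NSEC_PER_JIFFY);

	/* advance only by what was accounted; the remainder stays pending */
	accounted_steal_ns += (unsigned long long)whole_jiffies * NSEC_PER_JIFFY;
	return whole_jiffies;	/* 0 means "this tick was not stolen" */
}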

(Thanks to Frederic Weisbecker for suggestions to fix a bug in my
first version of the patch.)

Signed-off-by: Chris Friesen <chris.friesen@windriver.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/56DBBDB8.40305@mail.usask.ca
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-12 09:08:35 -07:00
Alexander Shishkin
37014e0c5c perf/core: Fix perf_sched_count derailment
commit 927a5570855836e5d5859a80ce7e91e963545e8f upstream.

The error path in perf_event_open() is such that asking for a sampling
event on a PMU that doesn't generate interrupts will end up in dropping
the perf_sched_count even though it hasn't been incremented for this
event yet.

Given a sufficient amount of these calls, we'll end up disabling
scheduler's jump label even though we'd still have active events in the
system, thereby facilitating the arrival of the infernal regions upon us.

I'm fixing this by moving account_event() inside perf_event_alloc().

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: vince@deater.net
Link: http://lkml.kernel.org/r/1456917854-29427-1-git-send-email-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-12 09:08:34 -07:00
Guenter Roeck
860df91e2a PM / suspend: Add dependency on RTC_LIB
Commit 1eff8f99f9 ("PM / Suspend: Print wall time at suspend entry and
exit") calls rtc_time_to_tm(), which in turn calls rtc_time64_to_tm().
Since RTC_LIB is not mandatory for all architectures, this can result in
the following build error.

suspend.c:(.text+0x2f36c): undefined reference to `rtc_time64_to_tm'

rtc_time64_to_tm() is implemented in rtc-lib, so SUSPEND now needs to
select RTC_LIB.

Fixes: 1eff8f99f9 ("PM / Suspend: Print wall time at suspend entry and exit")
Signed-off-by: Guenter Roeck <groeck@chromium.org>
2016-04-07 16:49:57 +05:30
Brian Norris
49f63e1559 ANDROID: kernel/watchdog: fix unused variable warning
kernel/watchdog.c:122:22: warning: ‘hardlockup_allcpu_dumped’ defined but not used [-Wunused-variable]

Change-Id: I99e97e7cc31b589cd674fd4495832c9ef036d0b9
Signed-off-by: Brian Norris <briannorris@google.com>
2016-04-07 16:49:55 +05:30
Prasad Sodagudi
efea890321 genirq: call cancel_work_sync from irq_set_affinity_notifier
Whenever the IRQ affinity notifier is changed, call
cancel_work_sync() from irq_set_affinity_notifier() to cancel
all pending work and avoid work list corruption.
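
A rough model of the change (the structure and helper below are hypothetical
stand-ins, not the actual genirq code): flush the old notifier's pending work
before releasing it.

/* Sketch: replacing an affinity notifier must not leave its queued work
 * item pending, so cancel it synchronously before dropping the old one. */
struct affinity_notify {		/* hypothetical stand-in structure */
	struct work_struct work;
	void (*release)(struct affinity_notify *n);
};

static void set_affinity_notifier(struct affinity_notify **slot,
				  struct affinity_notify *new_notify)
{
	struct affinity_notify *old = *slot;

	*slot = new_notify;
	if (old) {
		cancel_work_sync(&old->work);	/* wait for in-flight notification */
		old->release(old);
	}
}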

Change-Id: I1f093bcc43be8c6696bad29250e4926cbc6c4029
Signed-off-by: Prasad Sodagudi <psodagud@codeaurora.org>
2016-03-25 16:02:35 -07:00
Pavankumar Kondeti
bd887e4a58 sched: fix circular dependency of rq->lock and kswapd waitqueue lock
There is a deadlock scenario due to the circular dependency of CPU's
rq->lock and kswapd's waitqueue lock.

(1) when kswapd is woken up, try_to_wake_up() is called with its
waitqueue lock held. Its previous CPU is offline, so it is woken
up on a different CPU. We try to acquire the offline CPU's rq->lock
in either the cpufreq change callback or fixup_busy_time().

(2) At the same time, the offline CPU is coming online and init_idle()
is called from __cpu_up(). init_idle() calls __sched_fork() with
rq->lock held. A debug object allocation in hrtimer_init() called
from __sched_fork() is trying to wakeup the kswapd and attempts to
take the waitqueue lock held in the (1) path.

Task specific initialization is done in __sched_fork() and rq->lock
is not held when it is called for other tasks. The same holds true for
the idle task as well. __sched_fork() for the idle task is called only
when the CPU is not active.

Acquire the rq->lock after calling __sched_fork() in init_idle()
to fix this deadlock.

CRs-Fixed: 965873
Change-Id: Ib8a265835c29861dba571c9b2a6b7e75b5cb43ee
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
[satyap: trivial merge conflicts resolution and omitted changes for QHMP]
Signed-off-by: Satya Durga Srinivasu Prabhala <satyap@codeaurora.org>
2016-03-23 21:30:42 -07:00
Joonwoo Park
375d7195fc sched: move migration notification out of spinlock
Commit 5e16bbc2fb ("sched: Streamline the task migration locking
a little") hardened the task migration locking and now __migrate_task() is
called with the rq lock held.  Move the notification out of the spinlock.

Change-Id: I553adcfe80d5c670f4ddf83438226fd5e0924fe8
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2016-03-23 21:25:25 -07:00
Joonwoo Park
d96fdc91d1 sched: fix compile failure with !CONFIG_SCHED_HMP
Fix various compilation failures when CONFIG_SCHED_HMP or
CONFIG_SCHED_INPUT isn't enabled.

Change-Id: I385dd37cfd778919f54f606bc13bebedd2fb5b9e
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2016-03-23 21:25:24 -07:00
Joonwoo Park
16ecb20600 sched: restrict sync wakee placement bias with waker's demand
Biasing a sync wakee task towards the waker CPU's cluster makes sense when
the waker's demand is high enough that the wakee can also take advantage
of the high CPU frequency voted because of the waker's load.  Placing a sync
wakee on a low demand waker's CPU can lead to placement imbalance, which in
turn can cause unnecessary migrations.

Introduce a new tunable "sched_big_waker_task_load" that defines the big
waker so scheduler avoid wakee on waker's cluster bias when the waker's
load is below the tunable.

CRs-fixed: 971295
Change-Id: I1550ede0a71ac8c9be74a7daabe164c6a269a3fb
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
[joonwoop@codeaurora.org: fixed a minor conflict in
 include/linux/sched/sysctl.h.]
2016-03-23 21:25:23 -07:00
Joonwoo Park
616e04a51c sched: add preference for waker cluster CPU in wakee task placement
If a sync wakee task's demand is small, it is worth placing the wakee task
on the waker's cluster for better performance: waker and wakee are
correlated, so the wakee should take advantage of the waker cluster's
frequency, which is voted for by the waker, along with the cache locality
benefit.  While biasing towards the waker's cluster we want to avoid the
waker CPU as much as possible, since placing the wakee on the waker's CPU
can get the waker preempted and migrated by the load balancer.

Introduce a new tunable 'sched_small_wakee_task_load' that identifies
eligible small wakee tasks and place those tasks on the waker's
cluster.

CRs-fixed: 971295
Change-Id: I96897d9a72a6f63dca4986d9219c2058cd5a7916
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
[joonwoop@codeaurora.org: fixed a minor conflict in
 include/linux/sched/sysctl.h.]
2016-03-23 21:25:22 -07:00
Olav Haugan
b29f9a7a84 sched/core: Add protection against null-pointer dereference
p->grp is being accessed outside of the lock, which can cause a null-pointer
dereference. Fix this and also add an RCU critical section around accesses
to this data structure.

CRs-fixed: 985379
Change-Id: Ic82de6ae2821845d704f0ec18046cc6a24f98e39
Signed-off-by: Olav Haugan <ohaugan@codeaurora.org>
[joonwoop@codeaurora.org: fixed conflict in init_new_task_load().]
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2016-03-23 21:25:21 -07:00
Joonwoo Park
615b6f6221 sched: allow select_prev_cpu_us to be set to values greater than 100us
At present the sched_select_prev_cpu_us tunable is restricted to values
below 100us.  Fix this unintended restriction.

CRs-Fixed: 972237
Change-Id: I5eaf9f40468805c396328ca1022baef32acf8de0
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2016-03-23 21:25:20 -07:00
Pavankumar Kondeti
fbeb32ce8f sched: clean up idle task's mark_start restoring in init_idle()
The idle task's mark_start can get updated even without the CPU being
online. Hence the mark_start is restored when the CPU is coming online.

The idle task's mark_start is reset in init_idle()->__sched_fork()->
init_new_task_load(). The original mark_start is saved and restored
later. This can be avoided by moving init_new_task_load() to
wake_up_new_task(), which never gets called for an idle task.

We only care about the idle task's ravg.mark_start, and not initializing
the other fields of the ravg struct has no side effects.

This clean up allows the subsequent patches to drop the rq->lock
while calling __sched_fork() in init_idle().

CRs-Fixed: 965873
Change-Id: I41de6d69944d7d44b9c4d11b2d97ad01bd8fe96d
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
[joonwoop@codeaurora.org: fixed a minor conflict in core.c.  omitted
 changes for CONFIG_SCHED_QHMP.]
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2016-03-23 21:25:19 -07:00
Pavankumar Kondeti
58d411413f sched: let sched_boost take precedence over sched_restrict_cluster_spill
When the sched_restrict_cluster_spill knob is enabled, RT tasks are restricted
to the lower power cluster. This knob also restricts inter-cluster no-hz kicks.
Ignore this knob setting when sched_boost is enabled so that tasks are
placed on the CPUs with the highest spare capacity.

CRs-Fixed: 968852
Change-Id: I01b3fc10b39dc834a733d64c2ee29c308d7ff730
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
2016-03-23 21:25:18 -07:00
Pavankumar Kondeti
6d742ce87b sched: Add separate load tracking histogram to predict loads
The current window based load tracking only saves history for five
windows. A historically heavy task's heavy load will be completely
forgotten after five windows of light load. Even before the five
windows expire, a heavy task that wakes up on the same CPU it used to
run on won't trigger any frequency change until the end of the window;
it would starve for the entire window. It also adds one "small" load
window to the history because it accumulates load at a low frequency,
further reducing the tracked load for this heavy task.

Ideally, the scheduler should be able to identify such tasks and notify
the governor to increase the frequency immediately after they wake up.

Add a histogram for each task to track a much longer load history. A
prediction will be made based on the runtime of the previous or current
window, the histogram data and the load tracked in recent windows. The
predictions of all tasks that are currently running or runnable on a CPU
are aggregated and reported to the CPUfreq governor in
sched_get_cpus_busy().
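
One plausible shape of such a per-task history, purely illustrative; the
depth, field names and the simple max-based predictor below are assumptions,
not the actual implementation.

/* Hypothetical per-task load history ring buffer used for prediction. */
#define LOAD_HIST_SIZE 32			/* assumed depth, much longer than 5 */

struct task_load_hist {
	unsigned int runtime[LOAD_HIST_SIZE];	/* per-window runtime samples */
	unsigned int head;			/* next slot to overwrite */
};

static void hist_add(struct task_load_hist *h, unsigned int runtime)
{
	h->runtime[h->head] = runtime;
	h->head = (h->head + 1) % LOAD_HIST_SIZE;
}

/* Toy predictor: never predict below the largest load seen in the history
 * or in the current window. */
static unsigned int hist_predict(const struct task_load_hist *h,
				 unsigned int current_window_runtime)
{
	unsigned int i, pred = current_window_runtime;

	for (i = 0; i < LOAD_HIST_SIZE; i++)
		if (h->runtime[i] > pred)
			pred = h->runtime[i];
	return pred;
}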

sched_get_cpus_busy() now returns predicted busy time in addition
to previous window busy time and new task busy time, scaled to
the CPU maximum possible frequency.

Tunables:

- /proc/sys/kernel/sched_gov_alert_freq (KHz)

This tunable can be used to further filter the notifications.
A frequency alert notification is sent only when the predicted
load exceeds the previous window load by sched_gov_alert_freq converted to
load.
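
For example, a minimal sketch of setting the filter from userspace (the path
is derived from the tunable name above; the value is arbitrary):

/* Example: only send a frequency alert notification when the predicted load
 * exceeds the previous window's load by 200000 KHz worth of load. */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/sys/kernel/sched_gov_alert_freq", "w");

	if (!f) {
		perror("sched_gov_alert_freq");
		return 1;
	}
	fprintf(f, "%d\n", 200000);	/* KHz; tune for the target platform */
	fclose(f);
	return 0;
}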

Change-Id: If29098cd2c5499163ceaff18668639db76ee8504
Suggested-by: Saravana Kannan <skannan@codeaurora.org>
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
Signed-off-by: Junjie Wu <junjiew@codeaurora.org>
[joonwoop@codeaurora.org: fixed merge conflicts around __migrate_task()
 and removed changes for CONFIG_SCHED_QHMP.]
2016-03-23 21:25:17 -07:00
Junjie Wu
efa673322f sched: Provide a wake up API without sending freq notifications
Each time a task wakes up, the scheduler evaluates its load and notifies
the governor if the resulting frequency of the destination CPU is larger
than a threshold. However, some governors wake up a separate task that
handles the frequency change, which again calls wake_up_process().

This is dangerous because if the task being woken up meets the
threshold and ends up being moved around, there is a potential for
endless recursive notifications.

Introduce a new API for waking up a task without triggering
frequency notification.

Change-Id: I24261af81b7dc410c7fb01eaa90920b8d66fbd2a
Signed-off-by: Junjie Wu <junjiew@codeaurora.org>
2016-03-23 21:25:17 -07:00
Pavankumar Kondeti
71a8c392b7 sched: Take downmigrate threshold into consideration
If tasks run on the higher capacity cluster solely because they cannot
fit in the lower capacity cluster, the downmigrate threshold prevents
frequent task migrations between the clusters.

Change-Id: I234a23ffd907c2476c94d5f6227dab1bb6c9bebb
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
2016-03-23 21:25:16 -07:00
Pavankumar Kondeti
6003b006be sched: Provide a facility to restrict RT tasks to lower power cluster
The current CPU selection algorithm for RT tasks looks for the
least loaded CPU in all clusters. Stop the search at the lowest
possible power cluster based on "sched_restrict_cluster_spill"
sysctl tunable.

Change-Id: I34fdaefea56e0d1b7e7178d800f1bb86aa0ec01c
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
2016-03-23 21:25:15 -07:00
Pavankumar Kondeti
8cd1d7ef16 sched: Take cluster's minimum power into account for optimizing sbc()
The select_best_cpu() algorithm iterates over all the clusters and
selects the most power efficient CPU that satisfies the task needs.
During the search, skip the next cluster if its minimum power cost
is higher than the power cost of an eligible CPU found in the previous
cluster.

In a b.L system, if the BIG cluster minimum power cost is higher than
the maximum power cost of the little cluster, this optimization avoids
searching the BIG cluster if an eligible CPU is found in the little
cluster.
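
A compact model of the pruning described above (all names are hypothetical):
clusters are visited in order of minimum power cost, and a cluster is skipped
once it cannot beat the best candidate already found.

/* Sketch of the select_best_cpu() optimization: skip a whole cluster when
 * even its cheapest CPU costs more than the best eligible CPU found so far. */
struct cluster_model {
	int min_power_cost;
	int ncpus;
	int cpu_power_cost[8];
};

int lowest_power_cost(const struct cluster_model *clusters, int nclusters)
{
	int best_cost = -1;	/* power cost of the best eligible CPU so far */
	int c, i;

	for (c = 0; c < nclusters; c++) {
		if (best_cost >= 0 && clusters[c].min_power_cost > best_cost)
			continue;	/* every CPU here costs at least min_power_cost */
		for (i = 0; i < clusters[c].ncpus; i++) {
			int cost = clusters[c].cpu_power_cost[i];

			if (best_cost < 0 || cost < best_cost)
				best_cost = cost;
		}
	}
	return best_cost;
}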

Change-Id: I5e3755f107edb6c72180edbec2a658be931c276d
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
2016-03-23 21:25:14 -07:00
Pavankumar Kondeti
6418f213ab sched: Revise the inter cluster load balance restrictions
The frequency based inter cluster load balance restrictions are not
reliable as frequency does not provide a good estimate of the CPU's
current load. Replace them with the spill_load and spill_nr_run
based checks.

The higher capacity cluster is restricted from pulling the tasks from
the lower capacity cluster unless all of the lower capacity CPUs are
above spill. This behavior can be controlled by a sysctl tunable and
it is disabled by default (i.e. no load balance restrictions).

Change-Id: I45c09c8adcb61a8a7d4e08beadf2f97f1805fb42
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
[joonwoop@codeaurora.org: fixed merge conflicts due to omitted changes
 for CONFIG_SCHED_QHMP.]
2016-03-23 21:25:13 -07:00
Srivatsa Vaddagiri
3004236139 sched: colocate related threads
Provide userspace interface for tasks to be grouped together as
"related" threads. For example, all threads involved in updating
display buffer could be tagged as related.

Scheduler will attempt to provide special treatment for group of
related threads such as:

1) Colocation of related threads in same "preferred" cluster
2) Aggregation of demand towards determination of cluster frequency

This patch extends scheduler to provide best-effort colocation support
for a group of related threads.

Change-Id: Ic2cd769faf5da4d03a8f3cb0ada6224d0101a5f5
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
[joonwoop@codeaurora.org: fixed minor merge conflicts.  removed ifdefry
 for CONFIG_SCHED_QHMP.]
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>

Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2016-03-23 21:25:12 -07:00
Srivatsa Vaddagiri
df6bfcaf70 sched: Update fair and rt placement logic to use scheduler clusters
Make use of clusters in the fair and rt scheduling classes. This is
needed as the freq domain mask can no longer be used to do correct
task placement. The freq domain mask was being used to demarcate
clusters.

Change-Id: I57f74147c7006f22d6760256926c10fd0bf50cbd
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
[joonwoop@codeaurora.org: fixed merge conflicts due to omitted changes
 for CONFIG_SCHED_QHMP.]
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2016-03-23 21:25:11 -07:00
Srivatsa Vaddagiri
cb1bb6a8f4 sched: Introduce the concept CPU clusters in the scheduler
A cluster is a set of CPUs sharing some power controls and an L2 cache.
This patch builds a list of clusters at bootup, sorted by their
max_power_cost. Many cluster-shared attributes like cur_freq,
max_freq etc. are currently maintained needlessly in the per-cpu 'struct rq'.
Consolidate them in a cluster structure.
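
A rough sketch of what such a cluster structure might look like (illustrative
only; the field names are assumptions):

/* Hypothetical consolidated cluster structure replacing the per-rq copies. */
struct sched_cluster {
	struct list_head list;		/* node in the boot-time list, sorted by
					   max_power_cost */
	cpumask_t cpus;			/* CPUs sharing power controls and L2 */
	int id;
	unsigned int cur_freq;		/* shared attributes kept once per cluster */
	unsigned int max_freq;
	unsigned int min_freq;
	unsigned int max_power_cost;
};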

Change-Id: I0567672ad5fb67d211d9336181ceb53b9f6023af
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
[joonwoop@codeaurora.org: fixed minor conflict in
 arch/arm64/kernel/topology.c. fixed conflict due to omitted changes for
 CONFIG_SCHED_QHMP.]
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
2016-03-23 21:25:10 -07:00
Satya Durga Srinivasu Prabhala
ffff87b7bf kernel/watchdog.c: fix compilation warning on Kernel 4.4
This change fixes the compilation warning below on Kernel 4.4.

watchdog.c:122:22: warning: 'hardlockup_allcpu_dumped' \
defined but not used [-Wunused-variable]

Signed-off-by: Satya Durga Srinivasu Prabhala <satyap@codeaurora.org>
2016-03-23 21:25:03 -07:00
Junjie Wu
0ea6cc5218 tracing: power: Add trace events for core control
Add trace events for core control module.

Change-Id: I36da5381709f81ef1ba82025cd9cf8610edef3fc
Signed-off-by: Junjie Wu <junjiew@codeaurora.org>
2016-03-23 21:24:30 -07:00
Jeevan Shriram
48d195bfd6 sched: remove init_new_task_load from CONFIG_SMP
Move the init_new_task_load() function out of CONFIG_SMP to avoid
a linking error for ARCH=um.

Signed-off-by: Jeevan Shriram <jshriram@codeaurora.org>
2016-03-23 21:24:22 -07:00
Junjie Wu
809d8460ca sched: Export sched_setscheduler_nocheck()
Export sched_setscheduler_nocheck() so that external kernel modules
can use it.

Change-Id: Ib50f537f5aef50c365ba63fb8ffce05bc1c7c431
Signed-off-by: Junjie Wu <junjiew@codeaurora.org>
Signed-off-by: Bryan Huntsman <bryanh@codeaurora.org>
2016-03-23 21:24:02 -07:00
Bryan Huntsman
f720d40148 Revert "sched: Export sched_setscheduler_nocheck"
This reverts commit 84778472e1.

Signed-off-by: Bryan Huntsman <bryanh@codeaurora.org>
2016-03-23 21:24:01 -07:00
Lingutla Chandrasekhar
2fbd2f64e4 time: alarmtimer: include lpm-levels for MSM targets only
The lpm-levels headers are required only when CONFIG_MSM_PM is set.

To compile the msm kernel for other targets (ARCH=um), add a config
check around the lpm-levels include.

Change-Id: Ia1bd51da4952e56b945a5e51a3b1ff8aaa643cd5
Signed-off-by: Lingutla Chandrasekhar <clingutla@codeaurora.org>
Signed-off-by: Jeevan Shriram <jshriram@codeaurora.org>
2016-03-23 21:23:46 -07:00
Mohit Aggarwal
4b81b05c57 rtc: alarm: Change wake-up source
Currently, RTC_ALARM is used to wake up the target from
the suspend state and is also used for the power-off alarm
feature. This patch uses the qtimer to wake up from
the suspend state.

Change-Id: Ia42cfecd573309be2f03c18b4f1c321be8202d7d
Signed-off-by: Mohit Aggarwal <maggarwa@codeaurora.org>
2016-03-23 21:23:46 -07:00
Vignesh Radhakrishnan
1dd9d8dd98 kmemleak : Make module scanning optional using config
Currently kmemleak scans module memory as provided
in the area list. This takes up a lot of time with
IRQs and preemption disabled. Provide a compile-time
configurable option to enable this functionality.

Change-Id: I5117705e7e6726acdf492e7f87c0703bc1f28da0
Signed-off-by: Vignesh Radhakrishnan <vigneshr@codeaurora.org>
Signed-off-by: Prasad Sodagudi <psodagud@codeaurora.org>
[satyap: trivial merge conflict resolution and remove duplicate entry]
Signed-off-by: Satya Durga Srinivasu Prabhala <satyap@codeaurora.org>
2016-03-23 21:23:43 -07:00
Vignesh Radhakrishnan
1723ce4914 smpboot: use kmemleak_not_leak for smpboot_thread_data
Kmemleak reports the following memory leak :

    [<ffffffc0002faef8>] create_object+0x140/0x274
    [<ffffffc000cc3598>] kmemleak_alloc+0x80/0xbc
    [<ffffffc0002f707c>] kmem_cache_alloc_trace+0x148/0x1d8
    [<ffffffc00024504c>] __smpboot_create_thread.part.2+0x2c/0xec
    [<ffffffc0002452b4>] smpboot_register_percpu_thread+0x90/0x118
    [<ffffffc0016067c0>] spawn_ksoftirqd+0x1c/0x30
    [<ffffffc000200824>] do_one_initcall+0xb0/0x14c
    [<ffffffc001600820>] kernel_init_freeable+0x84/0x1e0
    [<ffffffc000cc273c>] kernel_init+0x10/0xcc
    [<ffffffc000203bbc>] ret_from_fork+0xc/0x50

The memory allocated here points to smpboot_thread_data,
which is used as an argument for the kthread.

It will be used when smpboot_thread_fn() runs and therefore
is not a leak.

Call kmemleak_not_leak for smpboot_thread_data pointer
to ensure that kmemleak doesn't report it as a memory
leak.
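
The change boils down to a one-line annotation right after the allocation; a
simplified sketch of the pattern (kernel context, abbreviated):

/* Sketch: annotate the per-cpu thread data so kmemleak stops reporting it.
 * The pointer is handed to the kthread as its argument and stays reachable. */
td = kzalloc_node(sizeof(*td), GFP_KERNEL, cpu_to_node(cpu));
if (!td)
	return -ENOMEM;
kmemleak_not_leak(td);	/* referenced via the kthread, not a leak */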

Change-Id: I02b0a7debea3907b606856e069d63d7991b67cd9
Signed-off-by: Vignesh Radhakrishnan <vigneshr@codeaurora.org>
Signed-off-by: Prasad Sodagudi <psodagud@codeaurora.org>
2016-03-23 21:23:42 -07:00
Christoph Lameter
8fcc9882c3 vmstat: make vmstat_updater deferrable again and shut down on idle
Currently the vmstat updater is not deferrable as a result of commit
ba4877b9ca ("vmstat: do not use deferrable delayed work for
vmstat_update").  This in turn can cause multiple interruptions of the
applications because the vmstat updater may run at any time.

Make vmstat_update deferrable again and provide a function that folds
the differentials when the processor is going to idle mode thus
addressing the issue of the above commit in a clean way.

Note that the shepherd thread will continue scanning the differentials
from another processor and will reenable the vmstat workers if it
detects any changes.

Change-Id: Idf256cfacb40b4dc8dbb6795cf06b34e8fec7a06
Fixes: ba4877b9ca ("vmstat: do not use deferrable delayed work for vmstat_update")
Signed-off-by: Christoph Lameter <cl@linux.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Git-commit: 0eb77e9880321915322d42913c3b53241739c8aa
[shashim@codeaurora.org: resolve minor merge conflicts]
Signed-off-by: Shiraz Hashim <shashim@codeaurora.org>
[satyap: resolve trivial merge conflicts]
Signed-off-by: Satya Durga Srinivasu Prabhala <satyap@codeaurora.org>
2016-03-23 21:22:14 -07:00
Rusty Russell
606471708a modules: fix longstanding /proc/kallsyms vs module insertion race.
For CONFIG_KALLSYMS, we keep two symbol tables and two string tables.
There's one full copy, marked SHF_ALLOC and laid out at the end of the
module's init section.  There's also a cut-down version that only
contains core symbols and strings, and lives in the module's core
section.

After module init (and before we free the module memory), we switch
the mod->symtab, mod->num_symtab and mod->strtab to point to the core
versions.  We do this under the module_mutex.

However, kallsyms doesn't take the module_mutex: it uses
preempt_disable() and rcu tricks to walk through the modules, because
it's used in the oops path.  It's also used in /proc/kallsyms.
There's nothing atomic about the change of these variables, so we can
get the old (larger!) num_symtab and the new symtab pointer; in fact
this is what I saw when trying to reproduce.

By grouping these variables together, we can use a
carefully-dereferenced pointer to ensure we always get one or the
other (the free of the module init section is already done in an RCU
callback, so that's safe).  We allocate the init one at the end of the
module init section, and keep the core one inside the struct module
itself (it could also have been allocated at the end of the module
core, but that's probably overkill).
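
A simplified sketch of the grouping idea (the struct layout is abbreviated
and handle_symbol() is just a placeholder for the reader's work):

/* Keeping the three fields behind one pointer lets a reader snapshot all of
 * them consistently with a single dereference, instead of racing against
 * three separate assignments. */
struct mod_kallsyms {
	Elf_Sym *symtab;
	unsigned int num_symtab;
	char *strtab;
};

/* reader side (simplified): dereference once, then use only 'ks' */
struct mod_kallsyms *ks = rcu_dereference_sched(mod->kallsyms);

for (i = 0; i < ks->num_symtab; i++)
	handle_symbol(&ks->symtab[i], ks->strtab);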

CRs-Fixed: 982779
Change-Id: I519f081967785e44a6ea33b16b1da64b14979963
Reported-by: Weilong Chen <chenweilong@huawei.com>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=111541
Cc: stable@kernel.org
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Git-commit: 8244062ef1e54502ef55f54cced659913f244c3e
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
[salvares@codeaurora.org: resolved context conflicts in module.c]
Signed-off-by: Sanrio Alvares <salvares@codeaurora.org>
2016-03-23 21:21:40 -07:00
Vinayak Menon
ad6f486284 mm: swap: swap ratio support
Add support to receive a static ratio from userspace to
divide the swap pages between ZRAM and disk based swap
devices. The existing infrastructure allows keeping the
same priority for multiple swap devices, which results
in a round robin distribution of pages. With this patch,
the ratio can be defined.
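
A toy model of the ratio idea, entirely illustrative and not the kernel
interface: with a 70:30 ratio, 70 of every 100 page allocations would go to
the ZRAM device and the rest to the disk-based device.

/* Hypothetical model of ratio-based distribution between two swap devices. */
enum swap_dev { SWAP_ZRAM, SWAP_DISK };

static unsigned int zram_ratio = 70;	/* assumed ratio received from userspace */
static unsigned int handed_out;		/* position within the current 100 pages */

enum swap_dev pick_swap_device(void)
{
	enum swap_dev dev = (handed_out < zram_ratio) ? SWAP_ZRAM : SWAP_DISK;

	handed_out = (handed_out + 1) % 100;
	return dev;
}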

CRs-fixed: 968416
Change-Id: I54f54489db84cabb206569dd62d61a8a7a898991
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
2016-03-23 21:19:04 -07:00
Tingting Yang
77f8a75933 msm: move printk out of spin lock low_water_lock
CPU3 was stuck in printk for a long time while holding the low_water_lock
spin lock, causing CPU0 to fail to acquire the spin lock and the system
to crash.

CRs-Fixed: 521570
Change-Id: I75356a4b4171ae2888ce6cce792f569b5ca8cdcf
Signed-off-by: Tingting Yang <tingting@codeaurora.org>
Signed-off-by: Prasad Sodagudi <psodagud@codeaurora.org>
2016-03-23 21:19:01 -07:00
Srinivas Rao L
cf52f9b467 qos: Disable irq notifier when qos request is removed
When a qos request is added with IRQ affinity, an irq notifier
for that irq is added. If the irq is later freed without disabling
the irq notifier, a warning is thrown.
So disable the irq notifier when the qos request is removed.

Change-Id: I50faa4ecbe1b632c0f0f203ca52faf18753c33d4
Signed-off-by: Srinivas Rao L <lsrao@codeaurora.org>
2016-03-23 21:17:12 -07:00
Steven Cahail
d9ad69783b trace: ipc_logging: Increase maximum size of logging context name
The current maximum length of an IPC Logging log context name is 20
characters. Some clients are now using names longer than this.

Increase the maximum length to 32. This change increases the IPC Logging
version to 3.

Change-Id: I9daecb8a7c6c3aea427efd1c75e307456e9c6c21
Signed-off-by: Steven Cahail <scahail@codeaurora.org>
2016-03-23 21:15:11 -07:00
Steven Cahail
d3bc56a0d1 trace: ipc_logging: Destroy debugfs directories with logging contexts
IPC Logging currently creates debugfs directories for each of its
logging contexts, but does not remove them when destroying the contexts.
If a user attempts to access a directory associated with a destroyed
context, a crash will result.

Destroy debugfs directories when the logging context is destroyed.

Change-Id: I5a3b1cbf2fb5d9d0ede3d9da0fccd605b9fdf619
Signed-off-by: Steven Cahail <scahail@codeaurora.org>
2016-03-23 21:15:04 -07:00
Jeevan Shriram
ed8ae25f2a cgroup: fix uninitialized usage of a variable
It is possible that the 'root' variable is used uninitialized. This
change avoids the uninitialized usage of the variable.

Change-Id: I9a3bd941a23736cb003f209cf6dde84fd859e9e6
Signed-off-by: Jeevan Shriram <jshriram@codeaurora.org>
2016-03-23 21:12:09 -07:00
Prasad Sodagudi
ef45790690 printk: Add all cpu notifiers under CONSOLE_FLUSH_ON_HOTPLUG flag
Put all cpu notifiers under the CONSOLE_FLUSH_ON_HOTPLUG config
flag to avoid hotplug latencies; the CONSOLE_FLUSH_ON_HOTPLUG
config flag is disabled by default.

Change-Id: I389f207d8faf84cfd4267d52213e40a47a43774d
Signed-off-by: Prasad Sodagudi <psodagud@codeaurora.org>
2016-03-23 21:11:59 -07:00
Joonwoo Park
646bf5125d timer: make deferrable cpu unbound timers really not bound to a cpu
When a deferrable work (INIT_DEFERRABLE_WORK, etc.) is queued via
queue_delayed_work() it's probably intended to run the work item on any
CPU that isn't idle. However, we queue the work to run at a later time
by starting a deferrable timer that binds to whatever CPU the work is
queued on, which is effectively the same as
queue_delayed_work_on(smp_processor_id()).

As a result WORK_CPU_UNBOUND work items aren't really cpu unbound now.
In fact this is perfectly fine with a UP kernel and also won't affect an SMP
system without dyntick much, as every cpu runs timers periodically.  But on
SMP systems with dyntick the current implementation makes deferrable timers
not very scalable, because the timer base which has queued the deferrable
timer won't wake up till the next non-deferrable timer expires, even though
other non-idle cpus may be running which are able to run the expired
deferrable timers.

A deferrable work item is a good example of a victim of the current
implementation, as shown below.

INIT_DEFERRABLE_WORK(&dwork, fn);
CPU 0                                 CPU 1
queue_delayed_work(wq, &dwork, HZ);
    queue_delayed_work_on(WORK_CPU_UNBOUND);
        ...
	__mod_timer() -> queues timer to the
			 current cpu's timer
			 base.
	...
tick_nohz_idle_enter() -> cpu enters idle.
A second later
cpu 0 is now in idle.                 cpu 1 exits idle or wasn't in idle so
                                      now it's in active but won't
cpu 0 won't wake up till next         handle cpu unbound deferrable timer
non-deferrable timer expires.         as it's in cpu 0's timer base.

To make all cpu unbound deferrable timers scalable, introduce a common
timer base which is only for cpu unbound deferrable timers, making them
indeed cpu unbound so that they can be scheduled by tick_do_timer_cpu.
This common timer base fixes the scalability issue of delayed work and all
other implementations using cpu unbound deferrable timers.

Change-Id: I8b6c57d8b6445a76fa02a8cb598a8ef22aef7200
CC: Thomas Gleixner <tglx@linutronix.de>
CC: John Stultz <john.stultz@linaro.org>
CC: Tejun Heo <tj@kernel.org>
[joonwoop@codeaurora.org: timer->base replaced with CPU index so get
 the deferrable timer wheel from lock_timer_base() instead of
 do_init_timer().]
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2016-03-23 21:10:48 -07:00
Rohit Gupta
960dbb1751 perf: Skip permission checks on kernel owned perf events
The perf event permission checks are necessary because perf events
introduce a security concern: one userspace task can monitor another
task and gain security related information. That concern doesn't exist for
kernel owned perf events since the kernel already has access to
everything. So, skip permission checks for kernel owned perf events.

Change-Id: I7121f5e03cf6ce8f0bfc9b5a69488efb80a97051
Signed-off-by: Rohit Gupta <rohgup@codeaurora.org>
[satyap: trivial merge conflict resolution]
Signed-off-by: Satya Durga Srinivasu Prabhala <satyap@codeaurora.org>
2016-03-23 20:58:14 -07:00
Neil Leeder
e9e1c5a00f perf: pass correct parameter when removing event
An incorrect pointer to the event was being passed, instead of a
pointer to the remove-event struct. Pass the correct pointer.

Change-Id: I7c35c5bb3a14d74a9b36c3d1dbd7af0bf80e7efe
Signed-off-by: Neil Leeder <nleeder@codeaurora.org>
2016-03-23 20:58:13 -07:00
Neil Leeder
86cd54bb19 perf: support hotplug
Add support for hotplugged cpu cores.

Change-Id: I0538ed67f1ad90bbd0510a7ba137cb6d1ad42172
Signed-off-by: Neil Leeder <nleeder@codeaurora.org>
[satyap: trivial merge conflict resolution]
Signed-off-by: Satya Durga Srinivasu Prabhala <satyap@codeaurora.org>
2016-03-23 20:58:09 -07:00
Rohit Vaswani
fdeaa450d8 fixup! block/fs: keep track of the task that dirtied the page
2016-03-23 20:57:46 -07:00
Matt Wagantall
e1b2469529 trace: cpu_freq_switch: Add profiler for CPU frequency switch times
It is sometimes useful to profile how long CPU frequency switches
take, and traces have already been added for this purpose. Make
use of these and the trace_stat framework to generate statistical
histograms of frequency switch times in the following format:

 # cat /sys/kernel/debug/tracing/trace_stat/cpu_freq_switch
  CPU START_KHZ  END_KHZ COUNT AVG_US MIN_US MAX_US
    |         |        |     |      |      |      |
    0    384000  1512000     3   2787   1648   3418
    0    486000   384000     1   1129   1129   1129
    0   1458000   384000     1   3174   3174   3174
    0   1512000   384000     1   3265   3265   3265
    0   1512000   486000     1   3235   3235   3235
    0   1512000  1458000     1    213    213    213
    0   1512000  1512000     1      0      0      0

Profiling is disabled by default (since it does incur some
overhead). It can be enabled or re-disabled by echoing 1 or 0
to /sys/kernel/debug/tracing/cpu_freq_switch_profile_enabled.

Change-Id: I3ef7f9d681b7bd13bcaa031003b10312afe1aefe
Signed-off-by: Matt Wagantall <mattw@codeaurora.org>
2016-03-23 20:57:45 -07:00