Commit graph

22227 commits

Author SHA1 Message Date
Todd Kjos
8935b6b4d2 FIXUP: sched: Fix double-release of spinlock in move_queued_task
BUG: 29519455
Change-Id: I4d1c27a1b4bcbba03d4b175d170cfe1701a90ffd
2016-08-11 14:26:47 -07:00
Todd Kjos
740d312ce8 FIXUP: sched/fair: Fix hang during suspend in sched_group_energy
BUG: 29353986
Change-Id: I0d0d8d5c107a2e0bd219819e036091106bb40e11
2016-08-11 14:26:46 -07:00
Patrick Bellasi
5156b67204 FIXUP: sched: fix SchedFreq integration for both PELT and WALT
The current kernel allows to use either PELT or WALT to track CPUs utilizations.
One of the main differences between the two approaches is that PELT
tracks only utilization of SCHED_OTHER classes while WALT tracks all tasks
with a single signal.

The current sched_freq_tick does not make this distinction and, when WALT
is in use, we end up adding multiple time the contribution related to
the RT and DL classes. This patch fixes this issue by:

1. providing two different code paths for PELT and WALT, thus granting that
   when we switch to PELT we get the original behaviour based on the assumption
   that class aggregations is done underneath by SchedFreq.

2. avoiding the double accounting of DL and RT workloads, when WALT is in use,
   by just adding a margin to the original WALT signal when we need to check
   if the CFS capacity has to be increased.

Change-Id: I7326fd50e868e97fb5e12351917e9d2969bfdae7
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
2016-08-11 14:26:45 -07:00
Todd Kjos
782c9d64a1 sched: EAS: Avoid causing spikes to max-freq unnecessarily
During scheduler tick handling, the frequency was being set to
max-freq if the current frequency is less than the current
utilization. Change to just request "right" frequency instead
of max.

BUG: 29871410
Change-Id: I6fe65b14413da44b1520ba116f72320083eb92f8
2016-08-11 14:26:45 -07:00
Patrick Bellasi
dfc1151b46 FIXUP: sched: fix set_cfs_cpu_capacity when WALT is in use
The CPU utilization reported when WALT is in use already tracks the
contributions due to RT and DL workloads. However, SchedFreq exposes
different capacity update functions, one for each class, and does classes
utilization internally at update_cpu_capacity_request() call time.

This patch ensures that when WALT is in use, the:
  cpu_sched_capacity_reqs::cfs
value is tracking just the load generated by SCHED_OTHER tasks.

Change-Id: Ibd9c9a10874a1d91f62477034548f7664e57cd6a
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
2016-08-11 14:26:44 -07:00
Srinath Sridharan
519c62750e sched/walt: Accounting for number of irqs pending on each core
Schedules on a core whose irq count is less than a threshold.
Improves I/O performance of EAS.

Change-Id: I08ff7dd0d22502a0106fc636b1af2e6fe9e758b5
2016-08-11 14:26:43 -07:00
Srivatsa Vaddagiri
efb86bd08a sched: Introduce Window Assisted Load Tracking (WALT)
use a window based view of time in order to track task
demand and CPU utilization in the scheduler.

Window Assisted Load Tracking (WALT) implementation credits:
 Srivatsa Vaddagiri, Steve Muckle, Syed Rameez Mustafa, Joonwoo Park,
 Pavan Kumar Kondeti, Olav Haugan

2016-03-06: Integration with EAS/refactoring by Vikram Mulukutla
            and Todd Kjos

Change-Id: I21408236836625d4e7d7de1843d20ed5ff36c708

Includes fixes for issues:

eas/walt: Use walt_ktime_clock() instead of ktime_get_ns() to avoid a
race resulting in watchdog resets
BUG: 29353986
Change-Id: Ic1820e22a136f7c7ebd6f42e15f14d470f6bbbdb

Handle walt accounting anomoly during resume

During resume, there is a corner case where on wakeup, a task's
prev_runnable_sum can go negative. This is a workaround that
fixes the condition and warns (instead of crashing).

BUG: 29464099
Change-Id: I173e7874324b31a3584435530281708145773508

Signed-off-by: Todd Kjos <tkjos@google.com>
Signed-off-by: Srinath Sridharan <srinathsr@google.com>
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
[jstultz: fwdported to 4.4]
Signed-off-by: John Stultz <john.stultz@linaro.org>
2016-08-11 14:26:43 -07:00
Patrick Bellasi
345be81dba sched/tune: fix PB and PC cuts indexes definition
The current definition of the Performance Boost (PB) and Performance Constraint
(PC) regions is has two main issues:
1) in the computation of the boost index we overflow the thresholds_gains
   table for boost=100
2) the two cuts had _NOT_ the same ratio

The last point means that when boost=0 we do _not_ have a "standard" EAS
behaviour, i.e. accepting all candidate which decrease energy regardless
of their impact on performances. Instead, we accept only schedule candidate
which are in the Optimal region, i.e. decrease energy while increasing
performances.

This behaviour can have a negative impact also on CPU selection policies
which tries to spread tasks to reduce latencies. Indeed, for example
we could end up rejecting a schedule candidate which want to move a task
from a congested CPU to an idle one while, specifically in the case where
the target CPU will be running on a lower OPP.

This patch fixes these two issues by properly clamping the boost value
in the appropriate range to compute the threshold indexes as well as
by using the same threshold index for both cuts.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Srinath Sridharan <srinathsr@google.com>

sched/tune: fix update of threshold index for boost groups

When SchedTune is configured to work with CGroup mode, each time we update
the boost value of a group we do not update the threshed indexes for the
definition of the Performance Boost (PC) and Performance Constraint (PC)
region. This means that while the OPP boosting and CPU biasing selection
is working as expected, the __schedtune_accept_deltas function is always
using the initial values for these cuts.

This patch ensure that each time a new boost value is configured for a
boost group, the cuts for the PB and PC region are properly updated too.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Srinath Sridharan <srinathsr@google.com>

sched/tune: update PC and PB cuts definition

The current definition of Performance Boost (PB) and Performance
Constraint (PC) cuts defines two "dead regions":
- up to 20% boost: we are in energy-reduction only mode, i.e.
  accept all candidate which reduce energy
- over 70% boost: we are in performance-increase only mode, i.e.
  accept only sched candidate which do not reduce performances

This patch uses a more fine grained configuration where these two "dead
regions" are reduced to: up to 10% and over 90%.
This should allow to have some boosting benefits starting from 10% boost
values as well as not being to much permissive starting from boost values
of 80%.

Suggested-by: Leo Yan <leo.yan@linaro.org>
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Srinath Sridharan <srinathsr@google.com>

bug: 28312446
Change-Id: Ia326c66521e38c98e7a7eddbbb7c437875efa1ba

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
2016-08-11 14:26:42 -07:00
Todd Kjos
369bcbbae4 sched/fair: optimize idle cpu selection for boosted tasks
find_best_target CPU selection is biased towards lower CPU IDs. Bias
towards higher CPUs for boosted tasks. For boosted tasks unconditionally
use the idle CPU returned by find_best_target.

BUG: 29512132
Change-Id: I3d650051752163fcf3dc7909751d1fde3f9d17c0

Conflicts:
	kernel/sched/fair.c
2016-08-11 14:26:41 -07:00
Patrick Bellasi
af14760e19 FIXUP: sched/tune: fix accounting for runnable tasks
Contains:

sched/tune: fix accounting for runnable tasks (1/5)

The accounting for tasks into boost groups of different CPUs is currently
broken mainly because:
a) we do not properly track the change of boost group of a RUNNABLE task
b) there are race conditions between migration code and accounting code

This patch provides a fixes to ensure enqueue/dequeue
accounting also for throttled tasks.

Without this patch is can happen that a task is enqueued into a throttled
RQ thus not being accounted for the boosting of the corresponding RQ.
We could argue that a throttled task should not boost a CPU, however:
a) properly implementing CPU boosting considering throttled tasks will
   increase a lot the complexity of the solution
b) it's not easy to quantify the benefits introduced by such a more
   complex solution

Since task throttling requires the usage of the CFS bandwidth controller,
which is not widely used on mobile systems (at least not by Android kernels
so far), for the time being we go for the simple solution and boost also
for throttled RQs.

sched/tune: fix accounting for runnable tasks (2/5)

This patch provides the code required to enforce proper locking.
A per boost group spinlock has been added to grant atomic
accounting of tasks as well as to serialise enqueue/dequeue operations,
triggered by tasks migrations, with cgroups's attach/detach operations.

sched/tune: fix accounting for runnable tasks (3/5)

This patch adds cgroups {allow,can,cancel}_attach callbacks.

Since a task can be migrated between boost groups while it's running,
the CGroups's attach callbacks have been added to properly migrate
boost contributions of RUNNABLE tasks.

The RQ's lock is used to serialise enqueue/dequeue operations, triggered
by tasks migrations, with cgroups's attach/detach operations. While the
SchedTune's CPU lock is used to grant atrocity of the accounting within
the CPU.

NOTE: the current implementation does not allows a concurrent CPU migration
      and CGroups change.

sched/tune: fix accounting for runnable tasks (4/5)

This fixes accounting for exiting tasks by adding a dedicated call early
in the do_exit() syscall, which disables SchedTune accounting as soon as a
task is flagged PF_EXITING.

This flag is set before the multiple dequeue/enqueue dance triggered
by cgroup_exit() which is useful only to inject useless tasks movements
thus increasing possibilities for race conditions with the migration code.
The schedtune_exit_task() call does the last dequeue of a task from its
current boost group. This is a solution more aligned with what happens in
mainline kernels (>v4.4) where the exit_cgroup does not move anymore a dying
task to the root control group.

sched/tune: fix accounting for runnable tasks (5/5)

To avoid accounting issues at startup, this patch disable the SchedTune
accounting until the required data structures have been properly
initialized.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
[jstultz: fwdported to 4.4]
Signed-off-by: John Stultz <john.stultz@linaro.org>
2016-08-11 14:26:41 -07:00
Patrick Bellasi
7f8f24a0ea sched/tune: use a single initialisation function
With the introduction of initialization function required to compute the
energy normalization constants from DTB at boot time, we have now a
late_initcall which is already used by SchedTune.

This patch consolidate within that function the other initialization
bits which was previously deferred to the first CGroup creation.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
[jstultz: fwdported to 4.4]
Signed-off-by: John Stultz <john.stultz@linaro.org>
2016-08-10 15:21:52 -07:00
Patrick Bellasi
274bbcfbe4 sched/{fair,tune}: simplify fair.c code
The usage of conditional compiled code is discouraged in fair.c.

This patch clean up a bit fair.c by moving schedtune_{cpu.task}_boost
definitions into tune.h.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
2016-08-10 15:20:38 -07:00
Patrick Bellasi
254f5095a8 FIXUP: sched/tune: fix payoff calculation for boost region
The definition of the acceptance regions as well as the translation of
these regions into a payoff value was both wrong which turned out in:
a) a wrong definition of payoff for the performance boost region
b) a correct "by chance" definition of the payoff for the performance
   constraint region (i.e. two sign errors together fixing the formula)

This patch provides a better description of the cut regions as well as
a fixed version of the payoff computations, which are now reduced to a
single formula usable for both cases.

Reported-by: Leo Yan <leo.yan@linaro.org>
Reviewed-by: Leo Yan <leo.yan@linaro.org>
Signed-off-by: Leo Yan <leo.yan@linaro.org>
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
2016-08-10 15:18:40 -07:00
Srinath Sridharan
00aae8d5d5 sched/tune: Add support for negative boost values
Change-Id: I164ee04ba98c3a776605f18cb65ee61b3e917939

Contains also:

eas/stune: schedtune cpu boost_max must be non-negative.

This is to avoid under-accounting cpu capacity which may
cause task stacking and frequency spikes.

Change-Id: Ie1c1cbd52a6edb77b4c15a830030aa748dff6f29
2016-08-10 15:18:35 -07:00
Patrick Bellasi
6ba071d89d FIX: sched/tune: move schedtune_nornalize_energy into fair.c
The energy normalization function is required to get the proper values
for the P-E space filtering function to work.
That normalization is part of the hot wakeup path and currently implemented
with a function call.

Moving the normalization function into fair.c allows the compiler to
further optimize that code by reducing overheads in the wakeup hot path.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
[jstultz: fwdported to 4.4]
Signed-off-by: John Stultz <john.stultz@linaro.org>
2016-08-10 15:17:45 -07:00
Patrick Bellasi
28e8cb961c FIX: sched/tune: update usage of boosted task utilisation on CPU selection
A boosted task needs to be scheduled on a CPU which can grant a minimum
capacity which is higher than its utilization.
However, a task can be allocated on a CPU which already provides an utilization
which is higher than the task boosted utilization itself.
Moreover, with the previous approach a task 100% boosted is not fitting any
CPU.

This patch makes use of the boosted task utilization just as a threashold
which defines the minimum capacity should be available on a CPU to host that
task.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
2016-08-10 15:16:02 -07:00
Todd Kjos
c50cc2299c sched/fair: add tunable to set initial task load
The choice of initial task load upon fork has a large influence
on CPU and OPP selection when scheduler-driven DVFS is in use.
Make this tuneable by adding a new sysctl "sched_initial_task_util".

If the sched governor is not used, the default remains at SCHED_LOAD_SCALE
Otherwise, the value from the sysctl is used. This defaults to 0.

Signed-off-by: "Todd Kjos <tkjos@google.com>"
2016-08-10 15:15:55 -07:00
Juri Lelli
4a5e890ec6 sched/fair: add tunable to force selection at cpu granularity
EAS assumes that clusters with smaller capacity cores are more
energy-efficient. This may not be true on non-big-little devices,
so EAS can make incorrect cluster selections when finding a CPU
to wake. The "sched_is_big_little" hint can be used to cause a
cpu-based selection instead of cluster-based selection.

This change incorporates the addition of the sync hint enable patch

EAS did not honour synchronous wakeup hints, a new sysctl is
created to ask EAS to use this information when selecting a CPU.
The control is called "sched_sync_hint_enable".

Also contains:

EAS: sched/fair: for SMP bias toward idle core with capacity

For SMP devices, on wakeup bias towards idle cores that have capacity
vs busy devices that need a higher OPP

eas: favor idle cpus for boosted tasks

BUG: 29533997
BUG: 29512132
Change-Id: I0cc9a1b1b88fb52916f18bf2d25715bdc3634f9c
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Srinath Sridharan <srinathsr@google.com>

eas/sched/fair: Favoring busy cpus with low OPPs

BUG: 29533997
BUG: 29512132
Change-Id: I9305b3239698d64278db715a2e277ea0bb4ece79

Signed-off-by: Juri Lelli <juri.lelli@arm.com>
2016-08-10 15:15:46 -07:00
Srinath Sridharan
2e9abbc942 sched: EAS: take cstate into account when selecting idle core
Introduce a new sysctl for this option, 'sched_cstate_aware'.
When this is enabled, select_idle_sibling in CFS is modified to
choose the idle CPU in the sibling group which has the lowest
idle state index - idle state indexes are assumed to increase
as sleep depth and hence wakeup latency increase. In this way,
we attempt to minimise wakeup latency when an idle CPU is
required.

Signed-off-by: Srinath Sridharan <srinathsr@google.com>

Includes:
sched: EAS: fix select_idle_sibling

when sysctl_sched_cstate_aware is enabled, best_idle cpu will not be chosen
in the original flow because it will goto done directly

Bug: 30107557
Change-Id: Ie09c2e3960cafbb976f8d472747faefab3b4d6ac
Signed-off-by: martin_liu <martin_liu@htc.com>
2016-08-10 15:15:39 -07:00
Srinath Sridharan
d753e92e19 sched/cpufreq_sched: Consolidated update
Contains:

sched/cpufreq_sched: use shorter throttle for raising OPP

Avoid cases where a brief drop in load causes a change to a low OPP
for the full throttle period. Use a shorter throttle period for
raising OPP than for lowering OPP.

sched-freq: Fix handling of max/min frequency

This reverts commit 9726142608f5b3bf5df4280243c9d324e692a510.

Change-Id: Ia78095354f7ad9492f00deb509a2b45112361eda

sched/cpufreq: Increasing throttle_down_nsec to 50ms

Change-Id: I2d8969cf2a64fa719b9dd86f43f9dd14b1ff84fe

sched-freq: make throttle times tunable

Change-Id: I127879645367425b273441d7f0306bb15d5633cb

Signed-off-by: Srinath Sridharan <srinathsr@google.com>
Signed-off-by: Todd Kjos <tkjos@google.com>
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
[jstultz: Fwdported to 4.4]
Signed-off-by: John Stultz <john.stultz@linaro.org>
2016-08-10 15:14:51 -07:00
Patrick Bellasi
765c2ab363 FIXUP: sched: fix build for non-SMP target
Currently the build for a single-core (e.g. user-mode) Linux is broken
and this configuration is required (at least) to run some network tests.

The main issues for the current code support on single-core systems are:
1. {se,rq}::sched_avg is not available nor maintained for !SMP systems
   This means that load and utilisation signals are NOT available in single
   core systems. All the EAS code depends on these signals.
2. sched_group_energy is also SMP dependant. Again this means that all the
   EAS setup and preparation code (energyn model initialization) has to be
   properly guarded/disabled for !SMP systems.
3. SchedFreq depends on utilization signal, which is not available on
   !SMP systems.
4. SchedTune is useless on unicore systems if SchedFreq is not available.
5. WALT machinery is not required on single-core systems.

This patch addresses all these issues by enforcing some constraints for
single-core systems:
a) WALT, SchedTune and SchedTune are now dependant on SMP
b) The default governor for !SMP systems is INTERACTIVE
c) The energy model initialisation/build functions are
d) Other minor code re-arrangements and CONFIG_SMP guarding to enable
   single core builds.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
2016-08-10 15:08:01 -07:00
Patrick Bellasi
ae54c7741f DEBUG: sched/tune: add tracepoint on P-E space filtering
Change-Id: I31dfed67c0486713b88efb75df767329f2802e06
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
2016-08-10 15:07:28 -07:00
Patrick Bellasi
4525aa343e DEBUG: sched/tune: add tracepoint for energy_diff() values
Change-Id: Id8fafbd85f6d81248f322e073ee790a7ceec0bf7
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
2016-08-10 15:07:07 -07:00
Patrick Bellasi
962b7c10ab DEBUG: sched/tune: add tracepoint for task boost signal
Change-Id: I545d3bf5569fc41c0fa70f51dff9a19c11d532ee
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
2016-08-10 15:06:58 -07:00
Joseph Lo
3a400abdc5 CHROMIUM: sched: update the average of nr_running
Doing a Exponential moving average per nr_running++/-- does not
guarantee a fixed sample rate which induces errors if there are lots of
threads being enqueued/dequeued from the rq (Linpack mt). Instead of
keeping track of the avg, the scheduler now keeps track of the integral
of nr_running and allows the readers to perform filtering on top.

Original-author: Sai Charan Gurrappadi <sgurrappadi@nvidia.com>

Change-Id: Id946654f32fa8be0eaf9d8fa7c9a8039b5ef9fab
Signed-off-by: Joseph Lo <josephl@nvidia.com>
Signed-off-by: Andrew Bresticker <abrestic@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/174694
Reviewed-on: https://chromium-review.googlesource.com/272853
[jstultz: fwdported to 4.4]
Signed-off-by: John Stultz <john.stultz@linaro.org>
2016-08-10 15:01:22 -07:00
Alexey Dobriyan
470f47fcf2 posix_cpu_timer: Exit early when process has been reaped
commit 2c13ce8f6b2f6fd9ba2f9261b1939fc0f62d1307 upstream.

Variable "now" seems to be genuinely used unintialized
if branch

	if (CPUCLOCK_PERTHREAD(timer->it_clock)) {

is not taken and branch

	if (unlikely(sighand == NULL)) {

is taken. In this case the process has been reaped and the timer is marked as
disarmed anyway. So none of the postprocessing of the sample is
required. Return right away.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Link: http://lkml.kernel.org/r/20160707223911.GA26483@p183.telecom.by
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-08-10 11:49:29 +02:00
Peter Zijlstra
34bf12312b sched/fair: Fix effective_load() to consistently use smoothed load
commit 7dd4912594daf769a46744848b05bd5bc6d62469 upstream.

Starting with the following commit:

  fde7d22e01 ("sched/fair: Fix overly small weight for interactive group entities")

calc_tg_weight() doesn't compute the right value as expected by effective_load().

The difference is in the 'correction' term. In order to ensure \Sum
rw_j >= rw_i we cannot use tg->load_avg directly, since that might be
lagging a correction on the current cfs_rq->avg.load_avg value.
Therefore we use tg->load_avg - cfs_rq->tg_load_avg_contrib +
cfs_rq->avg.load_avg.

Now, per the referenced commit, calc_tg_weight() doesn't use
cfs_rq->avg.load_avg, as is later used in @w, but uses
cfs_rq->load.weight instead.

So stop using calc_tg_weight() and do it explicitly.

The effects of this bug are wake_affine() making randomly
poor choices in cgroup-intense workloads.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: fde7d22e01 ("sched/fair: Fix overly small weight for interactive group entities")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-08-10 11:49:28 +02:00
Tejun Heo
75d6026fd7 cgroup: set css->id to -1 during init
commit 8fa3b8d689a54d6d04ff7803c724fb7aca6ce98e upstream.

If percpu_ref initialization fails during css_create(), the free path
can end up trying to free css->id of zero.  As ID 0 is unused, it
doesn't cause a critical breakage but it does trigger a warning
message.  Fix it by setting css->id to -1 from init_and_link_css().

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Wenwei Tao <ww.tao0320@gmail.com>
Fixes: 01e586598b ("cgroup: release css->id after css_free")
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-08-10 11:49:27 +02:00
Andrey Ryabinin
dc20f3244a kernel/sysrq, watchdog, sched/core: Reset watchdog on all CPUs while processing sysrq-w
commit 57675cb976eff977aefb428e68e4e0236d48a9ff upstream.

Lengthy output of sysrq-w may take a lot of time on slow serial console.

Currently we reset NMI-watchdog on the current CPU to avoid spurious
lockup messages. Sometimes this doesn't work since softlockup watchdog
might trigger on another CPU which is waiting for an IPI to proceed.
We reset softlockup watchdogs on all CPUs, but we do this only after
listing all tasks, and this may be too late on a busy system.

So, reset watchdogs CPUs earlier, in for_each_process_thread() loop.

Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1465474805-14641-1-git-send-email-aryabinin@virtuozzo.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-08-10 11:49:25 +02:00
Dmitry Shmidt
b558f17a13 This is the 4.4.16 stable release
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJXmOXmAAoJEDjbvchgkmk+QYIP/1S8oBZsvjfDzvH8t63HyLeH
 i43MFlYoFAqUIZc002XpluSvZ8uHoG+r7R8Hq3wmv48wxe3M6OBnMdBVTht6mPw+
 t5OLTZr40lWaJm2EIi4aekueMIrCgmL+Et+IFYv7ZVBuYLteVcfny+zdq4EqGmgj
 /a19+L/sTTr4SHtJIhHxWhiVJ9fVMgQk/N3VgQmIiNF2+lVbiFI7QQiDPLbFl0KK
 CM4ETO22HxHCYilGpzhpSMsHCxv12VqNaXNLAsPAepGGW7PqvUmrEWAqgwsbOfRc
 GxTLNk0dUgJqMrfEpQ8ZOMlgzvCAYG2jZuNSuT+nuzrWSUP+WOGRi9TTTxp1CYuZ
 PHlhNTH7ZnqosxJUUZS2d9N5ygpqD48Rhlfl824YzOWCy94VeUnedkVLb20uJwPF
 Y5aQ5WjktBC9why5e4OgGQERvx/U9KTk8E1zRfZZPc2oft9My0YxuemjjKAKZiYN
 ne4WhXbgOJTQkAoZwh2xqny3bWyEaoSrWpQ3R7bBJ9SIRLEOdCKzKpduDbAnbMP7
 QWgQOQC/6qA1mKqjrqF4KPA1Quo9PcUK2Ajh523ewMGCowgY90vyejAgh4Q8g0GC
 fKlx+jJDoKVDbQ8v4hc9PPHMsNNIKT9a1ptwVS3lE+bq1D5Ffm57A4/uOTMYHVab
 gKqu8h1CA0MCVBsH3nNA
 =nY8S
 -----END PGP SIGNATURE-----

Merge tag 'v4.4.16' into android-4.4.y

This is the 4.4.16 stable release

Change-Id: Ibaf7b7e03695e1acebc654a2ca1a4bfcc48fcea4
2016-08-01 15:57:55 -07:00
Amit Pundir
818aa36ea8 Revert "panic: Add board ID to panic output"
This reverts commit 4e09c51018.

I checked for the usage of this debug helper in AOSP common kernels as
well as vendor kernels (e.g exynos, msm, mediatek, omap, tegra, x86,
x86_64) hosted at https://android.googlesource.com/kernel/ and I found
out that other than few fairly obsolete Omap trees (for tuna & Glass)
and Exynos tree (for Manta), there is no active user of this debug
helper. So we can safely remove this helper code.

Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
2016-08-01 11:17:00 -07:00
Steven Rostedt (Red Hat)
bc64a83932 tracing: Handle NULL formats in hold_module_trace_bprintk_format()
commit 70c8217acd4383e069fe1898bbad36ea4fcdbdcc upstream.

If a task uses a non constant string for the format parameter in
trace_printk(), then the trace_printk_fmt variable is set to NULL. This
variable is then saved in the __trace_printk_fmt section.

The function hold_module_trace_bprintk_format() checks to see if duplicate
formats are used by modules, and reuses them if so (saves them to the list
if it is new). But this function calls lookup_format() that does a strcmp()
to the value (which is now NULL) and can cause a kernel oops.

This wasn't an issue till 3debb0a9ddb ("tracing: Fix trace_printk() to print
when not using bprintk()") which added "__used" to the trace_printk_fmt
variable, and before that, the kernel simply optimized it out (no NULL value
was saved).

The fix is simply to handle the NULL pointer in lookup_format() and have the
caller ignore the value if it was NULL.

Link: http://lkml.kernel.org/r/1464769870-18344-1-git-send-email-zhengjun.xing@intel.com

Reported-by: xingzhen <zhengjun.xing@intel.com>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Fixes: 3debb0a9ddb ("tracing: Fix trace_printk() to print when not using bprintk()")
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-07-27 09:47:32 -07:00
Peter Zijlstra
43b1bfec0e sched/fair: Fix cfs_rq avg tracking underflow
commit 8974189222159154c55f24ddad33e3613960521a upstream.

As per commit:

  b7fa30c9cc48 ("sched/fair: Fix post_init_entity_util_avg() serialization")

> the code generated from update_cfs_rq_load_avg():
>
> 	if (atomic_long_read(&cfs_rq->removed_load_avg)) {
> 		s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
> 		sa->load_avg = max_t(long, sa->load_avg - r, 0);
> 		sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);
> 		removed_load = 1;
> 	}
>
> turns into:
>
> ffffffff81087064:       49 8b 85 98 00 00 00    mov    0x98(%r13),%rax
> ffffffff8108706b:       48 85 c0                test   %rax,%rax
> ffffffff8108706e:       74 40                   je     ffffffff810870b0 <update_blocked_averages+0xc0>
> ffffffff81087070:       4c 89 f8                mov    %r15,%rax
> ffffffff81087073:       49 87 85 98 00 00 00    xchg   %rax,0x98(%r13)
> ffffffff8108707a:       49 29 45 70             sub    %rax,0x70(%r13)
> ffffffff8108707e:       4c 89 f9                mov    %r15,%rcx
> ffffffff81087081:       bb 01 00 00 00          mov    $0x1,%ebx
> ffffffff81087086:       49 83 7d 70 00          cmpq   $0x0,0x70(%r13)
> ffffffff8108708b:       49 0f 49 4d 70          cmovns 0x70(%r13),%rcx
>
> Which you'll note ends up with sa->load_avg -= r in memory at
> ffffffff8108707a.

So I _should_ have looked at other unserialized users of ->load_avg,
but alas. Luckily nikbor reported a similar /0 from task_h_load() which
instantly triggered recollection of this here problem.

Aside from the intermediate value hitting memory and causing problems,
there's another problem: the underflow detection relies on the signed
bit. This reduces the effective width of the variables, IOW its
effectively the same as having these variables be of signed type.

This patch changes to a different means of unsigned underflow
detection to not rely on the signed bit. This allows the variables to
use the 'full' unsigned range. And it does so with explicit LOAD -
STORE to ensure any intermediate value will never be visible in
memory, allowing these unserialized loads.

Note: GCC generates crap code for this, might warrant a look later.

Note2: I say 'full' above, if we end up at U*_MAX we'll still explode;
       maybe we should do clamping on add too.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yuyang Du <yuyang.du@intel.com>
Cc: bsegall@google.com
Cc: kernel@kyup.com
Cc: morten.rasmussen@arm.com
Cc: pjt@google.com
Cc: steve.muckle@linaro.org
Fixes: 9d89c257df ("sched/fair: Rewrite runnable load and utilization average tracking")
Link: http://lkml.kernel.org/r/20160617091948.GJ30927@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-07-27 09:47:31 -07:00
Paolo Bonzini
71ef2c1131 locking/static_key: Fix concurrent static_key_slow_inc()
commit 4c5ea0a9cd02d6aa8adc86e100b2a4cff8d614ff upstream.

The following scenario is possible:

    CPU 1                                   CPU 2
    static_key_slow_inc()
     atomic_inc_not_zero()
      -> key.enabled == 0, no increment
     jump_label_lock()
     atomic_inc_return()
      -> key.enabled == 1 now
                                            static_key_slow_inc()
                                             atomic_inc_not_zero()
                                              -> key.enabled == 1, inc to 2
                                             return
                                            ** static key is wrong!
     jump_label_update()
     jump_label_unlock()

Testing the static key at the point marked by (**) will follow the
wrong path for jumps that have not been patched yet.  This can
actually happen when creating many KVM virtual machines with userspace
LAPIC emulation; just run several copies of the following program:

    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    int main(void)
    {
        for (;;) {
            int kvmfd = open("/dev/kvm", O_RDONLY);
            int vmfd = ioctl(kvmfd, KVM_CREATE_VM, 0);
            close(ioctl(vmfd, KVM_CREATE_VCPU, 1));
            close(vmfd);
            close(kvmfd);
        }
        return 0;
    }

Every KVM_CREATE_VCPU ioctl will attempt a static_key_slow_inc() call.
The static key's purpose is to skip NULL pointer checks and indeed one
of the processes eventually dereferences NULL.

As explained in the commit that introduced the bug:

  706249c222 ("locking/static_keys: Rework update logic")

jump_label_update() needs key.enabled to be true.  The solution adopted
here is to temporarily make key.enabled == -1, and use go down the
slow path when key.enabled <= 0.

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 706249c222 ("locking/static_keys: Rework update logic")
Link: http://lkml.kernel.org/r/1466527937-69798-1-git-send-email-pbonzini@redhat.com
[ Small stylistic edits to the changelog and the code. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-07-27 09:47:29 -07:00
Peter Zijlstra
a39e660a55 locking/qspinlock: Fix spin_unlock_wait() some more
commit 2c610022711675ee908b903d242f0b90e1db661f upstream.

While this prior commit:

  54cf809b9512 ("locking,qspinlock: Fix spin_is_locked() and spin_unlock_wait()")

... fixes spin_is_locked() and spin_unlock_wait() for the usage
in ipc/sem and netfilter, it does not in fact work right for the
usage in task_work and futex.

So while the 2 locks crossed problem:

	spin_lock(A)		spin_lock(B)
	if (!spin_is_locked(B)) spin_unlock_wait(A)
	  foo()			foo();

... works with the smp_mb() injected by both spin_is_locked() and
spin_unlock_wait(), this is not sufficient for:

	flag = 1;
	smp_mb();		spin_lock()
	spin_unlock_wait()	if (!flag)
				  // add to lockless list
	// iterate lockless list

... because in this scenario, the store from spin_lock() can be delayed
past the load of flag, uncrossing the variables and loosing the
guarantee.

This patch reworks spin_is_locked() and spin_unlock_wait() to work in
both cases by exploiting the observation that while the lock byte
store can be delayed, the contender must have registered itself
visibly in other state contained in the word.

It also allows for architectures to override both functions, as PPC
and ARM64 have an additional issue for which we currently have no
generic solution.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Giovanni Gherdovich <ggherdovich@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <waiman.long@hpe.com>
Cc: Will Deacon <will.deacon@arm.com>
Fixes: 54cf809b9512 ("locking,qspinlock: Fix spin_is_locked() and spin_unlock_wait()")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-07-27 09:47:29 -07:00
Chris Wilson
c7f47e59c3 locking/ww_mutex: Report recursive ww_mutex locking early
commit 0422e83d84ae24b933e4b0d4c1e0f0b4ae8a0a3b upstream.

Recursive locking for ww_mutexes was originally conceived as an
exception. However, it is heavily used by the DRM atomic modesetting
code. Currently, the recursive deadlock is checked after we have queued
up for a busy-spin and as we never release the lock, we spin until
kicked, whereupon the deadlock is discovered and reported.

A simple solution for the now common problem is to move the recursive
deadlock discovery to the first action when taking the ww_mutex.

Suggested-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1464293297-19777-1-git-send-email-chris@chris-wilson.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-07-27 09:47:29 -07:00
Jann Horn
85a09f2732 UPSTREAM: sched: panic on corrupted stack end
(cherry picked from commit 29d6455178a09e1dc340380c582b13356227e8df)

Until now, hitting this BUG_ON caused a recursive oops (because oops
handling involves do_exit(), which calls into the scheduler, which in
turn raises an oops), which caused stuff below the stack to be
overwritten until a panic happened (e.g.  via an oops in interrupt
context, caused by the overwritten CPU index in the thread_info).

Just panic directly.

Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change-Id: Ia3acb3f747f7a58ec2d071644433b0591925969f
Bug: 29444228
2016-07-25 13:03:32 -07:00
Riley Andrews
dd802a9c97 cpuset: Add allow_attach hook for cpusets on android.
This patch provides a allow_attach hook for cpusets,
which resolves lots of the following logcat noise.

W SchedPolicy: add_tid_to_cgroup failed to write '2816' (Permission denied); fd=29
W ActivityManager: Failed setting process group of 2816 to 0
W System.err: java.lang.IllegalArgumentException
W System.err:    at android.os.Process.setProcessGroup(Native Method)
W System.err:    at com.android.server.am.ActivityManagerService.applyOomAdjLocked(ActivityManagerService.java:18763)
W System.err:    at com.android.server.am.ActivityManagerService.updateOomAdjLocked(ActivityManagerService.java:19028)
W System.err:    at com.android.server.am.ActivityManagerService.updateOomAdjLocked(ActivityManagerService.java:19106)
W System.err:    at com.android.server.am.ActiveServices.serviceDoneExecutingLocked(ActiveServices.java:2015)
W System.err:    at com.android.server.am.ActiveServices.publishServiceLocked(ActiveServices.java:905)
W System.err:    at com.android.server.am.ActivityManagerService.publishService(ActivityManagerService.java:16065)
W System.err:    at android.app.ActivityManagerNative.onTransact(ActivityManagerNative.java:1007)
W System.err:    at com.android.server.am.ActivityManagerService.onTransact(ActivityManagerService.java:2493)
W System.err:    at android.os.Binder.execTransact(Binder.java:453)

Change-Id: Ic1b61b2bbb7ce74c9e9422b5e22ee9078251de21
[Ported to 4.4, added commit message]
Signed-off-by: John Stultz <john.stultz@linaro.org>
2016-07-21 09:09:02 -07:00
Daniel Borkmann
11bef1439d bpf, perf: delay release of BPF prog after grace period
[ Upstream commit ceb56070359b7329b5678b5d95a376fcb24767be ]

Commit dead9f29dd ("perf: Fix race in BPF program unregister") moved
destruction of BPF program from free_event_rcu() callback to __free_event(),
which is problematic if used with tail calls: if prog A is attached as
trace event directly, but at the same time present in a tail call map used
by another trace event program elsewhere, then we need to delay destruction
via RCU grace period since it can still be in use by the program doing the
tail call (the prog first needs to be dropped from the tail call map, then
trace event with prog A attached destroyed, so we get immediate destruction).

Fixes: dead9f29dd ("perf: Fix race in BPF program unregister")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Jann Horn <jann@thejh.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-07-11 09:31:11 -07:00
Jann Horn
e6ec611add BACKPORT: ptrace: use fsuid, fsgid, effective creds for fs access checks
This patch backports 969624b (which backports caaee6234d0 upstream),
from the v4.4-stable branch to the common/android-4.4 branch.

This patch is needed to provide the PTRACE_MODE_ATTACH_FSCREDS definition
which was used by the backported version of proc/<tid>/timerslack_ns
in change-id: Ie5799b9a3402a31f88cd46437dcda4a0e46415a7

commit caaee6234d05a58c5b4d05e7bf766131b810a657 upstream.

By checking the effective credentials instead of the real UID / permitted
capabilities, ensure that the calling process actually intended to use its
credentials.

To ensure that all ptrace checks use the correct caller credentials (e.g.
in case out-of-tree code or newly added code omits the PTRACE_MODE_*CREDS
flag), use two new flags and require one of them to be set.

The problem was that when a privileged task had temporarily dropped its
privileges, e.g.  by calling setreuid(0, user_uid), with the intent to
perform following syscalls with the credentials of a user, it still passed
ptrace access checks that the user would not be able to pass.

While an attacker should not be able to convince the privileged task to
perform a ptrace() syscall, this is a problem because the ptrace access
check is reused for things in procfs.

In particular, the following somewhat interesting procfs entries only rely
on ptrace access checks:

 /proc/$pid/stat - uses the check for determining whether pointers
     should be visible, useful for bypassing ASLR
 /proc/$pid/maps - also useful for bypassing ASLR
 /proc/$pid/cwd - useful for gaining access to restricted
     directories that contain files with lax permissions, e.g. in
     this scenario:
     lrwxrwxrwx root root /proc/13020/cwd -> /root/foobar
     drwx------ root root /root
     drwxr-xr-x root root /root/foobar
     -rw-r--r-- root root /root/foobar/secret

Therefore, on a system where a root-owned mode 6755 binary changes its
effective credentials as described and then dumps a user-specified file,
this could be used by an attacker to reveal the memory layout of root's
processes or reveal the contents of files he is not allowed to access
(through /proc/$pid/cwd).

[akpm@linux-foundation.org: fix warning]
Signed-off-by: Jann Horn <jann@thejh.net>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Casey Schaufler <casey@schaufler-ca.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Morris <james.l.morris@oracle.com>
Cc: "Serge E. Hallyn" <serge.hallyn@ubuntu.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Willy Tarreau <w@1wt.eu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[jstultz: Cherry-picked for common/android-4.4]
Signed-off-by: John Stultz <john.stultz@linaro.org>
2016-07-07 16:30:43 -07:00
John Stultz
286d25ba46 BACKPORT: timer: convert timer_slack_ns from unsigned long to u64
This backports da8b44d5a9f8bf26da637b7336508ca534d6b319 from upstream.

This patchset introduces a /proc/<pid>/timerslack_ns interface which
would allow controlling processes to be able to set the timerslack value
on other processes in order to save power by avoiding wakeups (Something
Android currently does via out-of-tree patches).

The first patch tries to fix the internal timer_slack_ns usage which was
defined as a long, which limits the slack range to ~4 seconds on 32bit
systems.  It converts it to a u64, which provides the same basically
unlimited slack (500 years) on both 32bit and 64bit machines.

The second patch introduces the /proc/<pid>/timerslack_ns interface
which allows the full 64bit slack range for a task to be read or set on
both 32bit and 64bit machines.

With these two patches, on a 32bit machine, after setting the slack on
bash to 10 seconds:

$ time sleep 1

real    0m10.747s
user    0m0.001s
sys     0m0.005s

The first patch is a little ugly, since I had to chase the slack delta
arguments through a number of functions converting them to u64s.  Let me
know if it makes sense to break that up more or not.

Other than that things are fairly straightforward.

This patch (of 2):

The timer_slack_ns value in the task struct is currently a unsigned
long.  This means that on 32bit applications, the maximum slack is just
over 4 seconds.  However, on 64bit machines, its much much larger (~500
years).

This disparity could make application development a little (as well as
the default_slack) to a u64.  This means both 32bit and 64bit systems
have the same effective internal slack range.

Now the existing ABI via PR_GET_TIMERSLACK and PR_SET_TIMERSLACK specify
the interface as a unsigned long, so we preserve that limitation on
32bit systems, where SET_TIMERSLACK can only set the slack to a unsigned
long value, and GET_TIMERSLACK will return ULONG_MAX if the slack is
actually larger then what can be stored by an unsigned long.

This patch also modifies hrtimer functions which specified the slack
delta as a unsigned long.

Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Oren Laadan <orenl@cellrox.com>
Cc: Ruchi Kandoi <kandoiruchi@google.com>
Cc: Rom Lemarchand <romlem@android.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Android Kernel Team <kernel-team@android.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-01 18:12:06 -07:00
Jann Horn
c08b1a593a sched: panic on corrupted stack end
commit 29d6455178a09e1dc340380c582b13356227e8df upstream.

Until now, hitting this BUG_ON caused a recursive oops (because oops
handling involves do_exit(), which calls into the scheduler, which in
turn raises an oops), which caused stuff below the stack to be
overwritten until a panic happened (e.g.  via an oops in interrupt
context, caused by the overwritten CPU index in the thread_info).

Just panic directly.

Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-06-24 10:18:20 -07:00
Daniel Borkmann
bfe951d547 bpf, inode: disallow userns mounts
[ Upstream commit 612bacad78ba6d0a91166fc4487af114bac172a8 ]

Follow-up to commit e27f4a942a0e ("bpf: Use mount_nodev not mount_ns
to mount the bpf filesystem"), which removes the FS_USERNS_MOUNT flag.

The original idea was to have a per mountns instance instead of a
single global fs instance, but that didn't work out and we had to
switch to mount_nodev() model. The intent of that middle ground was
that we avoid users who don't play nice to create endless instances
of bpf fs which are difficult to control and discover from an admin
point of view, but at the same time it would have allowed us to be
more flexible with regard to namespaces.

Therefore, since we now did the switch to mount_nodev() as a fix
where individual instances are created, we also need to remove userns
mount flag along with it to avoid running into mentioned situation.
I don't expect any breakage at this early point in time with removing
the flag and we can revisit this later should the requirement for
this come up with future users. This and commit e27f4a942a0e have
been split to facilitate tracking should any of them run into the
unlikely case of causing a regression.

Fixes: b2197755b2 ("bpf: add support for persistent maps/progs")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-06-24 10:18:17 -07:00
Eric W. Biederman
5b7ea922e1 bpf: Use mount_nodev not mount_ns to mount the bpf filesystem
[ Upstream commit e27f4a942a0ee4b84567a3c6cfa84f273e55cbb7 ]

While reviewing the filesystems that set FS_USERNS_MOUNT I spotted the
bpf filesystem.  Looking at the code I saw a broken usage of mount_ns
with current->nsproxy->mnt_ns. As the code does not acquire a
reference to the mount namespace it can not possibly be correct to
store the mount namespace on the superblock as it does.

Replace mount_ns with mount_nodev so that each mount of the bpf
filesystem returns a distinct instance, and the code is not buggy.

In discussion with Hannes Frederic Sowa it was reported that the use
of mount_ns was an attempt to have one bpf instance per mount
namespace, in an attempt to keep resources that pin resources from
hiding.  That intent simply does not work, the vfs is not built to
allow that kind of behavior.  Which means that the bpf filesystem
really is buggy both semantically and in it's implemenation as it does
not nor can it implement the original intent.

This change is userspace visible, but my experience with similar
filesystems leads me to believe nothing will break with a model of each
mount of the bpf filesystem is distinct from all others.

Fixes: b2197755b2 ("bpf: add support for persistent maps/progs")
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-06-24 10:18:16 -07:00
Thierry Strudel
872e24621a ANDROID: cpu: send KOBJ_ONLINE event when enabling cpus
In case some sysfs nodes needs to be labeled with a different label than
sysfs then user needs to be notified when a core is brought back online.

Signed-off-by: Thierry Strudel <tstrudel@google.com>
Bug: 29359497
Change-Id: I0395c86e01cd49c348fda8f93087d26f88557c91
2016-06-21 12:58:10 -07:00
Willy Tarreau
fa6d0ba12a pipe: limit the per-user amount of pages allocated in pipes
commit 759c01142a5d0f364a462346168a56de28a80f52 upstream.

On no-so-small systems, it is possible for a single process to cause an
OOM condition by filling large pipes with data that are never read. A
typical process filling 4000 pipes with 1 MB of data will use 4 GB of
memory. On small systems it may be tricky to set the pipe max size to
prevent this from happening.

This patch makes it possible to enforce a per-user soft limit above
which new pipes will be limited to a single page, effectively limiting
them to 4 kB each, as well as a hard limit above which no new pipes may
be created for this user. This has the effect of protecting the system
against memory abuse without hurting other users, and still allowing
pipes to work correctly though with less data at once.

The limit are controlled by two new sysctls : pipe-user-pages-soft, and
pipe-user-pages-hard. Both may be disabled by setting them to zero. The
default soft limit allows the default number of FDs per process (1024)
to create pipes of the default size (64kB), thus reaching a limit of 64MB
before starting to create only smaller pipes. With 256 processes limited
to 1024 FDs each, this results in 1024*64kB + (256*1024 - 1024) * 4kB =
1084 MB of memory allocated for a user. The hard limit is disabled by
default to avoid breaking existing applications that make intensive use
of pipes (eg: for splicing).

Reported-by: socketpair@gmail.com
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Mitigates: CVE-2013-4312 (Linux 2.0+)
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Moritz Muehlenhoff <moritz@wikimedia.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-06-07 18:14:35 -07:00
Oleg Nesterov
0eea2e24fc wait/ptrace: assume __WALL if the child is traced
commit bf959931ddb88c4e4366e96dd22e68fa0db9527c upstream.

The following program (simplified version of generated by syzkaller)

	#include <pthread.h>
	#include <unistd.h>
	#include <sys/ptrace.h>
	#include <stdio.h>
	#include <signal.h>

	void *thread_func(void *arg)
	{
		ptrace(PTRACE_TRACEME, 0,0,0);
		return 0;
	}

	int main(void)
	{
		pthread_t thread;

		if (fork())
			return 0;

		while (getppid() != 1)
			;

		pthread_create(&thread, NULL, thread_func, NULL);
		pthread_join(thread, NULL);
		return 0;
	}

creates an unreapable zombie if /sbin/init doesn't use __WALL.

This is not a kernel bug, at least in a sense that everything works as
expected: debugger should reap a traced sub-thread before it can reap the
leader, but without __WALL/__WCLONE do_wait() ignores sub-threads.

Unfortunately, it seems that /sbin/init in most (all?) distributions
doesn't use it and we have to change the kernel to avoid the problem.
Note also that most init's use sys_waitid() which doesn't allow __WALL, so
the necessary user-space fix is not that trivial.

This patch just adds the "ptrace" check into eligible_child().  To some
degree this matches the "tsk->ptrace" in exit_notify(), ->exit_signal is
mostly ignored when the tracee reports to debugger.  Or WSTOPPED, the
tracer doesn't need to set this flag to wait for the stopped tracee.

This obviously means the user-visible change: __WCLONE and __WALL no
longer have any meaning for debugger.  And I can only hope that this won't
break something, but at least strace/gdb won't suffer.

We could make a more conservative change.  Say, we can take __WCLONE into
account, or !thread_group_leader().  But it would be nice to not
complicate these historical/confusing checks.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Jan Kratochvil <jan.kratochvil@redhat.com>
Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
Cc: Pedro Alves <palves@redhat.com>
Cc: Roland McGrath <roland@hack.frob.com>
Cc: <syzkaller@googlegroups.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-06-07 18:14:35 -07:00
Vik Heyndrickx
1df73f1884 sched/loadavg: Fix loadavg artifacts on fully idle and on fully loaded systems
commit 20878232c52329f92423d27a60e48b6a6389e0dd upstream.

Systems show a minimal load average of 0.00, 0.01, 0.05 even when they
have no load at all.

Uptime and /proc/loadavg on all systems with kernels released during the
last five years up until kernel version 4.6-rc5, show a 5- and 15-minute
minimum loadavg of 0.01 and 0.05 respectively. This should be 0.00 on
idle systems, but the way the kernel calculates this value prevents it
from getting lower than the mentioned values.

Likewise but not as obviously noticeable, a fully loaded system with no
processes waiting, shows a maximum 1/5/15 loadavg of 1.00, 0.99, 0.95
(multiplied by number of cores).

Once the (old) load becomes 93 or higher, it mathematically can never
get lower than 93, even when the active (load) remains 0 forever.
This results in the strange 0.00, 0.01, 0.05 uptime values on idle
systems.  Note: 93/2048 = 0.0454..., which rounds up to 0.05.

It is not correct to add a 0.5 rounding (=1024/2048) here, since the
result from this function is fed back into the next iteration again,
so the result of that +0.5 rounding value then gets multiplied by
(2048-2037), and then rounded again, so there is a virtual "ghost"
load created, next to the old and active load terms.

By changing the way the internally kept value is rounded, that internal
value equivalent now can reach 0.00 on idle, and 1.00 on full load. Upon
increasing load, the internally kept load value is rounded up, when the
load is decreasing, the load value is rounded down.

The modified code was tested on nohz=off and nohz kernels. It was tested
on vanilla kernel 4.6-rc5 and on centos 7.1 kernel 3.10.0-327. It was
tested on single, dual, and octal cores system. It was tested on virtual
hosts and bare hardware. No unwanted effects have been observed, and the
problems that the patch intended to fix were indeed gone.

Tested-by: Damien Wyart <damien.wyart@free.fr>
Signed-off-by: Vik Heyndrickx <vik.heyndrickx@veribox.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Doug Smythies <dsmythies@telus.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 0f004f5a69 ("sched: Cure more NO_HZ load average woes")
Link: http://lkml.kernel.org/r/e8d32bff-d544-7748-72b5-3c86cc71f09f@veribox.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-06-01 12:15:49 -07:00
Steven Rostedt (Red Hat)
f199023137 ring-buffer: Prevent overflow of size in ring_buffer_resize()
commit 59643d1535eb220668692a5359de22545af579f6 upstream.

If the size passed to ring_buffer_resize() is greater than MAX_LONG - BUF_PAGE_SIZE
then the DIV_ROUND_UP() will return zero.

Here's the details:

  # echo 18014398509481980 > /sys/kernel/debug/tracing/buffer_size_kb

tracing_entries_write() processes this and converts kb to bytes.

 18014398509481980 << 10 = 18446744073709547520

and this is passed to ring_buffer_resize() as unsigned long size.

 size = DIV_ROUND_UP(size, BUF_PAGE_SIZE);

Where DIV_ROUND_UP(a, b) is (a + b - 1)/b

BUF_PAGE_SIZE is 4080 and here

 18446744073709547520 + 4080 - 1 = 18446744073709551599

where 18446744073709551599 is still smaller than 2^64

 2^64 - 18446744073709551599 = 17

But now 18446744073709551599 / 4080 = 4521260802379792

and size = size * 4080 = 18446744073709551360

This is checked to make sure its still greater than 2 * 4080,
which it is.

Then we convert to the number of buffer pages needed.

 nr_page = DIV_ROUND_UP(size, BUF_PAGE_SIZE)

but this time size is 18446744073709551360 and

 2^64 - (18446744073709551360 + 4080 - 1) = -3823

Thus it overflows and the resulting number is less than 4080, which makes

  3823 / 4080 = 0

an nr_pages is set to this. As we already checked against the minimum that
nr_pages may be, this causes the logic to fail as well, and we crash the
kernel.

There's no reason to have the two DIV_ROUND_UP() (that's just result of
historical code changes), clean up the code and fix this bug.

Fixes: 83f40318da ("ring-buffer: Make removal of ring buffer pages atomic")
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-06-01 12:15:49 -07:00
Steven Rostedt (Red Hat)
dfb71aefc9 ring-buffer: Use long for nr_pages to avoid overflow failures
commit 9b94a8fba501f38368aef6ac1b30e7335252a220 upstream.

The size variable to change the ring buffer in ftrace is a long. The
nr_pages used to update the ring buffer based on the size is int. On 64 bit
machines this can cause an overflow problem.

For example, the following will cause the ring buffer to crash:

 # cd /sys/kernel/debug/tracing
 # echo 10 > buffer_size_kb
 # echo 8556384240 > buffer_size_kb

Then you get the warning of:

 WARNING: CPU: 1 PID: 318 at kernel/trace/ring_buffer.c:1527 rb_update_pages+0x22f/0x260

Which is:

  RB_WARN_ON(cpu_buffer, nr_removed);

Note each ring buffer page holds 4080 bytes.

This is because:

 1) 10 causes the ring buffer to have 3 pages.
    (10kb requires 3 * 4080 pages to hold)

 2) (2^31 / 2^10  + 1) * 4080 = 8556384240
    The value written into buffer_size_kb is shifted by 10 and then passed
    to ring_buffer_resize(). 8556384240 * 2^10 = 8761737461760

 3) The size passed to ring_buffer_resize() is then divided by BUF_PAGE_SIZE
    which is 4080. 8761737461760 / 4080 = 2147484672

 4) nr_pages is subtracted from the current nr_pages (3) and we get:
    2147484669. This value is saved in a signed integer nr_pages_to_update

 5) 2147484669 is greater than 2^31 but smaller than 2^32, a signed int
    turns into the value of -2147482627

 6) As the value is a negative number, in update_pages_handler() it is
    negated and passed to rb_remove_pages() and 2147482627 pages will
    be removed, which is much larger than 3 and it causes the warning
    because not all the pages asked to be removed were removed.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=118001

Fixes: 7a8e76a382 ("tracing: unified trace buffer")
Reported-by: Hao Qin <QEver.cn@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-06-01 12:15:49 -07:00