Commit graph

1423 commits

Author SHA1 Message Date
Linus Torvalds
50d3f7d55a UPSTREAM: Clarify naming of thread info/stack allocators
We've had the thread info allocated together with the thread stack for
most architectures for a long time (since the thread_info was split off
from the task struct), but that is about to change.

But the patches that move the thread info to be off-stack (and a part of
the task struct instead) made it clear how confused the allocator and
freeing functions are.

Because the common case was that we share an allocation with the thread
stack and the thread_info, the two pointers were identical.  That
identity then meant that we would have things like

	ti = alloc_thread_info_node(tsk, node);
	...
	tsk->stack = ti;

which certainly _worked_ (since stack and thread_info have the same
value), but is rather confusing: why are we assigning a thread_info to
the stack? And if we move the thread_info away, the "confusing" code
just gets to be entirely bogus.

So remove all this confusion, and make it clear that we are doing the
stack allocation by renaming and clarifying the function names to be
about the stack.  The fact that the thread_info then shares the
allocation is an implementation detail, and not really about the
allocation itself.

This is a pure renaming and type fix: we pass in the same pointer, it's
just that we clarify what the pointer means.

The ia64 code that actually only has one single allocation (for all of
task_struct, thread_info and kernel thread stack) now looks a bit odd,
but since "tsk->stack" is actually not even used there, that oddity
doesn't matter.  It would be a separate thing to clean that up, I
intentionally left the ia64 changes as a pure brute-force renaming and
type change.

Acked-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Bug: 38331309
Change-Id: I870b5476fc900c9145134f9dd3ed18a32a490162
(cherry picked from commit b235beea9e996a4d36fed6cfef4801a3e7d7a9a5)
Signed-off-by: Zubin Mithra <zsm@google.com>
2017-08-09 15:23:22 +01:00
Dietmar Eggemann
4b85765a3d sched/fair: Add eas (& cas) specific rq, sd and task stats
The statistic counter are placed in the eas (& cas) wakeup path. Each
of them has one representation for the runqueue (rq), the sched_domain
(sd) and the task.
A task counter is always incremented. A rq counter is always
incremented for the rq the scheduler is currently running on. A sd
counter is only incremented if a relation to a sd exists.

The counters are exposed:

(1) In /proc/schedstat for rq's and sd's:

$ cat /proc/schedstat
...
cpu0 71422 0 2321254 ...
eas  44144 0 0 19446 0 24698 568435 51621 156932 133 222011 17459 120279 516814 83 0 156962 359235 176439 139981
  <- runqueue for cpu0
...
domain0 3 42430 42331 ...
eas 0 0 0 14200 0 0 0 0 0 0 0 0 0 0 0 0 0 0 66355 0  <- MC sched domain for cpu0
...

The per-cpu eas vector has the following elements:

sis_attempts  sis_idle   sis_cache_affine sis_suff_cap    sis_idle_cpu    sis_count               ||
secb_attempts secb_sync  secb_idle_bt     secb_insuff_cap secb_no_nrg_sav secb_nrg_sav secb_count ||
fbt_attempts  fbt_no_cpu fbt_no_sd        fbt_pref_idle   fbt_count                               ||
cas_attempts  cas_count

The following relations exist between these counters (from cpu0 eas
vector above):

sis_attempts = sis_idle + sis_cache_affine + sis_suff_cap + sis_idle_cpu + sis_count

44144        = 0        + 0                + 19446        + 0            + 24698

secb_attempts = secb_sync + secb_idle_bt + secb_insuff_cap + secb_no_nrg_sav + secb_nrg_sav + secb_count

568435        = 51621     + 156932       + 133             + 222011          + 17459        + 120279

fbt_attempts = fbt_no_cpu + fbt_no_sd + fbt_pref_idle + fbt_count + (return -1)

516814       = 83         + 0         + 156962        + 359235    + (534)

cas_attempts = cas_count + (return -1 or smp_processor_id())

176439       = 139981    + (36458)

(2) In /proc/$PROCESS_PID/task/$TASK_PID/sched for a task.

example: main thread of system_server

$ cat /proc/1083/task/1083/sched

...
se.statistics.nr_wakeups_sis_attempts        :                  945
se.statistics.nr_wakeups_sis_idle            :                    0
se.statistics.nr_wakeups_sis_cache_affine    :                    0
se.statistics.nr_wakeups_sis_suff_cap        :                  219
se.statistics.nr_wakeups_sis_idle_cpu        :                    0
se.statistics.nr_wakeups_sis_count           :                  726
se.statistics.nr_wakeups_secb_attempts       :                10376
se.statistics.nr_wakeups_secb_sync           :                 1462
se.statistics.nr_wakeups_secb_idle_bt        :                 6984
se.statistics.nr_wakeups_secb_insuff_cap     :                    3
se.statistics.nr_wakeups_secb_no_nrg_sav     :                  927
se.statistics.nr_wakeups_secb_nrg_sav        :                  206
se.statistics.nr_wakeups_secb_count          :                  794
se.statistics.nr_wakeups_fbt_attempts        :                 8914
se.statistics.nr_wakeups_fbt_no_cpu          :                    0
se.statistics.nr_wakeups_fbt_no_sd           :                    0
se.statistics.nr_wakeups_fbt_pref_idle       :                 6987
se.statistics.nr_wakeups_fbt_count           :                 1554
se.statistics.nr_wakeups_cas_attempts        :                 3107
se.statistics.nr_wakeups_cas_count           :                 1195
...

The same relation between the counters as in the per-cpu case apply.

Change-Id: Ie7d01267c78a3f41f60a3ef52917d5a5d463f195
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2017-06-02 08:01:54 -07:00
Morten Rasmussen
68a3b157d9 UPSTREAM: sched/core: Introduce SD_ASYM_CPUCAPACITY sched_domain topology flag
Add a topology flag to the sched_domain hierarchy indicating the lowest
domain level where the full range of CPU capacities is represented by
the domain members for asymmetric capacity topologies (e.g. ARM
big.LITTLE).

The flag is intended to indicate that extra care should be taken when
placing tasks on CPUs and this level spans all the different types of
CPUs found in the system (no need to look further up the domain
hierarchy). This information is currently only available through
iterating through the capacities of all the CPUs at parent levels in the
sched_domain hierarchy.

  SD 2      [  0      1      2      3]  SD_ASYM_CPUCAPACITY

  SD 1      [  0      1] [   2      3]  !SD_ASYM_CPUCAPACITY

  CPU:         0      1      2      3
  capacity:  756    756   1024   1024

If the topology in the example above is duplicated to create an eight
CPU example with third sched_domain level on top (SD 3), this level
should not have the flag set (!SD_ASYM_CPUCAPACITY) as its two group
would both have all CPU capacities represented within them.

Change-Id: I1526407b90567cac387419719b7d7fdc8b259a85
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: freedom.tan@mediatek.com
Cc: keita.kobayashi.ym@renesas.com
Cc: mgalbraith@suse.de
Cc: sgurrappadi@nvidia.com
Cc: vincent.guittot@linaro.org
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1469453670-2660-6-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 1f6e6c7cb9bcd58abb5ee11243e0eefe6b36fc8e)
[trivial merge conflict]
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2017-06-02 08:01:52 -07:00
Morten Rasmussen
986aa1d498 UPSTREAM: sched/core: Fix power to capacity renaming in comment
It is seems that this one escaped Nico's renaming of cpu_power to
cpu_capacity a while back.

Change-Id: Ic2569d714db7b740f1df4ccc381ba8c1772c2793
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: linux-kernel@vger.kernel.org
Cc: mgalbraith@suse.de
Cc: vincent.guittot@linaro.org
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1466615004-3503-2-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit bd425d4bfc7a1a6064dbbadfbac9c7eec0e426ec)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2017-06-02 08:01:51 -07:00
Steve Muckle
e5da6c11b2 sched: cpufreq: use rt_avg as estimate of required RT CPU capacity
A policy of going to fmax on any RT activity will be detrimental
for power on many platforms. Often RT accounts for only a small amount
of CPU activity so sending the CPU frequency to fmax is overkill. Worse
still, some platforms may not be able to even complete the CPU frequency
change before the RT activity has already completed.

Cpufreq governors have not treated RT activity this way in the past so
it is not part of the expected semantics of the RT scheduling class. The
DL class offers guarantees about task completion and could be used for
this purpose.

Modify the schedutil algorithm to instead use rt_avg as an estimate of
RT utilization of the CPU.

Based on previous work by Vincent Guittot <vincent.guittot@linaro.org>.

Change-Id: I1ed605a3e2512a94d34217a8e57c3fd97cca60be
Signed-off-by: Steve Muckle <smuckle@linaro.org>
2017-06-02 08:01:51 -07:00
Steve Muckle
f02702dcf2 sched: backport cpufreq hooks from 4.9-rc4
The scheduler cpufreq hooks are required by the schedutil cpufreq
governor.

Change-Id: Ied6c46262bb33b7e81bbb3d3d2761124e0c676b7
Signed-off-by: Steve Muckle <smuckle@linaro.org>
[trivial cherry-picking fixes]
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2017-06-02 08:01:50 -07:00
Greg Kroah-Hartman
29fa724a09 This is the 4.4.63 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlj5tRkACgkQONu9yGCS
 aT5zFxAAouq2kxBFxxJIQ3255yy/7B6oBYrhilQZPrETC800PUaIqZtuQZPpaoqb
 3gG0+12ve0CMHK+PidEwsQlMlAHNI1xbzmUHm2UIrLYYCV817DTkEsc7JXGUvYVA
 /YA71GASKmLVi9DnsawRb0ELhTeQHec76LrPlgvyWH/OMEtNcMOv/8oWfTq9bKV2
 HsHC6MOwT2R86ukhYYmcfFHomTnJSpW7KtGXwNC/LhohzIfsKQKGQWb1f1j1aHGC
 u5yQ5Qc9T+DhPMHAEY+xuURz/3ohpUL8aSQXk7pua/bTD0X0klNQcf/BXVJXsaeI
 s4g78q+YdTcPL81rkEW+7yUvAlb3u+FdVr+wjsl/s6ih4iL0EgBsoClqUjGUUoz+
 jvCXHiMP7lHi50eIkppQf/yZSVKSobKn5YYf9AA+y6tQ9R9GguDS/IQSRe2HnHeR
 OymCBXa6BSmQGGyPiMUBiNTix6roJ8Vr4dK9lbsQXZ+YZICXWs1rpMOy5HK9EJWf
 M6YF6l9lHwQ38AN+MhsjUXIyKLp9zCk7syeFaeK6k/IA2kcm7dL/momiZ1QIBnhq
 OHB3iwEPZ5Rr4CVjk5j7Ue22ubdrtpc8IfTYV95N7nv+g3nBwe22k+RDi70NiDwk
 2pnBqhO/vtPRE9Ry3QBS73VEeXgNb9IIVwQ7hi9Rk7KUgmdEOOo=
 =iS0x
 -----END PGP SIGNATURE-----

Merge 4.4.63 into android-4.4

Changes in 4.4.63:
	cgroup, kthread: close race window where new kthreads can be migrated to non-root cgroups
	thp: fix MADV_DONTNEED vs clear soft dirty race
	drm/nouveau/mpeg: mthd returns true on success now
	drm/nouveau/mmu/nv4a: use nv04 mmu rather than the nv44 one
	CIFS: store results of cifs_reopen_file to avoid infinite wait
	Input: xpad - add support for Razer Wildcat gamepad
	perf/x86: Avoid exposing wrong/stale data in intel_pmu_lbr_read_32()
	x86/vdso: Ensure vdso32_enabled gets set to valid values only
	x86/vdso: Plug race between mapping and ELF header setup
	acpi, nfit, libnvdimm: fix interleave set cookie calculation (64-bit comparison)
	iscsi-target: Fix TMR reference leak during session shutdown
	iscsi-target: Drop work-around for legacy GlobalSAN initiator
	scsi: sr: Sanity check returned mode data
	scsi: sd: Consider max_xfer_blocks if opt_xfer_blocks is unusable
	scsi: sd: Fix capacity calculation with 32-bit sector_t
	xen, fbfront: fix connecting to backend
	libnvdimm: fix reconfig_mutex, mmap_sem, and jbd2_handle lockdep splat
	irqchip/irq-imx-gpcv2: Fix spinlock initialization
	ftrace: Fix removing of second function probe
	char: Drop bogus dependency of DEVPORT on !M68K
	char: lack of bool string made CONFIG_DEVPORT always on
	Revert "MIPS: Lantiq: Fix cascaded IRQ setup"
	kvm: fix page struct leak in handle_vmon
	zram: do not use copy_page with non-page aligned address
	powerpc: Disable HFSCR[TM] if TM is not supported
	crypto: ahash - Fix EINPROGRESS notification callback
	ath9k: fix NULL pointer dereference
	dvb-usb-v2: avoid use-after-free
	ext4: fix inode checksum calculation problem if i_extra_size is small
	platform/x86: acer-wmi: setup accelerometer when machine has appropriate notify event
	rtc: tegra: Implement clock handling
	mm: Tighten x86 /dev/mem with zeroing reads
	dvb-usb: don't use stack for firmware load
	dvb-usb-firmware: don't do DMA on stack
	virtio-console: avoid DMA from stack
	pegasus: Use heap buffers for all register access
	rtl8150: Use heap buffers for all register access
	catc: Combine failure cleanup code in catc_probe()
	catc: Use heap buffer for memory size test
	ibmveth: calculate gso_segs for large packets
	SUNRPC: fix refcounting problems with auth_gss messages.
	tty/serial: atmel: RS485 half duplex w/DMA: enable RX after TX is done
	net: ipv6: check route protocol when deleting routes
	sctp: deny peeloff operation on asocs with threads sleeping on it
	MIPS: fix Select HAVE_IRQ_EXIT_ON_IRQ_STACK patch.
	Linux 4.4.63

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2017-04-21 09:47:01 +02:00
Tejun Heo
3144d81a77 cgroup, kthread: close race window where new kthreads can be migrated to non-root cgroups
commit 77f88796cee819b9c4562b0b6b44691b3b7755b1 upstream.

Creation of a kthread goes through a couple interlocked stages between
the kthread itself and its creator.  Once the new kthread starts
running, it initializes itself and wakes up the creator.  The creator
then can further configure the kthread and then let it start doing its
job by waking it up.

In this configuration-by-creator stage, the creator is the only one
that can wake it up but the kthread is visible to userland.  When
altering the kthread's attributes from userland is allowed, this is
fine; however, for cases where CPU affinity is critical,
kthread_bind() is used to first disable affinity changes from userland
and then set the affinity.  This also prevents the kthread from being
migrated into non-root cgroups as that can affect the CPU affinity and
many other things.

Unfortunately, the cgroup side of protection is racy.  While the
PF_NO_SETAFFINITY flag prevents further migrations, userland can win
the race before the creator sets the flag with kthread_bind() and put
the kthread in a non-root cgroup, which can lead to all sorts of
problems including incorrect CPU affinity and starvation.

This bug got triggered by userland which periodically tries to migrate
all processes in the root cpuset cgroup to a non-root one.  Per-cpu
workqueue workers got caught while being created and ended up with
incorrected CPU affinity breaking concurrency management and sometimes
stalling workqueue execution.

This patch adds task->no_cgroup_migration which disallows the task to
be migrated by userland.  kthreadd starts with the flag set making
every child kthread start in the root cgroup with migration
disallowed.  The flag is cleared after the kthread finishes
initialization by which time PF_NO_SETAFFINITY is set if the kthread
should stay in the root cgroup.

It'd be better to wait for the initialization instead of failing but I
couldn't think of a way of implementing that without adding either a
new PF flag, or sleeping and retrying from waiting side.  Even if
userland depends on changing cgroup membership of a kthread, it either
has to be synchronized with kthread_create() or periodically repeat,
so it's unlikely that this would break anything.

v2: Switch to a simpler implementation using a new task_struct bit
    field suggested by Oleg.

Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reported-and-debugged-by: Chris Mason <clm@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-04-21 09:30:04 +02:00
Jin Qian
a4f5f251e9 ANDROID: sched: add a counter to track fsync
Change-Id: I6c138de5b2332eea70f57e098134d1d141247b3f
Signed-off-by: Jin Qian <jinqian@google.com>
2017-03-14 13:07:19 -07:00
Dmitry Shmidt
712517177d This is the 4.4.40 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlhvboUACgkQONu9yGCS
 aT6eKxAAptTMEtfLi+wtyTgwAW/0bIMJ/jj57d06Q8jZa80MsSoYBysoZfDrTZjh
 i2/vXh8dqR6AcZQeURt6OJVRbSIw5H7qPQSVfmrQbCocrGwk6eu5Yrv85J1SuB5c
 Ad2NaYZ+H+UYdwKH0xzjWXgcgnZ9PSChXE85hLTGtM9J64AeZZ0aQDNQXAdpphLg
 UUDpoglD5+oeHywCLQ/H68rjSytmLLjmLTeRgK5hIIMefnsNx/eo3O7cZZmdPC3r
 81E+dvPhch42a/rLdRkb/1msNXNRptUQkYQ8+anFn9fvNn8M8ZfPmu9GhUF001Jl
 R3V2Q/I2hxGoMyrxpHYVx8puyYqzTMuVpanScmaJbq7TA33LV8+VAyENLIdNHWTf
 6ZI2MBxwLt+hkqJPSfdMKCyXB4DGVqRmiy5LqRbb2/Dp8xHlVTYLYO+rq3mXveuF
 8PeRm3MhhYfGQXcq9sbUlv0hNe2SxBJm7j8QPj6uHJN/EOWtAtmg8Y+b+V3DKYNp
 2+Zoz4qir8S5CkofMyYJRPrjn5clc0iwcfnLo57VzxsPO3Y3nNxjAtnGPjyndt3f
 8LKjrCUzncQVCLRCaS9xnMKC1qwKlY3MGi40qjHepFfjBKzgfZKotE7pdeU7v8Vk
 JLIgvHzo1AguCRvoPgP6sqPZ0ilFrJz7u7eK4sQh1pqBVRoiANI=
 =p6OP
 -----END PGP SIGNATURE-----

Merge tag 'v4.4.40' into android-4.4.y

This is the 4.4.40 stable release
2017-01-09 10:12:25 -08:00
Eric W. Biederman
1c1f15f8eb ptrace: Capture the ptracer's creds not PT_PTRACE_CAP
commit 64b875f7ac8a5d60a4e191479299e931ee949b67 upstream.

When the flag PT_PTRACE_CAP was added the PTRACE_TRACEME path was
overlooked.  This can result in incorrect behavior when an application
like strace traces an exec of a setuid executable.

Further PT_PTRACE_CAP does not have enough information for making good
security decisions as it does not report which user namespace the
capability is in.  This has already allowed one mistake through
insufficient granulariy.

I found this issue when I was testing another corner case of exec and
discovered that I could not get strace to set PT_PTRACE_CAP even when
running strace as root with a full set of caps.

This change fixes the above issue with strace allowing stracing as
root a setuid executable without disabling setuid.  More fundamentaly
this change allows what is allowable at all times, by using the correct
information in it's decision.

Fixes: 4214e42f96d4 ("v2.4.9.11 -> v2.4.9.12")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-06 11:16:11 +01:00
Dmitry Shmidt
441e10ac4c Merge remote-tracking branch 'common/android-4.4' into android-4.4.y 2016-09-13 14:47:50 -07:00
Dmitry Shmidt
341e02d8bb Merge remote-tracking branch 'linaro-ext/EAS/v4.4-easv5.2+aosp-changes' into android-4.4
Change-Id: Ic24b43ee867bc4f70b31bedaad734717b64b86a1
2016-09-08 17:07:42 -07:00
Srivatsa Vaddagiri
efb86bd08a sched: Introduce Window Assisted Load Tracking (WALT)
use a window based view of time in order to track task
demand and CPU utilization in the scheduler.

Window Assisted Load Tracking (WALT) implementation credits:
 Srivatsa Vaddagiri, Steve Muckle, Syed Rameez Mustafa, Joonwoo Park,
 Pavan Kumar Kondeti, Olav Haugan

2016-03-06: Integration with EAS/refactoring by Vikram Mulukutla
            and Todd Kjos

Change-Id: I21408236836625d4e7d7de1843d20ed5ff36c708

Includes fixes for issues:

eas/walt: Use walt_ktime_clock() instead of ktime_get_ns() to avoid a
race resulting in watchdog resets
BUG: 29353986
Change-Id: Ic1820e22a136f7c7ebd6f42e15f14d470f6bbbdb

Handle walt accounting anomoly during resume

During resume, there is a corner case where on wakeup, a task's
prev_runnable_sum can go negative. This is a workaround that
fixes the condition and warns (instead of crashing).

BUG: 29464099
Change-Id: I173e7874324b31a3584435530281708145773508

Signed-off-by: Todd Kjos <tkjos@google.com>
Signed-off-by: Srinath Sridharan <srinathsr@google.com>
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
[jstultz: fwdported to 4.4]
Signed-off-by: John Stultz <john.stultz@linaro.org>
2016-08-11 14:26:43 -07:00
Joseph Lo
3a400abdc5 CHROMIUM: sched: update the average of nr_running
Doing a Exponential moving average per nr_running++/-- does not
guarantee a fixed sample rate which induces errors if there are lots of
threads being enqueued/dequeued from the rq (Linpack mt). Instead of
keeping track of the avg, the scheduler now keeps track of the integral
of nr_running and allows the readers to perform filtering on top.

Original-author: Sai Charan Gurrappadi <sgurrappadi@nvidia.com>

Change-Id: Id946654f32fa8be0eaf9d8fa7c9a8039b5ef9fab
Signed-off-by: Joseph Lo <josephl@nvidia.com>
Signed-off-by: Andrew Bresticker <abrestic@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/174694
Reviewed-on: https://chromium-review.googlesource.com/272853
[jstultz: fwdported to 4.4]
Signed-off-by: John Stultz <john.stultz@linaro.org>
2016-08-10 15:01:22 -07:00
Dmitry Shmidt
b558f17a13 This is the 4.4.16 stable release
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJXmOXmAAoJEDjbvchgkmk+QYIP/1S8oBZsvjfDzvH8t63HyLeH
 i43MFlYoFAqUIZc002XpluSvZ8uHoG+r7R8Hq3wmv48wxe3M6OBnMdBVTht6mPw+
 t5OLTZr40lWaJm2EIi4aekueMIrCgmL+Et+IFYv7ZVBuYLteVcfny+zdq4EqGmgj
 /a19+L/sTTr4SHtJIhHxWhiVJ9fVMgQk/N3VgQmIiNF2+lVbiFI7QQiDPLbFl0KK
 CM4ETO22HxHCYilGpzhpSMsHCxv12VqNaXNLAsPAepGGW7PqvUmrEWAqgwsbOfRc
 GxTLNk0dUgJqMrfEpQ8ZOMlgzvCAYG2jZuNSuT+nuzrWSUP+WOGRi9TTTxp1CYuZ
 PHlhNTH7ZnqosxJUUZS2d9N5ygpqD48Rhlfl824YzOWCy94VeUnedkVLb20uJwPF
 Y5aQ5WjktBC9why5e4OgGQERvx/U9KTk8E1zRfZZPc2oft9My0YxuemjjKAKZiYN
 ne4WhXbgOJTQkAoZwh2xqny3bWyEaoSrWpQ3R7bBJ9SIRLEOdCKzKpduDbAnbMP7
 QWgQOQC/6qA1mKqjrqF4KPA1Quo9PcUK2Ajh523ewMGCowgY90vyejAgh4Q8g0GC
 fKlx+jJDoKVDbQ8v4hc9PPHMsNNIKT9a1ptwVS3lE+bq1D5Ffm57A4/uOTMYHVab
 gKqu8h1CA0MCVBsH3nNA
 =nY8S
 -----END PGP SIGNATURE-----

Merge tag 'v4.4.16' into android-4.4.y

This is the 4.4.16 stable release

Change-Id: Ibaf7b7e03695e1acebc654a2ca1a4bfcc48fcea4
2016-08-01 15:57:55 -07:00
John Stultz
286d25ba46 BACKPORT: timer: convert timer_slack_ns from unsigned long to u64
This backports da8b44d5a9f8bf26da637b7336508ca534d6b319 from upstream.

This patchset introduces a /proc/<pid>/timerslack_ns interface which
would allow controlling processes to be able to set the timerslack value
on other processes in order to save power by avoiding wakeups (Something
Android currently does via out-of-tree patches).

The first patch tries to fix the internal timer_slack_ns usage which was
defined as a long, which limits the slack range to ~4 seconds on 32bit
systems.  It converts it to a u64, which provides the same basically
unlimited slack (500 years) on both 32bit and 64bit machines.

The second patch introduces the /proc/<pid>/timerslack_ns interface
which allows the full 64bit slack range for a task to be read or set on
both 32bit and 64bit machines.

With these two patches, on a 32bit machine, after setting the slack on
bash to 10 seconds:

$ time sleep 1

real    0m10.747s
user    0m0.001s
sys     0m0.005s

The first patch is a little ugly, since I had to chase the slack delta
arguments through a number of functions converting them to u64s.  Let me
know if it makes sense to break that up more or not.

Other than that things are fairly straightforward.

This patch (of 2):

The timer_slack_ns value in the task struct is currently a unsigned
long.  This means that on 32bit applications, the maximum slack is just
over 4 seconds.  However, on 64bit machines, its much much larger (~500
years).

This disparity could make application development a little (as well as
the default_slack) to a u64.  This means both 32bit and 64bit systems
have the same effective internal slack range.

Now the existing ABI via PR_GET_TIMERSLACK and PR_SET_TIMERSLACK specify
the interface as a unsigned long, so we preserve that limitation on
32bit systems, where SET_TIMERSLACK can only set the slack to a unsigned
long value, and GET_TIMERSLACK will return ULONG_MAX if the slack is
actually larger then what can be stored by an unsigned long.

This patch also modifies hrtimer functions which specified the slack
delta as a unsigned long.

Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Oren Laadan <orenl@cellrox.com>
Cc: Ruchi Kandoi <kandoiruchi@google.com>
Cc: Rom Lemarchand <romlem@android.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Android Kernel Team <kernel-team@android.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-01 18:12:06 -07:00
Willy Tarreau
fa6d0ba12a pipe: limit the per-user amount of pages allocated in pipes
commit 759c01142a5d0f364a462346168a56de28a80f52 upstream.

On no-so-small systems, it is possible for a single process to cause an
OOM condition by filling large pipes with data that are never read. A
typical process filling 4000 pipes with 1 MB of data will use 4 GB of
memory. On small systems it may be tricky to set the pipe max size to
prevent this from happening.

This patch makes it possible to enforce a per-user soft limit above
which new pipes will be limited to a single page, effectively limiting
them to 4 kB each, as well as a hard limit above which no new pipes may
be created for this user. This has the effect of protecting the system
against memory abuse without hurting other users, and still allowing
pipes to work correctly though with less data at once.

The limit are controlled by two new sysctls : pipe-user-pages-soft, and
pipe-user-pages-hard. Both may be disabled by setting them to zero. The
default soft limit allows the default number of FDs per process (1024)
to create pipes of the default size (64kB), thus reaching a limit of 64MB
before starting to create only smaller pipes. With 256 processes limited
to 1024 FDs each, this results in 1024*64kB + (256*1024 - 1024) * 4kB =
1084 MB of memory allocated for a user. The hard limit is disabled by
default to avoid breaking existing applications that make intensive use
of pipes (eg: for splicing).

Reported-by: socketpair@gmail.com
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Mitigates: CVE-2013-4312 (Linux 2.0+)
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Moritz Muehlenhoff <moritz@wikimedia.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-06-07 18:14:35 -07:00
Juri Lelli
dd2460f387 DEBUG: sched,cpufreq: add cpu_capacity change tracepoint
This is useful when we want to compare cpu utilization and
cpu curr capacity side by side.

Signed-off-by: Juri Lelli <juri.lelli@arm.com>
2016-05-10 16:54:42 +08:00
Michael Turquette
a967a45c71 sched: scheduler-driven cpu frequency selection
Scheduler-driven CPU frequency selection hopes to exploit both
per-task and global information in the scheduler to improve frequency
selection policy, achieving lower power consumption, improved
responsiveness/performance, and less reliance on heuristics and
tunables. For further discussion on the motivation of this integration
see [0].

This patch implements a shim layer between the Linux scheduler and the
cpufreq subsystem. The interface accepts capacity requests from the
CFS, RT and deadline sched classes. The requests from each sched class
are summed on each CPU with a margin applied to the CFS and RT
capacity requests to provide some headroom. Deadline requests are
expected to be precise enough given their nature to not require
headroom. The maximum total capacity request for a CPU in a frequency
domain drives the requested frequency for that domain.

Policy is determined by both the sched classes and this shim layer.

Note that this algorithm is event-driven. There is no polling loop to
check cpu idle time nor any other method which is unsynchronized with
the scheduler, aside from a throttling mechanism to ensure frequency
changes are not attempted faster than the hardware can accommodate them.

Thanks to Juri Lelli <juri.lelli@arm.com> for contributing design ideas,
code and test results, and to Ricky Liang <jcliang@chromium.org>
for initialization and static key inc/dec fixes.

[0] http://article.gmane.org/gmane.linux.kernel/1499836

[smuckle@linaro.org: various additions and fixes, revised commit text]

CC: Ricky Liang <jcliang@chromium.org>
Signed-off-by: Michael Turquette <mturquette@baylibre.com>
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Steve Muckle <smuckle@linaro.org>
2016-05-10 16:53:22 +08:00
Morten Rasmussen
f0f739d887 sched: Introduce SD_SHARE_CAP_STATES sched_domain flag
cpufreq is currently keeping it a secret which cpus are sharing
clock source. The scheduler needs to know about clock domains as well
to become more energy aware. The SD_SHARE_CAP_STATES domain flag
indicates whether cpus belonging to the sched_domain share capacity
states (P-states).

There is no connection with cpufreq (yet). The flag must be set by
the arch specific topology code.

cc: Russell King <linux@arm.linux.org.uk>
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
2016-05-10 16:49:50 +08:00
Dietmar Eggemann
0b3bda54d2 sched: Introduce energy data structures
The struct sched_group_energy represents the per sched_group related
data which is needed for energy aware scheduling. It contains:

  (1) number of elements of the idle state array
  (2) pointer to the idle state array which comprises 'power consumption'
      for each idle state
  (3) number of elements of the capacity state array
  (4) pointer to the capacity state array which comprises 'compute
      capacity and power consumption' tuples for each capacity state

The struct sched_group obtains a pointer to a struct sched_group_energy.

The function pointer sched_domain_energy_f is introduced into struct
sched_domain_topology_level which will allow the arch to pass a particular
struct sched_group_energy from the topology shim layer into the scheduler
core.

The function pointer sched_domain_energy_f has an 'int cpu' parameter
since the folding of two adjacent sd levels via sd degenerate doesn't work
for all sd levels. I.e. it is not possible for example to use this feature
to provide per-cpu energy in sd level DIE on ARM's TC2 platform.

It was discussed that the folding of sd levels approach is preferable
over the cpu parameter approach, simply because the user (the arch
specifying the sd topology table) can introduce less errors. But since
it is not working, the 'int cpu' parameter is the only way out. It's
possible to use the folding of sd levels approach for
sched_domain_flags_f and the cpu parameter approach for the
sched_domain_energy_f at the same time though. With the use of the
'int cpu' parameter, an extra check function has to be provided to make
sure that all cpus spanned by a sched group are provisioned with the same
energy data.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
2016-05-10 16:49:50 +08:00
willy tarreau
5e226f9689 unix: properly account for FDs passed over unix sockets
[ Upstream commit 712f4aad406bb1ed67f3f98d04c044191f0ff593 ]

It is possible for a process to allocate and accumulate far more FDs than
the process' limit by sending them over a unix socket then closing them
to keep the process' fd count low.

This change addresses this problem by keeping track of the number of FDs
in flight per user and preventing non-privileged processes from having
more FDs in flight than their configured FD limit.

Reported-by: socketpair@gmail.com
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Mitigates: CVE-2013-4312 (Linux 2.0+)
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-01-31 11:28:59 -08:00
Peter Zijlstra
be958bdc96 sched/core: Fix unserialized r-m-w scribbling stuff
Some of the sched bitfieds (notably sched_reset_on_fork) can be set
on other than current, this can cause the r-m-w to race with other
updates.

Since all the sched bits are serialized by scheduler locks, pull them
in a separate word.

Reported-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: akpm@linux-foundation.org
Cc: hannes@cmpxchg.org
Cc: mhocko@kernel.org
Cc: vdavydov@parallels.com
Link: http://lkml.kernel.org/r/20151125150207.GM11639@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-06 11:01:07 +01:00
Sergey Senozhatsky
570f52412a sched/core: Check tgid in is_global_init()
Our global init task can have sub-threads, so ->pid check is not reliable
enough for is_global_init(), we need to check tgid instead. This has been
spotted by Oleg and a fix was proposed by Richard a long time ago (see the
link below).

Oleg wrote:

  : Because is_global_init() is only true for the main thread of /sbin/init.
  :
  : Just look at oom_unkillable_task(). It tries to not kill init. But, say,
  : select_bad_process() can happily find a sub-thread of is_global_init()
  : and still kill it.

I recently hit the problem in question; re-sending the patch (to the
best of my knowledge it has never been submitted) with updated function
comment. Credit goes to Oleg and Richard.

Suggested-by: Richard Guy Briggs <rgb@redhat.com>
Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Eric W . Biederman <ebiederm@xmission.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Serge E . Hallyn <serge.hallyn@ubuntu.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://www.redhat.com/archives/linux-audit/2013-December/msg00086.html
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-06 11:01:06 +01:00
Linus Torvalds
264015f8a8 libnvdimm for 4.4:
1/ Add support for the ACPI 6.0 NFIT hot add mechanism to process
    updates of the NFIT at runtime.
 
 2/ Teach the coredump implementation how to filter out DAX mappings.
 
 3/ Introduce NUMA hints for allocations made by the pmem driver, and as
    a side effect all devm allocations now hint their NUMA node by
    default.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJWQX2sAAoJEB7SkWpmfYgCWsEQAK7w/xM9zClVY/DDlFJxFtYq
 DZJ4faPj+E3FMTiJIEDzjtRgQvOFE+wmJtntYsCqKH/QZmpnyk9jeT/CbJzEEL2k
 WsAk+qHGLcVUlSb36blwN1RFzYqC+IDYThewJqUvxDbOwL1AbiibbX7gplzZHLhW
 +rj3ScVlSNOPRDgGGpkAeLNNsttuKvsGo7nB/JZopm0tV6g14rSK09wQbVhv6S6T
 Lu7VGYqnJlkteL9YlzRiROf9hW2ZFCMGJz1YZydPTy3aX3hGTBX4w2qvmsPwBIKP
 kW/gCNisVJGk1cZCk4joSJ8i/b3x3fE0zdZ5waivJ5jDvYbUUfyk0KtJkfw207Rl
 14yWitUC6aeVuCeOqXHgsjRi+1QVN9Pg7i49xgGiUN1igQiUYRTgQPWZxDv6Zo/s
 USrLFQBaRd+hJw+dl7A47lJ3mUF96tPCoQb4LCQ7DVsg5U4J2TvqXLH9Gek/CCZ4
 QsMkZDTQlZw4+JEDlzBgg/L7xVty8DadplTADMdjaRhFU3y8zKNJ85Ileokt7KVt
 IsBT4+S5HeZLvinZY95932DwAmFp1DtsyENd1BUXL06ddyvlQrFJ6NQaXji4fuDc
 EVQmMoTAqDujZFupMAux9vkUBDFj/hmaVD5F7j3+MWP87OCritw/IZn+2LgTaKoX
 EmttaYrDr2jJwIaGyw+H
 =a2/L
 -----END PGP SIGNATURE-----

Merge tag 'libnvdimm-for-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull libnvdimm updates from Dan Williams:
 "Outside of the new ACPI-NFIT hot-add support this pull request is more
  notable for what it does not contain, than what it does.  There were a
  handful of development topics this cycle, dax get_user_pages, dax
  fsync, and raw block dax, that need more more iteration and will wait
  for 4.5.

  The patches to make devm and the pmem driver NUMA aware have been in
  -next for several weeks.  The hot-add support has not, but is
  contained to the NFIT driver and is passing unit tests.  The coredump
  support is straightforward and was looked over by Jeff.  All of it has
  received a 0day build success notification across 107 configs.

  Summary:

   - Add support for the ACPI 6.0 NFIT hot add mechanism to process
     updates of the NFIT at runtime.

   - Teach the coredump implementation how to filter out DAX mappings.

   - Introduce NUMA hints for allocations made by the pmem driver, and
     as a side effect all devm allocations now hint their NUMA node by
     default"

* tag 'libnvdimm-for-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  coredump: add DAX filtering for FDPIC ELF coredumps
  coredump: add DAX filtering for ELF coredumps
  acpi: nfit: Add support for hot-add
  nfit: in acpi_nfit_init, break on a 0-length table
  pmem, memremap: convert to numa aware allocations
  devm_memremap_pages: use numa_mem_id
  devm: make allocations numa aware by default
  devm_memremap: convert to return ERR_PTR
  devm_memunmap: use devres_release()
  pmem: kill memremap_pmem()
  x86, mm: quiet arch_add_memory()
2015-11-10 12:07:22 -08:00
Ross Zwisler
5037835c1f coredump: add DAX filtering for ELF coredumps
Add two new flags to the existing coredump mechanism for ELF files to
allow us to explicitly filter DAX mappings.  This is desirable because
DAX mappings, like hugetlb mappings, have the potential to be very
large.

Update the coredump_filter documentation in
Documentation/filesystems/proc.txt so that it addresses the new DAX
coredump flags.  Also update the documented default value of
coredump_filter to be consistent with the core(5) man page.  The
documentation being updated talks about bit 4, Dump ELF headers, which
is enabled if CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS is turned on in the
kernel config.  This kernel config option defaults to "y" if both ELF
binaries and coredump are enabled.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-11-09 13:29:54 -05:00
Oleg Nesterov
9a13049e83 signal: introduce kernel_signal_stop() to fix jffs2_garbage_collect_thread()
jffs2_garbage_collect_thread() can race with SIGCONT and sleep in
TASK_STOPPED state after it was already sent. Add the new helper,
kernel_signal_stop(), which does this correctly.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Tejun Heo <tj@kernel.org>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Markus Pargmann <mpa@pengutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06 17:50:42 -08:00
Oleg Nesterov
be0e6f290f signal: turn dequeue_signal_lock() into kernel_dequeue_signal()
1. Rename dequeue_signal_lock() to kernel_dequeue_signal(). This
   matches another "for kthreads only" kernel_sigaction() helper.

2. Remove the "tsk" and "mask" arguments, they are always current
   and current->blocked. And it is simply wrong if tsk != current.

3. We could also remove the 3rd "siginfo_t *info" arg but it looks
   potentially useful. However we can simplify the callers if we
   change kernel_dequeue_signal() to accept info => NULL.

4. Remove _irqsave, it is never called from atomic context.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Tejun Heo <tj@kernel.org>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Markus Pargmann <mpa@pengutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06 17:50:42 -08:00
Oleg Nesterov
2e01fabe67 signals: kill block_all_signals() and unblock_all_signals()
It is hardly possible to enumerate all problems with block_all_signals()
and unblock_all_signals().  Just for example,

1. block_all_signals(SIGSTOP/etc) simply can't help if the caller is
   multithreaded. Another thread can dequeue the signal and force the
   group stop.

2. Even is the caller is single-threaded, it will "stop" anyway. It
   will not sleep, but it will spin in kernel space until SIGCONT or
   SIGKILL.

And a lot more. In short, this interface doesn't work at all, at least
the last 10+ years.

Daniel said:

  Yeah the only times I played around with the DRM_LOCK stuff was when
  old drivers accidentally deadlocked - my impression is that the entire
  DRM_LOCK thing was never really tested properly ;-) Hence I'm all for
  purging where this leaks out of the drm subsystem.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Acked-by: Dave Airlie <airlied@redhat.com>
Cc: Richard Weinberger <richard@nod.at>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06 17:50:42 -08:00
Linus Torvalds
2e3078af2c Merge branch 'akpm' (patches from Andrew)
Merge patch-bomb from Andrew Morton:

 - inotify tweaks

 - some ocfs2 updates (many more are awaiting review)

 - various misc bits

 - kernel/watchdog.c updates

 - Some of mm.  I have a huge number of MM patches this time and quite a
   lot of it is quite difficult and much will be held over to next time.

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (162 commits)
  selftests: vm: add tests for lock on fault
  mm: mlock: add mlock flags to enable VM_LOCKONFAULT usage
  mm: introduce VM_LOCKONFAULT
  mm: mlock: add new mlock system call
  mm: mlock: refactor mlock, munlock, and munlockall code
  kasan: always taint kernel on report
  mm, slub, kasan: enable user tracking by default with KASAN=y
  kasan: use IS_ALIGNED in memory_is_poisoned_8()
  kasan: Fix a type conversion error
  lib: test_kasan: add some testcases
  kasan: update reference to kasan prototype repo
  kasan: move KASAN_SANITIZE in arch/x86/boot/Makefile
  kasan: various fixes in documentation
  kasan: update log messages
  kasan: accurately determine the type of the bad access
  kasan: update reported bug types for kernel memory accesses
  kasan: update reported bug types for not user nor kernel memory accesses
  mm/kasan: prevent deadlock in kasan reporting
  mm/kasan: don't use kasan shadow pointer in generic functions
  mm/kasan: MODULE_VADDR is not available on all archs
  ...
2015-11-05 23:10:54 -08:00
Tejun Heo
b23afb93d3 memcg: punt high overage reclaim to return-to-userland path
Currently, try_charge() tries to reclaim memory synchronously when the
high limit is breached; however, if the allocation doesn't have
__GFP_WAIT, synchronous reclaim is skipped.  If a process performs only
speculative allocations, it can blow way past the high limit.  This is
actually easily reproducible by simply doing "find /".  slab/slub
allocator tries speculative allocations first, so as long as there's
memory which can be consumed without blocking, it can keep allocating
memory regardless of the high limit.

This patch makes try_charge() always punt the over-high reclaim to the
return-to-userland path.  If try_charge() detects that high limit is
breached, it adds the overage to current->memcg_nr_pages_over_high and
schedules execution of mem_cgroup_handle_over_high() which performs
synchronous reclaim from the return-to-userland path.

As long as kernel doesn't have a run-away allocation spree, this should
provide enough protection while making kmemcg behave more consistently.
It also has the following benefits.

- All over-high reclaims can use GFP_KERNEL regardless of the specific
  gfp mask in use, e.g. GFP_NOFS, when the limit was breached.

- It copes with prio inversion.  Previously, a low-prio task with
  small memory.high might perform over-high reclaim with a bunch of
  locks held.  If a higher prio task needed any of these locks, it
  would have to wait until the low prio task finished reclaim and
  released the locks.  By handing over-high reclaim to the task exit
  path this issue can be avoided.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Michal Hocko <mhocko@kernel.org>
Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-05 19:34:48 -08:00
Tejun Heo
626ebc4100 memcg: flatten task_struct->memcg_oom
task_struct->memcg_oom is a sub-struct containing fields which are used
for async memcg oom handling.  Most task_struct fields aren't packaged
this way and it can lead to unnecessary alignment paddings.  This patch
flattens it.

* task.memcg_oom.memcg          -> task.memcg_in_oom
* task.memcg_oom.gfp_mask	-> task.memcg_oom_gfp_mask
* task.memcg_oom.order          -> task.memcg_oom_order
* task.memcg_oom.may_oom        -> task.memcg_may_oom

In addition, task.memcg_may_oom is relocated to where other bitfields are
which reduces the size of task_struct.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-05 19:34:48 -08:00
Don Zickus
ac1f591249 kernel/watchdog.c: add sysctl knob hardlockup_panic
The only way to enable a hardlockup to panic the machine is to set
'nmi_watchdog=panic' on the kernel command line.

This makes it awkward for end users and folks who want to run automate
tests (like myself).

Mimic the softlockup_panic knob and create a /proc/sys/kernel/hardlockup_panic
knob.

Signed-off-by: Don Zickus <dzickus@redhat.com>
Cc: Ulrich Obergfell <uobergfe@redhat.com>
Acked-by: Jiri Kosina <jkosina@suse.cz>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-05 19:34:48 -08:00
Linus Torvalds
69234acee5 Merge branch 'for-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
 "The cgroup core saw several significant updates this cycle:

   - percpu_rwsem for threadgroup locking is reinstated.  This was
     temporarily dropped due to down_write latency issues.  Oleg's
     rework of percpu_rwsem which is scheduled to be merged in this
     merge window resolves the issue.

   - On the v2 hierarchy, when controllers are enabled and disabled, all
     operations are atomic and can fail and revert cleanly.  This allows
     ->can_attach() failure which is necessary for cpu RT slices.

   - Tasks now stay associated with the original cgroups after exit
     until released.  This allows tracking resources held by zombies
     (e.g.  pids) and makes it easy to find out where zombies came from
     on the v2 hierarchy.  The pids controller was broken before these
     changes as zombies escaped the limits; unfortunately, updating this
     behavior required too many invasive changes and I don't think it's
     a good idea to backport them, so the pids controller on 4.3, the
     first version which included the pids controller, will stay broken
     at least until I'm sure about the cgroup core changes.

   - Optimization of a couple common tests using static_key"

* 'for-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (38 commits)
  cgroup: fix race condition around termination check in css_task_iter_next()
  blkcg: don't create "io.stat" on the root cgroup
  cgroup: drop cgroup__DEVEL__legacy_files_on_dfl
  cgroup: replace error handling in cgroup_init() with WARN_ON()s
  cgroup: add cgroup_subsys->free() method and use it to fix pids controller
  cgroup: keep zombies associated with their original cgroups
  cgroup: make css_set_rwsem a spinlock and rename it to css_set_lock
  cgroup: don't hold css_set_rwsem across css task iteration
  cgroup: reorganize css_task_iter functions
  cgroup: factor out css_set_move_task()
  cgroup: keep css_set and task lists in chronological order
  cgroup: make cgroup_destroy_locked() test cgroup_is_populated()
  cgroup: make css_sets pin the associated cgroups
  cgroup: relocate cgroup_[try]get/put()
  cgroup: move check_for_release() invocation
  cgroup: replace cgroup_has_tasks() with cgroup_is_populated()
  cgroup: make cgroup->nr_populated count the number of populated css_sets
  cgroup: remove an unused parameter from cgroup_task_migrate()
  cgroup: fix too early usage of static_branch_disable()
  cgroup: make cgroup_update_dfl_csses() migrate all target processes atomically
  ...
2015-11-05 14:51:32 -08:00
Linus Torvalds
b0f85fa11a Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:

Changes of note:

 1) Allow to schedule ICMP packets in IPVS, from Alex Gartrell.

 2) Provide FIB table ID in ipv4 route dumps just as ipv6 does, from
    David Ahern.

 3) Allow the user to ask for the statistics to be filtered out of
    ipv4/ipv6 address netlink dumps.  From Sowmini Varadhan.

 4) More work to pass the network namespace context around deep into
    various packet path APIs, starting with the netfilter hooks.  From
    Eric W Biederman.

 5) Add layer 2 TX/RX checksum offloading to qeth driver, from Thomas
    Richter.

 6) Use usec resolution for SYN/ACK RTTs in TCP, from Yuchung Cheng.

 7) Support Very High Throughput in wireless MESH code, from Bob
    Copeland.

 8) Allow setting the ageing_time in switchdev/rocker.  From Scott
    Feldman.

 9) Properly autoload L2TP type modules, from Stephen Hemminger.

10) Fix and enable offload features by default in 8139cp driver, from
    David Woodhouse.

11) Support both ipv4 and ipv6 sockets in a single vxlan device, from
    Jiri Benc.

12) Fix CWND limiting of thin streams in TCP, from Bendik Rønning
    Opstad.

13) Fix IPSEC flowcache overflows on large systems, from Steffen
    Klassert.

14) Convert bridging to track VLANs using rhashtable entries rather than
    a bitmap.  From Nikolay Aleksandrov.

15) Make TCP listener handling completely lockless, this is a major
    accomplishment.  Incoming request sockets now live in the
    established hash table just like any other socket too.

    From Eric Dumazet.

15) Provide more bridging attributes to netlink, from Nikolay
    Aleksandrov.

16) Use hash based algorithm for ipv4 multipath routing, this was very
    long overdue.  From Peter Nørlund.

17) Several y2038 cures, mostly avoiding timespec.  From Arnd Bergmann.

18) Allow non-root execution of EBPF programs, from Alexei Starovoitov.

19) Support SO_INCOMING_CPU as setsockopt, from Eric Dumazet.  This
    influences the port binding selection logic used by SO_REUSEPORT.

20) Add ipv6 support to VRF, from David Ahern.

21) Add support for Mellanox Spectrum switch ASIC, from Jiri Pirko.

22) Add rtl8xxxu Realtek wireless driver, from Jes Sorensen.

23) Implement RACK loss recovery in TCP, from Yuchung Cheng.

24) Support multipath routes in MPLS, from Roopa Prabhu.

25) Fix POLLOUT notification for listening sockets in AF_UNIX, from Eric
    Dumazet.

26) Add new QED Qlogic river, from Yuval Mintz, Manish Chopra, and
    Sudarsana Kalluru.

27) Don't fetch timestamps on AF_UNIX sockets, from Hannes Frederic
    Sowa.

28) Support ipv6 geneve tunnels, from John W Linville.

29) Add flood control support to switchdev layer, from Ido Schimmel.

30) Fix CHECKSUM_PARTIAL handling of potentially fragmented frames, from
    Hannes Frederic Sowa.

31) Support persistent maps and progs in bpf, from Daniel Borkmann.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1790 commits)
  sh_eth: use DMA barriers
  switchdev: respect SKIP_EOPNOTSUPP flag in case there is no recursion
  net: sched: kill dead code in sch_choke.c
  irda: Delete an unnecessary check before the function call "irlmp_unregister_service"
  net: dsa: mv88e6xxx: include DSA ports in VLANs
  net: dsa: mv88e6xxx: disable SA learning for DSA and CPU ports
  net/core: fix for_each_netdev_feature
  vlan: Invoke driver vlan hooks only if device is present
  arcnet/com20020: add LEDS_CLASS dependency
  bpf, verifier: annotate verbose printer with __printf
  dp83640: Only wait for timestamps for packets with timestamping enabled.
  ptp: Change ptp_class to a proper bitmask
  dp83640: Prune rx timestamp list before reading from it
  dp83640: Delay scheduled work.
  dp83640: Include hash in timestamp/packet matching
  ipv6: fix tunnel error handling
  net/mlx5e: Fix LSO vlan insertion
  net/mlx5e: Re-eanble client vlan TX acceleration
  net/mlx5e: Return error in case mlx5e_set_features() fails
  net/mlx5e: Don't allow more than max supported channels
  ...
2015-11-04 09:41:05 -08:00
Linus Torvalds
53528695ff Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler changes from Ingo Molnar:
 "The main changes in this cycle were:

   - sched/fair load tracking fixes and cleanups (Byungchul Park)

   - Make load tracking frequency scale invariant (Dietmar Eggemann)

   - sched/deadline updates (Juri Lelli)

   - stop machine fixes, cleanups and enhancements for bugs triggered by
     CPU hotplug stress testing (Oleg Nesterov)

   - scheduler preemption code rework: remove PREEMPT_ACTIVE and related
     cleanups (Peter Zijlstra)

   - Rework the sched_info::run_delay code to fix races (Peter Zijlstra)

   - Optimize per entity utilization tracking (Peter Zijlstra)

   - ... misc other fixes, cleanups and smaller updates"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (57 commits)
  sched: Don't scan all-offline ->cpus_allowed twice if !CONFIG_CPUSETS
  sched: Move cpu_active() tests from stop_two_cpus() into migrate_swap_stop()
  sched: Start stopper early
  stop_machine: Kill cpu_stop_threads->setup() and cpu_stop_unpark()
  stop_machine: Kill smp_hotplug_thread->pre_unpark, introduce stop_machine_unpark()
  stop_machine: Change cpu_stop_queue_two_works() to rely on stopper->enabled
  stop_machine: Introduce __cpu_stop_queue_work() and cpu_stop_queue_two_works()
  stop_machine: Ensure that a queued callback will be called before cpu_stop_park()
  sched/x86: Fix typo in __switch_to() comments
  sched/core: Remove a parameter in the migrate_task_rq() function
  sched/core: Drop unlikely behind BUG_ON()
  sched/core: Fix task and run queue sched_info::run_delay inconsistencies
  sched/numa: Fix task_tick_fair() from disabling numa_balancing
  sched/core: Add preempt_count invariant check
  sched/core: More notrace annotations
  sched/core: Kill PREEMPT_ACTIVE
  sched/core, sched/x86: Kill thread_info::saved_preempt_count
  sched/core: Simplify preempt_count tests
  sched/core: Robustify preemption leak checks
  sched/core: Stop setting PREEMPT_ACTIVE
  ...
2015-11-03 18:03:50 -08:00
Linus Torvalds
2814228699 Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU changes from Ingo Molnar:
 "The main changes in this cycle were:

   - Improvements to expedited grace periods (Paul E McKenney)

   - Performance improvements to and locktorture tests for percpu-rwsem
     (Oleg Nesterov, Paul E McKenney)

   - Torture-test changes (Paul E McKenney, Davidlohr Bueso)

   - Documentation updates (Paul E McKenney)

   - Miscellaneous fixes (Paul E McKenney, Boqun Feng, Oleg Nesterov,
     Patrick Marlier)"

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (54 commits)
  fs/writeback, rcu: Don't use list_entry_rcu() for pointer offsetting in bdi_split_work_to_wbs()
  rcu: Better hotplug handling for synchronize_sched_expedited()
  rcu: Enable stall warnings for synchronize_rcu_expedited()
  rcu: Add tasks to expedited stall-warning messages
  rcu: Add online/offline info to expedited stall warning message
  rcu: Consolidate expedited CPU selection
  rcu: Prepare for consolidating expedited CPU selection
  cpu: Remove try_get_online_cpus()
  rcu: Stop excluding CPU hotplug in synchronize_sched_expedited()
  rcu: Stop silencing lockdep false positive for expedited grace periods
  rcu: Switch synchronize_sched_expedited() to IPI
  locktorture: Fix module unwind when bad torture_type specified
  torture: Forgive non-plural arguments
  rcutorture: Fix unused-function warning for torturing_tasks()
  rcutorture: Fix module unwind when bad torture_type specified
  rcu_sync: Cleanup the CONFIG_PROVE_RCU checks
  locking/percpu-rwsem: Clean up the lockdep annotations in percpu_down_read()
  locking/percpu-rwsem: Fix the comments outdated by rcu_sync
  locking/percpu-rwsem: Make use of the rcu_sync infrastructure
  locking/percpu-rwsem: Make percpu_free_rwsem() after kzalloc() safe
  ...
2015-11-03 15:40:38 -08:00
Ingo Molnar
c13dc31adb Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull RCU updates from Paul E. McKenney:

  - Miscellaneous fixes. (Paul E. McKenney, Boqun Feng, Oleg Nesterov, Patrick Marlier)

  - Improvements to expedited grace periods. (Paul E. McKenney)

  - Performance improvements to and locktorture tests for percpu-rwsem.
    (Oleg Nesterov, Paul E. McKenney)

  - Torture-test changes. (Paul E. McKenney, Davidlohr Bueso)

  - Documentation updates. (Paul E. McKenney)

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-19 10:09:54 +02:00
Jason Low
c8d75aa47d posix_cpu_timer: Reduce unnecessary sighand lock contention
It was found while running a database workload on large systems that
significant time was spent trying to acquire the sighand lock.

The issue was that whenever an itimer expired, many threads ended up
simultaneously trying to send the signal. Most of the time, nothing
happened after acquiring the sighand lock because another thread
had just already sent the signal and updated the "next expire" time.
The fastpath_timer_check() didn't help much since the "next expire"
time was updated after the threads exit fastpath_timer_check().

This patch addresses this by having the thread_group_cputimer structure
maintain a boolean to signify when a thread in the group is already
checking for process wide timers, and adds extra logic in the fastpath
to check the boolean.

Signed-off-by: Jason Low <jason.low2@hp.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: George Spelvin <linux@horizon.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: hideaki.kimura@hpe.com
Cc: terry.rudd@hpe.com
Cc: scott.norton@hpe.com
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1444849677-29330-5-git-send-email-jason.low2@hp.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-10-15 11:23:41 +02:00
Jason Low
d5c373eb56 posix_cpu_timer: Convert cputimer->running to bool
In the next patch in this series, a new field 'checking_timer' will
be added to 'struct thread_group_cputimer'. Both this and the
existing 'running' integer field are just used as boolean values. To
save space in the structure, we can make both of these fields booleans.

This is a preparatory patch to convert the existing running integer
field to a boolean.

Suggested-by: George Spelvin <linux@horizon.com>
Signed-off-by: Jason Low <jason.low2@hp.com>
Reviewed: George Spelvin <linux@horizon.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: hideaki.kimura@hpe.com
Cc: terry.rudd@hpe.com
Cc: scott.norton@hpe.com
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1444849677-29330-4-git-send-email-jason.low2@hp.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-10-15 11:23:41 +02:00
Alexei Starovoitov
aaac3ba95e bpf: charge user for creation of BPF maps and programs
since eBPF programs and maps use kernel memory consider it 'locked' memory
from user accounting point of view and charge it against RLIMIT_MEMLOCK limit.
This limit is typically set to 64Kbytes by distros, so almost all
bpf+tracing programs would need to increase it, since they use maps,
but kernel charges maximum map size upfront.
For example the hash map of 1024 elements will be charged as 64Kbyte.
It's inconvenient for current users and changes current behavior for root,
but probably worth doing to be consistent root vs non-root.

Similar accounting logic is done by mmap of perf_event.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12 19:13:36 -07:00
Peter Zijlstra
609ca06638 sched/core: Create preempt_count invariant
Assuming units of PREEMPT_DISABLE_OFFSET for preempt_count() numbers.

Now that TASK_DEAD no longer results in preempt_count() == 3 during
scheduling, we will always call context_switch() with preempt_count()
== 2.

However, we don't always end up with preempt_count() == 2 in
finish_task_switch() because new tasks get created with
preempt_count() == 1.

Create FORK_PREEMPT_COUNT and set it to 2 and use that in the right
places. Note that we cannot use INIT_PREEMPT_COUNT as that serves
another purpose (boot).

After this, preempt_count() is invariant across the context switch,
with exception of PREEMPT_ACTIVE.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-06 17:08:14 +02:00
Peter Zijlstra
87dcbc0610 sched/core: Simplify INIT_PREEMPT_COUNT
As per the following commit:

  d86ee4809d ("sched: optimize cond_resched()")

we need PREEMPT_ACTIVE to avoid cond_resched() from working before
the scheduler is set up.

However, keeping preemption disabled should do the same thing
already, making the PREEMPT_ACTIVE part entirely redundant.

The only complication is !PREEMPT_COUNT kernels, where
PREEMPT_DISABLED ends up being 0. Instead we use an unconditional
PREEMPT_OFFSET to set preempt_count() even on !PREEMPT_COUNT
kernels.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-06 17:08:12 +02:00
Ingo Molnar
fe19159225 Merge branch 'sched/urgent' into sched/core, to pick up fixes before applying new changes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-06 17:05:36 +02:00
Juergen Gross
c6e1e7b5b7 sched/core: Make 'sched_domain_topology' declaration static
The 'sched_domain_topology' variable is only used within kernel/sched/core.c.
Make it static.

Signed-off-by: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1442918939-9907-1-git-send-email-jgross@suse.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-23 10:19:12 +02:00
Paul E. McKenney
8203d6d0ee rcu: Use single-stage IPI algorithm for RCU expedited grace period
The current preemptible-RCU expedited grace-period algorithm invokes
synchronize_sched_expedited() to enqueue all tasks currently running
in a preemptible-RCU read-side critical section, then waits for all the
->blkd_tasks lists to drain.  This works, but results in both an IPI and
a double context switch even on CPUs that do not happen to be running
in a preemptible RCU read-side critical section.

This commit implements a new algorithm that causes less OS jitter.
This new algorithm IPIs all online CPUs that are not idle (from an
RCU perspective), but refrains from self-IPIs.  If a CPU receiving
this IPI is not in a preemptible RCU read-side critical section (or
is just now exiting one), it pushes quiescence up the rcu_node tree,
otherwise, it sets a flag that will be handled by the upcoming outermost
rcu_read_unlock(), which will then push quiescence up the tree.

The expedited grace period must of course wait on any pre-existing blocked
readers, and newly blocked readers must be queued carefully based on
the state of both the normal and the expedited grace periods.  This
new queueing approach also avoids the need to update boost state,
courtesy of the fact that blocked tasks are no longer ever migrated to
the root rcu_node structure.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-09-20 21:16:19 -07:00
Tejun Heo
1ed1328792 sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem
Note: This commit was originally committed as d59cfc09c3 but got
      reverted by 0c986253b9 due to the performance regression from
      the percpu_rwsem write down/up operations added to cgroup task
      migration path.  percpu_rwsem changes which alleviate the
      performance issue are pending for v4.4-rc1 merge window.
      Re-apply.

The cgroup side of threadgroup locking uses signal_struct->group_rwsem
to synchronize against threadgroup changes.  This per-process rwsem
adds small overhead to thread creation, exit and exec paths, forces
cgroup code paths to do lock-verify-unlock-retry dance in a couple
places and makes it impossible to atomically perform operations across
multiple processes.

This patch replaces signal_struct->group_rwsem with a global
percpu_rwsem cgroup_threadgroup_rwsem which is cheaper on the reader
side and contained in cgroups proper.  This patch converts one-to-one.

This does make writer side heavier and lower the granularity; however,
cgroup process migration is a fairly cold path, we do want to optimize
thread operations over it and cgroup migration operations don't take
enough time for the lower granularity to matter.

Signed-off-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/g/55F8097A.7000206@de.ibm.com
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
2015-09-16 12:53:17 -04:00
Tejun Heo
0c986253b9 Revert "sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem"
This reverts commit d59cfc09c3.

d59cfc09c3 ("sched, cgroup: replace signal_struct->group_rwsem with
a global percpu_rwsem") and b5ba75b5fc ("cgroup: simplify
threadgroup locking") changed how cgroup synchronizes against task
fork and exits so that it uses global percpu_rwsem instead of
per-process rwsem; unfortunately, the write [un]lock paths of
percpu_rwsem always involve synchronize_rcu_expedited() which turned
out to be too expensive.

Improvements for percpu_rwsem are scheduled to be merged in the coming
v4.4-rc1 merge window which alleviates this issue.  For now, revert
the two commits to restore per-process rwsem.  They will be re-applied
for the v4.4-rc1 merge window.

Signed-off-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/g/55F8097A.7000206@de.ibm.com
Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: stable@vger.kernel.org # v4.2+
2015-09-16 11:51:12 -04:00
Dietmar Eggemann
e3279a2e6d sched/fair: Make utilization tracking CPU scale-invariant
Besides the existing frequency scale-invariance correction factor, apply
CPU scale-invariance correction factor to utilization tracking to
compensate for any differences in compute capacity. This could be due to
micro-architectural differences (i.e. instructions per seconds) between
cpus in HMP systems (e.g. big.LITTLE), and/or differences in the current
maximum frequency supported by individual cpus in SMP systems. In the
existing implementation utilization isn't comparable between cpus as it
is relative to the capacity of each individual CPU.

Each segment of the sched_avg.util_sum geometric series is now scaled
by the CPU performance factor too so the sched_avg.util_avg of each
sched entity will be invariant from the particular CPU of the HMP/SMP
system on which the sched entity is scheduled.

With this patch, the utilization of a CPU stays relative to the max CPU
performance of the fastest CPU in the system.

In contrast to utilization (sched_avg.util_sum), load
(sched_avg.load_sum) should not be scaled by compute capacity. The
utilization metric is based on running time which only makes sense when
cpus are _not_ fully utilized (utilization cannot go beyond 100% even if
more tasks are added), where load is runnable time which isn't limited
by the capacity of the CPU and therefore is a better metric for
overloaded scenarios. If we run two nice-0 busy loops on two cpus with
different compute capacity their load should be similar since their
compute demands are the same. We have to assume that the compute demand
of any task running on a fully utilized CPU (no spare cycles = 100%
utilization) is high and the same no matter of the compute capacity of
its current CPU, hence we shouldn't scale load by CPU capacity.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/55CE7409.1000700@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-13 09:52:56 +02:00