Pull locking fixes from Ingo Molnar:
"Spinlock performance regression fix, plus documentation fixes"
* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/static_keys: Fix up the static keys documentation
locking/qspinlock/x86: Only emit the test-and-set fallback when building guest support
locking/qspinlock/x86: Fix performance regression under unaccelerated VMs
locking/static_keys: Fix a silly typo
Note: This commit was originally committed as b5ba75b5fc but got
reverted by f9f9e7b776 due to the performance regression from
the percpu_rwsem write down/up operations added to cgroup task
migration path. percpu_rwsem changes which alleviate the
performance issue are pending for v4.4-rc1 merge window.
Re-apply.
Now that threadgroup locking is made global, code paths around it can
be simplified.
* lock-verify-unlock-retry dancing removed from __cgroup_procs_write().
* Race protection against de_thread() removed from
cgroup_update_dfl_csses().
Signed-off-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/g/55F8097A.7000206@de.ibm.com
Note: This commit was originally committed as d59cfc09c3 but got
reverted by 0c986253b9 due to the performance regression from
the percpu_rwsem write down/up operations added to cgroup task
migration path. percpu_rwsem changes which alleviate the
performance issue are pending for v4.4-rc1 merge window.
Re-apply.
The cgroup side of threadgroup locking uses signal_struct->group_rwsem
to synchronize against threadgroup changes. This per-process rwsem
adds small overhead to thread creation, exit and exec paths, forces
cgroup code paths to do lock-verify-unlock-retry dance in a couple
places and makes it impossible to atomically perform operations across
multiple processes.
This patch replaces signal_struct->group_rwsem with a global
percpu_rwsem cgroup_threadgroup_rwsem which is cheaper on the reader
side and contained in cgroups proper. This patch converts one-to-one.
This does make writer side heavier and lower the granularity; however,
cgroup process migration is a fairly cold path, we do want to optimize
thread operations over it and cgroup migration operations don't take
enough time for the lower granularity to matter.
Signed-off-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/g/55F8097A.7000206@de.ibm.com
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
This reverts commit d59cfc09c3.
d59cfc09c3 ("sched, cgroup: replace signal_struct->group_rwsem with
a global percpu_rwsem") and b5ba75b5fc ("cgroup: simplify
threadgroup locking") changed how cgroup synchronizes against task
fork and exits so that it uses global percpu_rwsem instead of
per-process rwsem; unfortunately, the write [un]lock paths of
percpu_rwsem always involve synchronize_rcu_expedited() which turned
out to be too expensive.
Improvements for percpu_rwsem are scheduled to be merged in the coming
v4.4-rc1 merge window which alleviates this issue. For now, revert
the two commits to restore per-process rwsem. They will be re-applied
for the v4.4-rc1 merge window.
Signed-off-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/g/55F8097A.7000206@de.ibm.com
Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: stable@vger.kernel.org # v4.2+
This reverts commit b5ba75b5fc.
d59cfc09c3 ("sched, cgroup: replace signal_struct->group_rwsem with
a global percpu_rwsem") and b5ba75b5fc ("cgroup: simplify
threadgroup locking") changed how cgroup synchronizes against task
fork and exits so that it uses global percpu_rwsem instead of
per-process rwsem; unfortunately, the write [un]lock paths of
percpu_rwsem always involve synchronize_rcu_expedited() which turned
out to be too expensive.
Improvements for percpu_rwsem are scheduled to be merged in the coming
v4.4-rc1 merge window which alleviates this issue. For now, revert
the two commits to restore per-process rwsem. They will be re-applied
for the v4.4-rc1 merge window.
Signed-off-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/g/55F8097A.7000206@de.ibm.com
Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: stable@vger.kernel.org # v4.2+
Most interrupt flow handlers do not use the irq argument. Those few
which use it can retrieve the irq number from the irq descriptor.
Remove the argument.
Search and replace was done with coccinelle and some extra helper
scripts around it. Thanks to Julia for her help!
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Julia Lawall <Julia.Lawall@lip6.fr>
Cc: Jiang Liu <jiang.liu@linux.intel.com>
MSI descriptors are per-irq instead of per irqchip, so move it into
struct irq_common_data.
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Jason Cooper <jason@lakedaemon.net>
Cc: Kevin Cernekee <cernekee@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Link: http://lkml.kernel.org/r/1433145945-789-35-git-send-email-jiang.liu@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Irq affinity mask is per-irq instead of per irqchip, so move it into
struct irq_common_data.
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Jason Cooper <jason@lakedaemon.net>
Cc: Kevin Cernekee <cernekee@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Link: http://lkml.kernel.org/r/1433303281-27688-1-git-send-email-jiang.liu@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Handler data (handler_data) is per-irq instead of per irqchip, so move
it into struct irq_common_data.
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Jason Cooper <jason@lakedaemon.net>
Cc: Kevin Cernekee <cernekee@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Link: http://lkml.kernel.org/r/1433145945-789-13-git-send-email-jiang.liu@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
NUMA node information is per-irq instead of per-irqchip, so move it into
struct irq_common_data. Also use CONFIG_NUMA to guard irq_common_data.node.
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Jason Cooper <jason@lakedaemon.net>
Cc: Kevin Cernekee <cernekee@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Link: http://lkml.kernel.org/r/1433145945-789-8-git-send-email-jiang.liu@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Add a sysfs attribute, /sys/power/pm_wakeup_irq, reporting the IRQ
number of the first wakeup interrupt (that is, the first interrupt
from an IRQ line armed for system wakeup) seen by the kernel during
the most recent system suspend/resume cycle.
This feature will be useful for system wakeup diagnostics of
spurious wakeup interrupts.
Signed-off-by: Alexandra Yates <alexandra.yates@linux.intel.com>
[ rjw: Fixed up pm_wakeup_irq definition ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Define a new PERF_PMU_TXN_READ interface to read a group of counters
at once.
pmu->start_txn() // Initialize before first event
for each event in group
pmu->read(event); // Queue each event to be read
rc = pmu->commit_txn() // Read/update all queued counters
Note that we use this interface with all PMUs. PMUs that implement this
interface use the ->read() operation to _queue_ the counters to be read
and use ->commit_txn() to actually read all the queued counters at once.
PMUs that don't implement PERF_PMU_TXN_READ ignore ->start_txn() and
->commit_txn() and continue to read counters one at a time.
Thanks to input from Peter Zijlstra.
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/1441336073-22750-9-git-send-email-sukadev@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When we implement the ability to read several counters at once (using
the PERF_PMU_TXN_READ transaction interface), perf_event_read() can
fail when the 'group' parameter is true (eg: trying to read too many
events at once).
For now, have perf_event_read() return an integer. Ignore the return
value when the 'group' parameter is false.
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/1441336073-22750-8-git-send-email-sukadev@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In order to enable the use of perf_event_read(.group = true), we need
to invert the sibling-child loop nesting of perf_read_group().
Currently we iterate the child list for each sibling, this precludes
using group reads. Flip things around so we iterate each group for
each child.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
[ Made the patch compile and things. ]
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/1441336073-22750-7-git-send-email-sukadev@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Enable perf_event_read() to update entire groups at once, this will be
useful for read transactions.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/20150723080435.GE25159@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
perf_event_read() does two things:
- call the PMU to read/update the counter value, and
- compute the total count of the event and its children
Not all callers need both. perf_event_reset() for instance needs the
first piece but doesn't need the second. Similarly, when we implement
the ability to read a group of events using the transaction interface,
we would need the two pieces done independently.
Break up perf_event_read() and have it just read/update the counter
and have the callers compute the total count if necessary.
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/1441336073-22750-4-git-send-email-sukadev@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently, the PMU interface allows reading only one counter at a time.
But some PMUs like the 24x7 counters in Power, support reading several
counters at once. To leveage this functionality, extend the transaction
interface to support a "transaction type".
The first type, PERF_PMU_TXN_ADD, refers to the existing transactions,
i.e. used to _schedule_ all the events on the PMU as a group. A second
transaction type, PERF_PMU_TXN_READ, will be used in a follow-on patch,
by the 24x7 counters to read several counters at once.
Extend the transaction interfaces to the PMU to accept a 'txn_flags'
parameter and use this parameter to ignore any transactions that are
not of type PERF_PMU_TXN_ADD.
Thanks to Peter Zijlstra for his input.
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
[peterz: s390 compile fix]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/1441336073-22750-3-git-send-email-sukadev@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
cgroup_exit() is not called from copy_process() after commit:
e8604cb436 ("cgroup: fix spurious lockdep warning in cgroup_exit()")
from do_exit(). So this check is useless and the comment is obsolete.
Signed-off-by: Kirill Tkhai <ktkhai@odin.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/55E444C8.3020402@odin.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The internal clocksteering done for fine-grained error
correction uses a logarithmic approximation, so any time
adjtimex() adjusts the clock steering, timekeeping_freqadjust()
quickly approximates the correct clock frequency over a series
of ticks.
Unfortunately, the logic in timekeeping_freqadjust(), introduced
in commit:
dc491596f6 ("timekeeping: Rework frequency adjustments to work better w/ nohz")
used the abs() function with a s64 error value to calculate the
size of the approximated adjustment to be made.
Per include/linux/kernel.h:
"abs() should not be used for 64-bit types (s64, u64, long long) - use abs64()".
Thus on 32-bit platforms, this resulted in the clocksteering to
take a quite dampended random walk trying to converge on the
proper frequency, which caused the adjustments to be made much
slower then intended (most easily observed when large
adjustments are made).
This patch fixes the issue by using abs64() instead.
Reported-by: Nuno Gonçalves <nunojpg@gmail.com>
Tested-by: Nuno Goncalves <nunojpg@gmail.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: <stable@vger.kernel.org> # v3.17+
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Miroslav Lichvar <mlichvar@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1441840051-20244-1-git-send-email-john.stultz@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently the load_{sum,avg} and util_{sum,avg} tracking is asymmetric
in that load tracking gets a 2^10 unit from the weight, but util gets
no such factor.
This results in more lost bits for util scaling and asymmetric scaling
rules.
Fix this by removing shifts, such that we gain the 2^10 factor from
scaling. There is no risk of overflowing the u32 as the max value is
now LOAD_AVG_MAX << 10, which is still well below UINT_MAX.
This further entangles the assumption that both LOAD and CAPACITY
shifts are the same (and 10) so put in an assertion for that.
This fixes the math for the LOAD_RESOLUTION != 0 case.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Do not call the scaling functions in case time goes backwards or the
last update of the sched_avg structure has happened less than 1024ns
ago.
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juri Lelli <Juri.Lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: daniel.lezcano@linaro.org <daniel.lezcano@linaro.org>
Cc: mturquette@baylibre.com <mturquette@baylibre.com>
Cc: pang.xunlei@zte.com.cn <pang.xunlei@zte.com.cn>
Cc: rjw@rjwysocki.net <rjw@rjwysocki.net>
Cc: sgurrappadi@nvidia.com <sgurrappadi@nvidia.com>
Cc: vincent.guittot@linaro.org <vincent.guittot@linaro.org>
Cc: yuyang.du@intel.com <yuyang.du@intel.com>
Link: http://lkml.kernel.org/r/55EDA2E9.8040900@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Prior to this patch; the line:
scaled_delta_w = (delta_w * 1024) >> 10;
which is the result of the default arch_scale_freq_capacity()
function, turns into:
1b03: 49 89 d1 mov %rdx,%r9
1b06: 49 c1 e1 0a shl $0xa,%r9
1b0a: 49 c1 e9 0a shr $0xa,%r9
Which is silly; when made unsigned int, GCC recognises this as
pointless ops and fails to emit them (confirmed on 4.9.3 and 5.1.1).
Furthermore, afaict unsigned is actually the correct type for these
fields anyway, as we've explicitly ruled out negative delta's earlier
in this function.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Rename scale() to cap_scale() to better reflect its purpose, it is
after all not a general purpose scale function, it has
SCHED_CAPACITY_SHIFT hardcoded in it.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Utilization is currently scaled by capacity_orig, but since we now have
frequency and cpu invariant cfs_rq.avg.util_avg, frequency and cpu scaling
now happens as part of the utilization tracking itself.
So cfs_rq.avg.util_avg should no longer be scaled in cpu_util().
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juri Lelli <Juri.Lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steve Muckle <steve.muckle@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: daniel.lezcano@linaro.org <daniel.lezcano@linaro.org>
Cc: mturquette@baylibre.com <mturquette@baylibre.com>
Cc: pang.xunlei@zte.com.cn <pang.xunlei@zte.com.cn>
Cc: rjw@rjwysocki.net <rjw@rjwysocki.net>
Cc: sgurrappadi@nvidia.com <sgurrappadi@nvidia.com>
Cc: vincent.guittot@linaro.org <vincent.guittot@linaro.org>
Cc: yuyang.du@intel.com <yuyang.du@intel.com>
Link: http://lkml.kernel.org/r/55EDAF43.30500@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Use the advent of the per-entity load tracking rewrite to streamline the
naming of utilization related data and functions by using
{prefix_}util{_suffix} consistently. Moreover call both signals
({se,cfs}.avg.util_avg) utilization.
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Dietmar Eggemann <Dietmar.Eggemann@arm.com>
Cc: Juri Lelli <Juri.Lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: daniel.lezcano@linaro.org
Cc: mturquette@baylibre.com
Cc: pang.xunlei@zte.com.cn
Cc: rjw@rjwysocki.net
Cc: sgurrappadi@nvidia.com
Cc: vincent.guittot@linaro.org
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1439569394-11974-5-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Besides the existing frequency scale-invariance correction factor, apply
CPU scale-invariance correction factor to utilization tracking to
compensate for any differences in compute capacity. This could be due to
micro-architectural differences (i.e. instructions per seconds) between
cpus in HMP systems (e.g. big.LITTLE), and/or differences in the current
maximum frequency supported by individual cpus in SMP systems. In the
existing implementation utilization isn't comparable between cpus as it
is relative to the capacity of each individual CPU.
Each segment of the sched_avg.util_sum geometric series is now scaled
by the CPU performance factor too so the sched_avg.util_avg of each
sched entity will be invariant from the particular CPU of the HMP/SMP
system on which the sched entity is scheduled.
With this patch, the utilization of a CPU stays relative to the max CPU
performance of the fastest CPU in the system.
In contrast to utilization (sched_avg.util_sum), load
(sched_avg.load_sum) should not be scaled by compute capacity. The
utilization metric is based on running time which only makes sense when
cpus are _not_ fully utilized (utilization cannot go beyond 100% even if
more tasks are added), where load is runnable time which isn't limited
by the capacity of the CPU and therefore is a better metric for
overloaded scenarios. If we run two nice-0 busy loops on two cpus with
different compute capacity their load should be similar since their
compute demands are the same. We have to assume that the compute demand
of any task running on a fully utilized CPU (no spare cycles = 100%
utilization) is high and the same no matter of the compute capacity of
its current CPU, hence we shouldn't scale load by CPU capacity.
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/55CE7409.1000700@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Bring arch_scale_cpu_capacity() in line with the recent change of its
arch_scale_freq_capacity() sibling in commit dfbca41f34 ("sched:
Optimize freq invariant accounting") from weak function to #define to
allow inlining of the function.
While at it, remove the ARCH_CAPACITY sched_feature as well. With the
change to #define there isn't a straightforward way to allow runtime
switch between an arch implementation and the default implementation of
arch_scale_cpu_capacity() using sched_feature. The default was to use
the arch-specific implementation, but only the arm architecture provides
one and that is essentially equivalent to the default implementation.
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Dietmar Eggemann <Dietmar.Eggemann@arm.com>
Cc: Juri Lelli <Juri.Lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: daniel.lezcano@linaro.org
Cc: mturquette@baylibre.com
Cc: pang.xunlei@zte.com.cn
Cc: rjw@rjwysocki.net
Cc: sgurrappadi@nvidia.com
Cc: vincent.guittot@linaro.org
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1439569394-11974-3-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Apply frequency scaling correction factor to per-entity load tracking to
make it frequency invariant. Currently, load appears bigger when the CPU
is running slower which affects load-balancing decisions.
Each segment of the sched_avg.load_sum geometric series is now scaled by
the current frequency so that the sched_avg.load_avg of each sched entity
will be invariant from frequency scaling.
Moreover, cfs_rq.runnable_load_sum is scaled by the current frequency as
well.
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <Dietmar.Eggemann@arm.com>
Cc: Juri Lelli <Juri.Lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: daniel.lezcano@linaro.org
Cc: mturquette@baylibre.com
Cc: pang.xunlei@zte.com.cn
Cc: rjw@rjwysocki.net
Cc: sgurrappadi@nvidia.com
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1439569394-11974-2-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Variable sched_numa_balancing is available for both CONFIG_SCHED_DEBUG
and !CONFIG_SCHED_DEBUG. All code paths now check for
sched_numa_balancing. Hence remove sched_feat(NUMA).
Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1439290813-6683-4-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit 2a1ed24 ("sched/numa: Prefer NUMA hotness over cache hotness")
sets sched feature NUMA to true. However this can enable NUMA hinting
faults on a UMA system.
This commit ensures that NUMA hinting faults occur only on a NUMA system
by setting/resetting sched_numa_balancing.
This commit:
- Makes sched_numa_balancing common to CONFIG_SCHED_DEBUG and
!CONFIG_SCHED_DEBUG. Earlier it was only in !CONFIG_SCHED_DEBUG.
- Checks for sched_numa_balancing instead of sched_feat(NUMA).
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1439290813-6683-3-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Since commit:
d4573c3e1c ("sched: Improve load balancing in the presence of idle CPUs")
the ILB CPU starts with the idle load balancing of other idle CPUs and
finishes with itself in order to speed up the spread of tasks in all
idle CPUs.
The this_rq->next_balance is still used in nohz_idle_balance() as an
intermediate step to gather the shortest next balance before updating
nohz.next_balance. But the former has not been updated yet and is likely to
be set with the current jiffies. As a result, the nohz.next_balance will be
set with current jiffies instead of the real next balance date. This
generates spurious kicks of nohz ilde balance.
nohz_idle_balance() must set the nohz.next_balance without taking into
account this_rq->next_balance which is not updated yet. Then, this_rq will
update nohz.next_update with its next_balance once updated and if necessary.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Jason Low <jason.low2@hp.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: preeti@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/1438595750-20455-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
cgroup_exit() is not called from copy_process() after commit:
e8604cb436 ("cgroup: fix spurious lockdep warning in cgroup_exit()")
from do_exit(). So this check is useless and the comment is obsolete.
Signed-off-by: Kirill Tkhai <ktkhai@odin.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/55E444C8.3020402@odin.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The previous patches made the second argument go unused, remove it.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
By observing that switched_from_fair() detaches from a runqueue, and
switched_to_fair() attaches to a runqueue, we can see that
task_move_group_fair() is one followed by the other with flipping the
runqueue in between.
Therefore extract all the common bits and implement all three
functions in terms of them.
This should fix a few corner cases wrt. vruntime normalization; where,
when we take a task off of a runqueue we convert to an approximation
of lag by subtracting min_vruntime, and when placing a task on the a
runqueue to the reverse.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Byungchul Park <byungchul.park@lge.com>
[peterz: Changelog]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1440069720-27038-6-git-send-email-byungchul.park@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In case there are problems with the aging on attach, provide a debug
knob to turn it off.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Cc: yuyang.du@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Where switched_from_fair() will remove the entity's load from the
runqueue, switched_to_fair() does not currently add it back. This
means that when a task leaves the fair class for a short duration; say
because of PI; we loose its load contribution.
This can ripple forward and disturb the load tracking because other
operations (enqueue, dequeue) assume its factored in. Only once the
runqueue empties will the load tracking recover.
When we add it back in, age the per entity average to match up with
the runqueue age. This has the obvious problem that if the task leaves
the fair class for a significant time, the load will age to 0.
Employ the normal migration rule for inter-runqueue moves in
task_move_group_fair(). Again, there is the obvious problem of the
task migrating while not in the fair class.
The alternative solution would be to to omit the chunk in
attach_entity_load_avg(), which would effectively reset the timestamp
and use whatever avg there was.
Signed-off-by: Byungchul Park <byungchul.park@lge.com>
[ Rewrote the changelog and comments. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1440069720-27038-5-git-send-email-byungchul.park@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Since we attach the entity load to the new runqueue, we should also
detatch the entity load from the old runqueue, otherwise load can
accumulate.
Signed-off-by: Byungchul Park <byungchul.park@lge.com>
[ Rewrote the changelog. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1440069720-27038-4-git-send-email-byungchul.park@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently we conditionally add the entity load to the rq when moving
the task between cgroups.
This doesn't make sense as we always 'migrate' the task between
cgroups, so we should always migrate the load too.
[ The history here is that we used to only migrate the blocked load
which was only meaningfull when !queued. ]
Signed-off-by: Byungchul Park <byungchul.park@lge.com>
[ Rewrote the changelog. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1440069720-27038-3-git-send-email-byungchul.park@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently we open-code the addition/subtraction of the per entity load
to/from the runqueue, factor this out into helper functions.
Signed-off-by: Byungchul Park <byungchul.park@lge.com>
[ Rewrote the changelog. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1440069720-27038-2-git-send-email-byungchul.park@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Merge fourth patch-bomb from Andrew Morton:
- sys_membarier syscall
- seq_file interface changes
- a few misc fixups
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
revert "ocfs2/dlm: use list_for_each_entry instead of list_for_each"
mm/early_ioremap: add explicit #include of asm/early_ioremap.h
fs/seq_file: convert int seq_vprint/seq_printf/etc... returns to void
selftests: enhance membarrier syscall test
selftests: add membarrier syscall test
sys_membarrier(): system-wide memory barrier (generic, x86)
MODSIGN: fix a compilation warning in extract-cert
- Build fix for the new Mediatek MT8173 cpufreq driver (Guenter Roeck).
- Generic power domains framework fixes (power on error code
path, subdomain removal) and cleanup of a deprecated API user
(Geert Uytterhoeven, Jon Hunter, Ulf Hansson).
- cpufreq-dt driver fixes including two fixes for bugs related to
the new Operating Performance Points Device Tree bindings
introduced recently (Viresh Kumar).
- Suspend frequency support for the cpufreq-dt driver
(Bartlomiej Zolnierkiewicz, Viresh Kumar).
- cpufreq core cleanups (Viresh Kumar).
- intel_pstate driver fixes (Chen Yu, Kristen Carlson Accardi).
- Additional sanity check in the cpuidle core (Xunlei Pang).
- Fix for a comment related to CPU power management (Lina Iyer).
/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABCAAGBQJV80XwAAoJEILEb/54YlRxIAkP/RrT4mDO9QZkocO+XYDBB739
oi1aaSDmmVVKlFtFiyX4njS7S+U5UC8PYn5o/fFtVnP1kD9yuZkoak65U3t7bsKK
+rjun1SzJeooaLXW4ZAJKyPldi/RrsfRJ2JOaEt2vK03ppjx2mouK1VpjNj+qmYd
wwFb4q5S5Rcpph0Sx+xxhpZKoOgBmXk7nWIe5bWDxuORdx4yoxp4lus/P223wFJy
r0kJnNyY55mnY4Yx5e3L8W1rMC/Sf6mt8pyMvqRpsKdBHaEbhmGsiERSi0eCZJwr
AmL666TVgPk6zNM/yc3mFOTRZeAF3soSWF5R6+n2TswWvWFH+ksM14g9ndb7fTv3
FUuOU3jyfSUj2T4xr0qUZLaQUFB3xNrnWBNnm/CzPwjJtH1u1SK8uBSRqdZnEdGq
EeqwnwDLFeSAbG6lQVsKVVAPZHolDCnJPsvZebKdnJH09kjgUdm60BLOnif4QDgw
ULPZ6NToFZkrQSMAX1Uqm42iv/qWcHIgS5x43jAmblY3TsuByqRrI7gBCy+hbGSN
cQfeCxTdVPi/tUPYlJFmWn43dgEsnceSFKvlQRXGrZJM29BFvhVzJQjTFC8Oj79q
OP/Rv28+VeV5J531BQu+w8cvg73GJ3DDab+i91ZkIsd7ZNjRYQKXBfR9TImTSD4l
BPjvaPXwDg6EPhwjcpK/
=C7+4
-----END PGP SIGNATURE-----
Merge tag 'pm+acpi-4.3-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull more power management and ACPI updates from Rafael Wysocki:
"These are mostly fixes and cleanups on top of the previous PM+ACPI
pull request (cpufreq core and drivers, cpuidle, generic power domains
framework). Some of them didn't make to that pull request and some
fix issues introduced by it.
The only really new thing is the support for suspend frequency in the
cpufreq-dt driver, but it is needed to fix an issue with Exynos
platforms.
Specifics:
- build fix for the new Mediatek MT8173 cpufreq driver (Guenter
Roeck).
- generic power domains framework fixes (power on error code path,
subdomain removal) and cleanup of a deprecated API user (Geert
Uytterhoeven, Jon Hunter, Ulf Hansson).
- cpufreq-dt driver fixes including two fixes for bugs related to the
new Operating Performance Points Device Tree bindings introduced
recently (Viresh Kumar).
- suspend frequency support for the cpufreq-dt driver (Bartlomiej
Zolnierkiewicz, Viresh Kumar).
- cpufreq core cleanups (Viresh Kumar).
- intel_pstate driver fixes (Chen Yu, Kristen Carlson Accardi).
- additional sanity check in the cpuidle core (Xunlei Pang).
- fix for a comment related to CPU power management (Lina Iyer)"
* tag 'pm+acpi-4.3-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
intel_pstate: fix PCT_TO_HWP macro
intel_pstate: Fix user input of min/max to legal policy region
PM / OPP: Return suspend_opp only if it is enabled
cpufreq-dt: add suspend frequency support
cpufreq: allow cpufreq_generic_suspend() to work without suspend frequency
PM / OPP: add dev_pm_opp_get_suspend_opp() helper
staging: board: Migrate away from __pm_genpd_name_add_device()
cpufreq: Use __func__ to print function's name
cpufreq: staticize cpufreq_cpu_get_raw()
PM / Domains: Ensure subdomain is not in use before removing
cpufreq: Add ARM_MT8173_CPUFREQ dependency on THERMAL
cpuidle/coupled: Add sanity check for safe_state_index
PM / Domains: Try power off masters in error path of __pm_genpd_poweron()
cpufreq: dt: Tolerance applies on both sides of target voltage
cpufreq: dt: Print error on failing to mark OPPs as shared
cpufreq: dt: Check OPP count before marking them shared
kernel/cpu_pm: fix cpu_cluster_pm_exit comment
Here is an implementation of a new system call, sys_membarrier(), which
executes a memory barrier on all threads running on the system. It is
implemented by calling synchronize_sched(). It can be used to
distribute the cost of user-space memory barriers asymmetrically by
transforming pairs of memory barriers into pairs consisting of
sys_membarrier() and a compiler barrier. For synchronization primitives
that distinguish between read-side and write-side (e.g. userspace RCU
[1], rwlocks), the read-side can be accelerated significantly by moving
the bulk of the memory barrier overhead to the write-side.
The existing applications of which I am aware that would be improved by
this system call are as follows:
* Through Userspace RCU library (http://urcu.so)
- DNS server (Knot DNS) https://www.knot-dns.cz/
- Network sniffer (http://netsniff-ng.org/)
- Distributed object storage (https://sheepdog.github.io/sheepdog/)
- User-space tracing (http://lttng.org)
- Network storage system (https://www.gluster.org/)
- Virtual routers (https://events.linuxfoundation.org/sites/events/files/slides/DPDK_RCU_0MQ.pdf)
- Financial software (https://lkml.org/lkml/2015/3/23/189)
Those projects use RCU in userspace to increase read-side speed and
scalability compared to locking. Especially in the case of RCU used by
libraries, sys_membarrier can speed up the read-side by moving the bulk of
the memory barrier cost to synchronize_rcu().
* Direct users of sys_membarrier
- core dotnet garbage collector (https://github.com/dotnet/coreclr/issues/198)
Microsoft core dotnet GC developers are planning to use the mprotect()
side-effect of issuing memory barriers through IPIs as a way to implement
Windows FlushProcessWriteBuffers() on Linux. They are referring to
sys_membarrier in their github thread, specifically stating that
sys_membarrier() is what they are looking for.
To explain the benefit of this scheme, let's introduce two example threads:
Thread A (non-frequent, e.g. executing liburcu synchronize_rcu())
Thread B (frequent, e.g. executing liburcu
rcu_read_lock()/rcu_read_unlock())
In a scheme where all smp_mb() in thread A are ordering memory accesses
with respect to smp_mb() present in Thread B, we can change each
smp_mb() within Thread A into calls to sys_membarrier() and each
smp_mb() within Thread B into compiler barriers "barrier()".
Before the change, we had, for each smp_mb() pairs:
Thread A Thread B
previous mem accesses previous mem accesses
smp_mb() smp_mb()
following mem accesses following mem accesses
After the change, these pairs become:
Thread A Thread B
prev mem accesses prev mem accesses
sys_membarrier() barrier()
follow mem accesses follow mem accesses
As we can see, there are two possible scenarios: either Thread B memory
accesses do not happen concurrently with Thread A accesses (1), or they
do (2).
1) Non-concurrent Thread A vs Thread B accesses:
Thread A Thread B
prev mem accesses
sys_membarrier()
follow mem accesses
prev mem accesses
barrier()
follow mem accesses
In this case, thread B accesses will be weakly ordered. This is OK,
because at that point, thread A is not particularly interested in
ordering them with respect to its own accesses.
2) Concurrent Thread A vs Thread B accesses
Thread A Thread B
prev mem accesses prev mem accesses
sys_membarrier() barrier()
follow mem accesses follow mem accesses
In this case, thread B accesses, which are ensured to be in program
order thanks to the compiler barrier, will be "upgraded" to full
smp_mb() by synchronize_sched().
* Benchmarks
On Intel Xeon E5405 (8 cores)
(one thread is calling sys_membarrier, the other 7 threads are busy
looping)
1000 non-expedited sys_membarrier calls in 33s =3D 33 milliseconds/call.
* User-space user of this system call: Userspace RCU library
Both the signal-based and the sys_membarrier userspace RCU schemes
permit us to remove the memory barrier from the userspace RCU
rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
accelerating them. These memory barriers are replaced by compiler
barriers on the read-side, and all matching memory barriers on the
write-side are turned into an invocation of a memory barrier on all
active threads in the process. By letting the kernel perform this
synchronization rather than dumbly sending a signal to every process
threads (as we currently do), we diminish the number of unnecessary wake
ups and only issue the memory barriers on active threads. Non-running
threads do not need to execute such barrier anyway, because these are
implied by the scheduler context switches.
Results in liburcu:
Operations in 10s, 6 readers, 2 writers:
memory barriers in reader: 1701557485 reads, 2202847 writes
signal-based scheme: 9830061167 reads, 6700 writes
sys_membarrier: 9952759104 reads, 425 writes
sys_membarrier (dyn. check): 7970328887 reads, 425 writes
The dynamic sys_membarrier availability check adds some overhead to
the read-side compared to the signal-based scheme, but besides that,
sys_membarrier slightly outperforms the signal-based scheme. However,
this non-expedited sys_membarrier implementation has a much slower grace
period than signal and memory barrier schemes.
Besides diminishing the number of wake-ups, one major advantage of the
membarrier system call over the signal-based scheme is that it does not
need to reserve a signal. This plays much more nicely with libraries,
and with processes injected into for tracing purposes, for which we
cannot expect that signals will be unused by the application.
An expedited version of this system call can be added later on to speed
up the grace period. Its implementation will likely depend on reading
the cpu_curr()->mm without holding each CPU's rq lock.
This patch adds the system call to x86 and to asm-generic.
[1] http://urcu.so
membarrier(2) man page:
MEMBARRIER(2) Linux Programmer's Manual MEMBARRIER(2)
NAME
membarrier - issue memory barriers on a set of threads
SYNOPSIS
#include <linux/membarrier.h>
int membarrier(int cmd, int flags);
DESCRIPTION
The cmd argument is one of the following:
MEMBARRIER_CMD_QUERY
Query the set of supported commands. It returns a bitmask of
supported commands.
MEMBARRIER_CMD_SHARED
Execute a memory barrier on all threads running on the system.
Upon return from system call, the caller thread is ensured that
all running threads have passed through a state where all memory
accesses to user-space addresses match program order between
entry to and return from the system call (non-running threads
are de facto in such a state). This covers threads from all pro=E2=80=90
cesses running on the system. This command returns 0.
The flags argument needs to be 0. For future extensions.
All memory accesses performed in program order from each targeted
thread is guaranteed to be ordered with respect to sys_membarrier(). If
we use the semantic "barrier()" to represent a compiler barrier forcing
memory accesses to be performed in program order across the barrier,
and smp_mb() to represent explicit memory barriers forcing full memory
ordering across the barrier, we have the following ordering table for
each pair of barrier(), sys_membarrier() and smp_mb():
The pair ordering is detailed as (O: ordered, X: not ordered):
barrier() smp_mb() sys_membarrier()
barrier() X X O
smp_mb() X O O
sys_membarrier() O O O
RETURN VALUE
On success, these system calls return zero. On error, -1 is returned,
and errno is set appropriately. For a given command, with flags
argument set to 0, this system call is guaranteed to always return the
same value until reboot.
ERRORS
ENOSYS System call is not implemented.
EINVAL Invalid arguments.
Linux 2015-04-15 MEMBARRIER(2)
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Nicholas Miell <nmiell@comcast.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Alan Cox <gnomes@lxorguk.ukuu.org.uk>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Pranith Kumar <bobby.prani@gmail.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Shuah Khan <shuahkh@osg.samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* pm-cpu:
kernel/cpu_pm: fix cpu_cluster_pm_exit comment
* pm-cpuidle:
cpuidle/coupled: Add sanity check for safe_state_index
* pm-domains:
staging: board: Migrate away from __pm_genpd_name_add_device()
PM / Domains: Ensure subdomain is not in use before removing
PM / Domains: Try power off masters in error path of __pm_genpd_poweron()