Commit graph

18401 commits

Author SHA1 Message Date
Masanari Iida
cd3bd4e628 sched/fair: Fix 'make xmldocs' warning caused by missing description
This patch fix following warning caused by missing description
"overload" in kernel/sched/fair.c

Warning(.//kernel/sched/fair.c:5906): No description found for
parameter 'overload'

Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1406518686-7274-1-git-send-email-standby24x7@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-28 10:04:14 +02:00
Steven Rostedt
c13db6b131 sched: Use macro for magic number of -1 for setparam
Instead of passing around a magic number -1 for the sched_setparam()
policy, use a more descriptive macro name like SETPARAM_POLICY.

[ based on top of Daniel's sched_setparam() fix ]

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Daniel Bristot de Oliveira<bristot@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140723112826.6ed6cbce@gandalf.local.home
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-28 10:04:14 +02:00
Peter Zijlstra
6ae72dff37 sched: Robustify topology setup
We hard assume that higher topology levels are supersets of lower
levels.

Detect, warn and try to fixup when we encounter this violated.

Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Josh Boyer <jwboyer@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Bruno Wolff III <bruno@wolff.to>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140722094740.GJ12054@laptop.lan
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-28 10:04:13 +02:00
Ingo Molnar
ca5bc6cd5d Merge branch 'sched/urgent' into sched/core, to merge fixes before applying new changes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-28 10:03:00 +02:00
Daniel Bristot de Oliveira
d8d28c8f00 sched: Fix sched_setparam() policy == -1 logic
The scheduler uses policy == -1 to preserve the current policy state to
implement sched_setparam(). But, as (int) -1 is equals to 0xffffffff,
it's matching the if (policy & SCHED_RESET_ON_FORK) on
_sched_setscheduler(). This match changes the policy value to an
invalid value, breaking the sched_setparam() syscall.

This patch checks policy == -1 before check the SCHED_RESET_ON_FORK flag.

The following program shows the bug:

int main(void)
{
	struct sched_param param = {
		.sched_priority = 5,
	};

	sched_setscheduler(0, SCHED_FIFO, &param);
	param.sched_priority = 1;
	sched_setparam(0, &param);
	param.sched_priority = 0;
	sched_getparam(0, &param);
	if (param.sched_priority != 1)
		printf("failed priority setting (found %d instead of 1)\n",
			param.sched_priority);
	else
		printf("priority setting fine\n");
}

Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: <stable@vger.kernel.org> # 3.14+
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
Fixes: 7479f3c9cf "sched: Move SCHED_RESET_ON_FORK into attr::sched_flags"
Link: http://lkml.kernel.org/r/9ebe0566a08dbbb3999759d3f20d6004bb2dbcfa.1406079891.git.bristot@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-28 10:00:08 +02:00
Linus Torvalds
9dae0a3fc4 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Thomas Gleixner:
 "A bunch of fixes for perf and kprobes:
   - revert a commit that caused a perf group regression
   - silence dmesg spam
   - fix kprobe probing errors on ia64 and ppc64
   - filter kprobe faults from userspace
   - lockdep fix for perf exit path
   - prevent perf #GP in KVM guest
   - correct perf event and filters"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  kprobes: Fix "Failed to find blacklist" probing errors on ia64 and ppc64
  kprobes/x86: Don't try to resolve kprobe faults from userspace
  perf/x86/intel: Avoid spamming kernel log for BTS buffer failure
  perf/x86/intel: Protect LBR and extra_regs against KVM lying
  perf: Fix lockdep warning on process exit
  perf/x86/intel/uncore: Fix SNB-EP/IVT Cbox filter mappings
  perf/x86/intel: Use proper dTLB-load-misses event on IvyBridge
  perf: Revert ("perf: Always destroy groups on exit")
2014-07-27 09:57:16 -07:00
Tony Luck
58d4e21e50 tracing: Fix wraparound problems in "uptime" trace clock
The "uptime" trace clock added in:

    commit 8aacf017b0
    tracing: Add "uptime" trace clock that uses jiffies

has wraparound problems when the system has been up more
than 1 hour 11 minutes and 34 seconds. It converts jiffies
to nanoseconds using:
        (u64)jiffies_to_usecs(jiffy) * 1000ULL
but since jiffies_to_usecs() only returns a 32-bit value, it
truncates at 2^32 microseconds.  An additional problem on 32-bit
systems is that the argument is "unsigned long", so fixing the
return value only helps until 2^32 jiffies (49.7 days on a HZ=1000
system).

Avoid these problems by using jiffies_64 as our basis, and
not converting to nanoseconds (we do convert to clock_t because
user facing API must not be dependent on internal kernel
HZ values).

Link: http://lkml.kernel.org/p/99d63c5bfe9b320a3b428d773825a37095bf6a51.1405708254.git.tony.luck@intel.com

Cc: stable@vger.kernel.org # 3.10+
Fixes: 8aacf017b0 "tracing: Add "uptime" trace clock that uses jiffies"
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-07-21 09:56:12 -04:00
Linus Torvalds
d057190925 Merge branch 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fixes from Thomas Gleixner:
 "The locking department delivers:

   - A rather large and intrusive bundle of fixes to address serious
     performance regressions introduced by the new rwsem / mcs
     technology.  Simpler solutions have been discussed, but they would
     have been ugly bandaids with more risk than doing the right thing.

   - Make the rwsem spin on owner technology opt-in for architectures
     and enable it only on the known to work ones.

   - A few fixes to the lockdep userspace library"

* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  locking/rwsem: Add CONFIG_RWSEM_SPIN_ON_OWNER
  locking/mutex: Disable optimistic spinning on some architectures
  locking/rwsem: Reduce the size of struct rw_semaphore
  locking/rwsem: Rename 'activity' to 'count'
  locking/spinlocks/mcs: Micro-optimize osq_unlock()
  locking/spinlocks/mcs: Introduce and use init macro and function for osq locks
  locking/spinlocks/mcs: Convert osq lock to atomic_t to reduce overhead
  locking/spinlocks/mcs: Rename optimistic_spin_queue() to optimistic_spin_node()
  locking/rwsem: Allow conservative optimistic spinning when readers have lock
  tools/liblockdep: Account for bitfield changes in lockdeps lock_acquire
  tools/liblockdep: Remove debug print left over from development
  tools/liblockdep: Fix comparison of a boolean value with a value of 2
2014-07-19 06:27:55 -10:00
Linus Torvalds
d1743b810d Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Thomas Gleixner:
 "Prevent a possible divide by zero in the debugging code"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched: Fix possible divide by zero in avg_atom() calculation
2014-07-19 06:26:43 -10:00
Linus Torvalds
b495c23cd4 Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fix from Thomas Gleixner:
 "A single fix for a long standing issue in the alarm timer subsystem,
  which was noticed recently when people finally started to use alarm
  timers for serious work"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  alarmtimer: Fix bug where relative alarm timers were treated as absolute
2014-07-19 06:25:03 -10:00
Linus Torvalds
da5b99b454 Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU fixes from Thomas Gleixner:
 "Two RCU patches:
   - Address a serious performance regression on open/close caused by
     commit ac1bea8578 ("Make cond_resched() report RCU quiescent
     states")
   - Export RCU debug functions.  Not a regression, but enablement to
     address a serious recursion bug in the sl*b allocators in 3.17"

* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  rcu: Reduce overhead of cond_resched() checks for RCU
  rcu: Export debug_init_rcu_head() and and debug_init_rcu_head()
2014-07-19 06:23:27 -10:00
Linus Torvalds
084c9cac39 ACPI and power management fixes for 3.16-rc6
- Fix for a recently introduced NULL pointer dereference in the core
    system suspend code occuring when platforms without ACPI attempt to
    use the "freeze" sleep state from Zhang Rui.
 
  - Fix for a recently introduced build warning in cpufreq headers from
    Brian W Hart.
 
  - Fix for a 3.13 cpufreq regression related to sysem resume that
    triggers on some systems with multiple CPU clusters from Viresh Kumar.
 
  - Fix for a 3.4 regression in request_firmware() resulting in
    WARN_ON()s on some systems during system resume from Takashi Iwai.
 
  - Revert of the ACPI video commit that changed the default value of
    the video.brightness_switch_enabled command line argument to 0 as
    it has been reported to break existing setups.
 
  - ACPI device enumeration documentation update to take recent code
    changes into account and make the documentation match the code again
    from Darren Hart.
 
  - Fixes for the sa1110, imx6q, kirkwood, and cpu0 cpufreq drivers
    from Linus Walleij, Nicolas Del Piano, Quentin Armitage, Viresh Kumar.
 
  - New ACPI video blacklist entry for HP ProBook 4540s from Hans de Goede.
 
 /
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABCAAGBQJTyHqVAAoJEILEb/54YlRxsV4P/0wiCc/VDxC614uCRcNVv0mH
 4BtAvhV5Oe32M5MNYrDO4GefRqFSgz6xZ2B55ioTyNML2b7rcXJqlS4r6P6ESlog
 ccwOSsEvMmmOpMRynoQJq9vGybPdhUiA8G4ABeC/Y57MfKEbS58dw9Fi/BXpII+A
 RZOzHPNvzFExkw6bBxO3l66uDTeZPVZQOs5hLRSrxa1s99+E/N4aa99nE26YPR6b
 U15cfuYMigfpziO3k4OtCka3sZYMO0XzVZjH62rJJBwh9G24uTzoL/lcOpLbZyaW
 DGLm/3LoH46WGGIrkIiWKplzhiSQ0BeOzlK9bAX6L11UMLMO0DOtKoZl6H37bS/J
 XLbSpdH22WGaF+QNWu3yPqLOpSZZxjYFQ0u8jDL00EzbIePpx4ynKbfCG2Q+EhcT
 siDpXsC084rznWaYPKAyTitrhrqsWCzuI12BLO5AgJUvQ/jdyGgim8UWLp7AV1K7
 Tc40MJdR7hUpJigg3+ORj88dYJvQXRYifbfTTbmQ0VEf9g5PXV5wLDvWY5Uo99SA
 pYnY+jLcc13K7eDd+dm3MhSiW6H+jvzNq2WCFHqLzCG64Dj11HMHevhIRWEueekY
 fF3JwqDByvj/f4PW8wwqlAnbNzhVEDzkJHHsCXj90KFCoBG3KIKU2G0ISwMbVZkJ
 y9vflHb4Vh/WK9Puyffz
 =kuyY
 -----END PGP SIGNATURE-----

Merge tag 'pm+acpi-3.16-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull ACPI and power management fixes from Rafael Wysocki:
 "These are a few recent regression fixes, a revert of the ACPI video
  commit I promised, a system resume fix related to request_firmware(),
  an ACPI video quirk for one more Win8-oriented BIOS, an ACPI device
  enumeration documentation update and a few fixes for ARM cpufreq
  drivers.

  Specifics:

   - Fix for a recently introduced NULL pointer dereference in the core
     system suspend code occuring when platforms without ACPI attempt to
     use the "freeze" sleep state from Zhang Rui.

   - Fix for a recently introduced build warning in cpufreq headers from
     Brian W Hart.

   - Fix for a 3.13 cpufreq regression related to sysem resume that
     triggers on some systems with multiple CPU clusters from Viresh
     Kumar.

   - Fix for a 3.4 regression in request_firmware() resulting in
     WARN_ON()s on some systems during system resume from Takashi Iwai.

   - Revert of the ACPI video commit that changed the default value of
     the video.brightness_switch_enabled command line argument to 0 as
     it has been reported to break existing setups.

   - ACPI device enumeration documentation update to take recent code
     changes into account and make the documentation match the code
     again from Darren Hart.

   - Fixes for the sa1110, imx6q, kirkwood, and cpu0 cpufreq drivers
     from Linus Walleij, Nicolas Del Piano, Quentin Armitage, Viresh
     Kumar.

   - New ACPI video blacklist entry for HP ProBook 4540s from Hans de
     Goede"

* tag 'pm+acpi-3.16-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  cpufreq: make table sentinel macros unsigned to match use
  cpufreq: move policy kobj to policy->cpu at resume
  cpufreq: cpu0: OPPs can be populated at runtime
  cpufreq: kirkwood: Reinstate cpufreq driver for ARCH_KIRKWOOD
  cpufreq: imx6q: Select PM_OPP
  cpufreq: sa1110: set memory type for h3600
  ACPI / video: Add use_native_backlight quirk for HP ProBook 4540s
  PM / sleep: fix freeze_ops NULL pointer dereferences
  PM / sleep: Fix request_firmware() error at resume
  Revert "ACPI / video: change acpi-video brightness_switch_enabled default to 0"
  ACPI / documentation: Remove reference to acpi_platform_device_ids from enumeration.txt
2014-07-18 20:28:27 -10:00
Masami Hiramatsu
d81b4253b0 kprobes: Fix "Failed to find blacklist" probing errors on ia64 and ppc64
On ia64 and ppc64, function pointers do not point to the
entry address of the function, but to the address of a
function descriptor (which contains the entry address and misc
data).

Since the kprobes code passes the function pointer stored
by NOKPROBE_SYMBOL() to kallsyms_lookup_size_offset() for
initalizing its blacklist, it fails and reports many errors,
such as:

  Failed to find blacklist 0001013168300000
  Failed to find blacklist 0001013000f0a000
  [...]

To fix this bug, use arch_deref_entry_point() to get the
function entry address for kallsyms_lookup_size_offset()
instead of the raw function pointer.

Suzuki also pointed out that blacklist entries should also
be updated as well.

Reported-by: Tony Luck <tony.luck@gmail.com>
Fixed-by: Suzuki K. Poulose <suzuki@in.ibm.com>
Tested-by: Tony Luck <tony.luck@intel.com>
Tested-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (for powerpc)
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: sparse@chrisli.org
Cc: Paul Mackerras <paulus@samba.org>
Cc: akataria@vmware.com
Cc: anil.s.keshavamurthy@intel.com
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: yrl.pp-manager.tt@hitachi.com
Cc: Kevin Hao <haokexin@gmail.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: rdunlap@infradead.org
Cc: dl9pf@gmx.de
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: linux-ia64@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Link: http://lkml.kernel.org/r/20140717114411.13401.2632.stgit@kbuild-fedora.novalocal
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-18 06:23:40 +02:00
Linus Torvalds
22d368544b A few more fixes for ftrace infrastructure.
I was cleaning out my INBOX and found two fixes from zhangwei from
 a year ago that were lost in my mail. These fix an inconsistency between
 trace_puts() and the way trace_printk() works. The reason this is
 important to fix is because when trace_printk() doesn't have any
 arguments, it turns into a trace_puts(). Not being able to enable a
 stack trace against trace_printk() because it does not have any arguments
 is quite confusing. Also, the fix is rather trivial and low risk.
 
 While porting some changes to PowerPC I discovered that it still has
 the function graph tracer filter bug that if you also enable stack tracing
 the function graph tracer filter is ignored. I fixed that up.
 
 Finally, Martin Lau, fixed a bug that would cause readers of the
 ftrace ring buffer to block forever even though it was suppose to be
 NONBLOCK.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJTxfAHAAoJEKQekfcNnQGueDkH/jQ/FgNPw+GbKuNpZwWqMQds
 ivGM8yb6NgmlDQx60oBOL9TduKN2LCqW3HWDoyzduSozuLUxpRTPWtVThKo0Q9Fp
 FPHWewdrX7DwWkyEqeG7FAAhjqPedgEZoDOwy7BArzJYpHcpex42trDY4BhjPrlM
 ESA+/M+Xm1vGCJzcNSqNHYE/mF/tHptdM1mlLZ3g9ql49MiTbQxEMbHX7lih24sv
 AzNP1SlisYN5CkX4hbRGhCYkZdmdVZt/xiyBAQRWCGFx+jsdJbMXDJnevDCdU9Mv
 GNtKwqsi1/GeE5nvKxZyQKsDEE98Pd8LyQ0kDqKpAfTutqNE4mpWJkCxUqTmE4Q=
 =aFqO
 -----END PGP SIGNATURE-----

Merge tag 'trace-fixes-v3.16-rc5-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:
 "A few more fixes for ftrace infrastructure.

  I was cleaning out my INBOX and found two fixes from zhangwei from a
  year ago that were lost in my mail.  These fix an inconsistency
  between trace_puts() and the way trace_printk() works.  The reason
  this is important to fix is because when trace_printk() doesn't have
  any arguments, it turns into a trace_puts().  Not being able to enable
  a stack trace against trace_printk() because it does not have any
  arguments is quite confusing.  Also, the fix is rather trivial and low
  risk.

  While porting some changes to PowerPC I discovered that it still has
  the function graph tracer filter bug that if you also enable stack
  tracing the function graph tracer filter is ignored.  I fixed that up.

  Finally, Martin Lau, fixed a bug that would cause readers of the
  ftrace ring buffer to block forever even though it was suppose to be
  NONBLOCK"

This also includes the fix from an earlier pull request:

 "Oleg Nesterov fixed a memory leak that happens if a user creates a
  tracing instance, sets up a filter in an event, and then removes that
  instance.  The filter allocates memory that is never freed when the
  instance is destroyed"

* tag 'trace-fixes-v3.16-rc5-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  ring-buffer: Fix polling on trace_pipe
  tracing: Add TRACE_ITER_PRINTK flag check in __trace_puts/__trace_bputs
  tracing: Fix graph tracer with stack tracer on other archs
  tracing: Add ftrace_trace_stack into __trace_puts/__trace_bputs
  tracing: instance_rmdir() leaks ftrace_event_file->filter
2014-07-17 07:57:33 -10:00
Linus Torvalds
d14aef3872 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "Tooling fixes and an Intel PMU driver fixlet"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf: Do not allow optimized switch for non-cloned events
  perf/x86/intel: ignore CondChgd bit to avoid false NMI handling
  perf symbols: Get kernel start address by symbol name
  perf tools: Fix segfault in cumulative.callchain report
2014-07-16 10:10:27 -10:00
NeilBrown
c1221321b7 sched: Allow wait_on_bit_action() functions to support a timeout
It is currently not possible for various wait_on_bit functions
to implement a timeout.

While the "action" function that is called to do the waiting
could certainly use schedule_timeout(), there is no way to carry
forward the remaining timeout after a false wake-up.
As false-wakeups a clearly possible at least due to possible
hash collisions in bit_waitqueue(), this is a real problem.

The 'action' function is currently passed a pointer to the word
containing the bit being waited on.  No current action functions
use this pointer.  So changing it to something else will be a
little noisy but will have no immediate effect.

This patch changes the 'action' function to take a pointer to
the "struct wait_bit_key", which contains a pointer to the word
containing the bit so nothing is really lost.

It also adds a 'private' field to "struct wait_bit_key", which
is initialized to zero.

An action function can now implement a timeout with something
like

static int timed_out_waiter(struct wait_bit_key *key)
{
	unsigned long waited;
	if (key->private == 0) {
		key->private = jiffies;
		if (key->private == 0)
			key->private -= 1;
	}
	waited = jiffies - key->private;
	if (waited > 10 * HZ)
		return -EAGAIN;
	schedule_timeout(waited - 10 * HZ);
	return 0;
}

If any other need for context in a waiter were found it would be
easy to use ->private for some other purpose, or even extend
"struct wait_bit_key".

My particular need is to support timeouts in nfs_release_page()
to avoid deadlocks with loopback mounted NFS.

While wait_on_bit_timeout() would be a cleaner interface, it
will not meet my need.  I need the timeout to be sensitive to
the state of the connection with the server, which could change.
 So I need to use an 'action' interface.

Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Steve French <sfrench@samba.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140707051604.28027.41257.stgit@notabene.brown
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-16 15:10:41 +02:00
NeilBrown
743162013d sched: Remove proliferation of wait_on_bit() action functions
The current "wait_on_bit" interface requires an 'action'
function to be provided which does the actual waiting.
There are over 20 such functions, many of them identical.
Most cases can be satisfied by one of just two functions, one
which uses io_schedule() and one which just uses schedule().

So:
 Rename wait_on_bit and        wait_on_bit_lock to
        wait_on_bit_action and wait_on_bit_lock_action
 to make it explicit that they need an action function.

 Introduce new wait_on_bit{,_lock} and wait_on_bit{,_lock}_io
 which are *not* given an action function but implicitly use
 a standard one.
 The decision to error-out if a signal is pending is now made
 based on the 'mode' argument rather than being encoded in the action
 function.

 All instances of the old wait_on_bit and wait_on_bit_lock which
 can use the new version have been changed accordingly and their
 action functions have been discarded.
 wait_on_bit{_lock} does not return any specific error code in the
 event of a signal so the caller must check for non-zero and
 interpolate their own error code as appropriate.

The wait_on_bit() call in __fscache_wait_on_invalidate() was
ambiguous as it specified TASK_UNINTERRUPTIBLE but used
fscache_wait_bit_interruptible as an action function.
David Howells confirms this should be uniformly
"uninterruptible"

The main remaining user of wait_on_bit{,_lock}_action is NFS
which needs to use a freezer-aware schedule() call.

A comment in fs/gfs2/glock.c notes that having multiple 'action'
functions is useful as they display differently in the 'wchan'
field of 'ps'. (and /proc/$PID/wchan).
As the new bit_wait{,_io} functions are tagged "__sched", they
will not show up at all, but something higher in the stack.  So
the distinction will still be visible, only with different
function names (gds2_glock_wait versus gfs2_glock_dq_wait in the
gfs2/glock.c case).

Since first version of this patch (against 3.15) two new action
functions appeared, on in NFS and one in CIFS.  CIFS also now
uses an action function that makes the same freezer aware
schedule call as NFS.

Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: David Howells <dhowells@redhat.com> (fscache, keys)
Acked-by: Steven Whitehouse <swhiteho@redhat.com> (gfs2)
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Steve French <sfrench@samba.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140707051603.28027.72349.stgit@notabene.brown
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-16 15:10:39 +02:00
Ingo Molnar
d26fad5b38 Linux 3.16-rc5
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJTwvRvAAoJEHm+PkMAQRiG8CoIAJucWkj+MJFFoDXjR9hfI8U7
 /WeQLJP0GpWGMXd2KznX9epCuw5rsuaPAxCy1HFDNOa7OtNYacWrsIhByxOIDLwL
 YjDB9+fpMMPFWsr+LPJa8Ombh/TveCS77w6Pt5VMZFwvIKujiNK/C3MdxjReH5Gr
 iTGm8x7nEs2D6L2+5sQVlhXot/97phxIlBSP6wPXEiaztNZ9/JZi905Xpgq+WU16
 ZOA8MlJj1TQD4xcWyUcsQ5REwIOdQ6xxPF00wv/12RFela+Puy4JLAilnV6Mc12U
 fwYOZKbUNBS8rjfDDdyX3sljV1L5iFFqKkW3WFdnv/z8ZaZSo5NupWuavDnifKw=
 =6Q8o
 -----END PGP SIGNATURE-----

Merge tag 'v3.16-rc5' into sched/core, to refresh the branch before applying bigger tree-wide changes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-16 15:10:07 +02:00
Davidlohr Bueso
5db6c6fefb locking/rwsem: Add CONFIG_RWSEM_SPIN_ON_OWNER
Just like with mutexes (CONFIG_MUTEX_SPIN_ON_OWNER),
encapsulate the dependencies for rwsem optimistic spinning.
No logical changes here as it continues to depend on both
SMP and the XADD algorithm variant.

Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Acked-by: Jason Low <jason.low2@hp.com>
[ Also make it depend on ARCH_SUPPORTS_ATOMIC_RMW. ]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1405112406-13052-2-git-send-email-davidlohr@hp.com
Cc: aswin@hp.com
Cc: Chris Mason <clm@fb.com>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Cc: Josef Bacik <jbacik@fusionio.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Waiman Long <Waiman.Long@hp.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-16 14:57:13 +02:00
Peter Zijlstra
4badad352a locking/mutex: Disable optimistic spinning on some architectures
The optimistic spin code assumes regular stores and cmpxchg() play nice;
this is found to not be true for at least: parisc, sparc32, tile32,
metag-lock1, arc-!llsc and hexagon.

There is further wreckage, but this in particular seemed easy to
trigger, so blacklist this.

Opt in for known good archs.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: David Miller <davem@davemloft.net>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: James Bottomley <James.Bottomley@hansenpartnership.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Jason Low <jason.low2@hp.com>
Cc: Waiman Long <waiman.long@hp.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: John David Anglin <dave.anglin@bell.net>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Cc: stable@vger.kernel.org
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Will Deacon <will.deacon@arm.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: sparclinux@vger.kernel.org
Link: http://lkml.kernel.org/r/20140606175316.GV13930@laptop.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-16 14:57:07 +02:00
Peter Zijlstra
13b9a962a2 locking/rwsem: Rename 'activity' to 'count'
There are two definitions of struct rw_semaphore, one in linux/rwsem.h
and one in linux/rwsem-spinlock.h.

For some reason they have different names for the initial field. This
makes it impossible to use C99 named initialization for
__RWSEM_INITIALIZER() -- or we have to duplicate that entire thing
along with the structure definitions.

The simpler patch is renaming the rwsem-spinlock variant to match the
regular rwsem.

This allows us to switch to C99 named initialization.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/n/tip-bmrZolsbGmautmzrerog27io@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-16 14:56:55 +02:00
Peter Zijlstra
e720fff634 sched/numa: Revert "Use effective_load() to balance NUMA loads"
Due to divergent trees, Rik find that this patch is no longer
required.

Requested-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/n/tip-u6odkgkw8wz3m7orgsjfo5pi@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-16 13:38:23 +02:00
Jason Baron
5cd08fbfdb sched: Fix static_key race with sched_feat()
As pointed out by Andi Kleen, the usage of static keys can be racy in
sched_feat_disable() vs. sched_feat_enable(). Currently, we first check the
value of keys->enabled, and subsequently update the branch direction. This,
can be racy and can potentially leave the keys in an inconsistent state.

Take the i_mutex around these calls to resolve the race.

Reported-by: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Jason Baron <jbaron@akamai.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: rostedt@goodmis.org
Link: http://lkml.kernel.org/r/9d7780c83db26683955cd01e6bc654ee2586e67f.1404315388.git.jbaron@akamai.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-16 13:38:21 +02:00
Jason Baron
6e76ea8a82 sched: Remove extra static_key*() function indirection
I think its a bit simpler without having to follow an extra layer of static
inline fuctions. No functional change just cosmetic.

Signed-off-by: Jason Baron <jbaron@akamai.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: rostedt@goodmis.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/2ce52233ce200faad93b6029d90f1411cd926667.1404315388.git.jbaron@akamai.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-16 13:38:20 +02:00
xiaofeng.yan
1b09d29bc0 sched/rt: Fix replenish_dl_entity() comments to match the current upstream code
Signed-off-by: xiaofeng.yan <xiaofeng.yan@huawei.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1404712744-16986-1-git-send-email-xiaofeng.yan@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-16 13:38:20 +02:00
Kirill Tkhai
8875125efe sched: Transform resched_task() into resched_curr()
We always use resched_task() with rq->curr argument.
It's not possible to reschedule any task but rq's current.

The patch introduces resched_curr(struct rq *) to
replace all of the repeating patterns. The main aim
is cleanup, but there is a little size profit too:

  (before)
	$ size kernel/sched/built-in.o
	   text	   data	    bss	    dec	    hex	filename
	155274	  16445	   7042	 178761	  2ba49	kernel/sched/built-in.o

	$ size vmlinux
	   text	   data	    bss	    dec	    hex	filename
	7411490	1178376	 991232	9581098	 92322a	vmlinux

  (after)
	$ size kernel/sched/built-in.o
	   text	   data	    bss	    dec	    hex	filename
	155130	  16445	   7042	 178617	  2b9b9	kernel/sched/built-in.o

	$ size vmlinux
	   text	   data	    bss	    dec	    hex	filename
	7411362	1178376	 991232	9580970	 9231aa	vmlinux

	I was choosing between resched_curr() and resched_rq(),
	and the first name looks better for me.

A little lie in Documentation/trace/ftrace.txt. I have not
actually collected the tracing again. With a hope the patch
won't make execution times much worse :)

Signed-off-by: Kirill Tkhai <tkhai@yandex.ru>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20140628200219.1778.18735.stgit@localhost
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-16 13:38:19 +02:00
Oleg Nesterov
466af29bf4 sched/deadline: Kill task_struct->pi_top_task
Remove task_struct->pi_top_task. The only user, rt_mutex_setprio(),
can use a local.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@gmail.com>
Cc: Alex Thorlton <athorlton@sgi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Daeseok Youn <daeseok.youn@gmail.com>
Cc: Dario Faggioli <raistlin@linux.it>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Dempsky <mdempsky@chromium.org>
Cc: Michal Simek <michal.simek@xilinx.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Link: http://lkml.kernel.org/r/20140606165206.GB29465@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-16 13:38:18 +02:00
Mateusz Guzik
b0ab99e773 sched: Fix possible divide by zero in avg_atom() calculation
proc_sched_show_task() does:

  if (nr_switches)
	do_div(avg_atom, nr_switches);

nr_switches is unsigned long and do_div truncates it to 32 bits, which
means it can test non-zero on e.g. x86-64 and be truncated to zero for
division.

Fix the problem by using div64_ul() instead.

As a side effect calculations of avg_atom for big nr_switches are now correct.

Signed-off-by: Mateusz Guzik <mguzik@redhat.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: stable@vger.kernel.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1402750809-31991-1-git-send-email-mguzik@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-16 13:36:07 +02:00
Jason Low
33ecd2083a locking/spinlocks/mcs: Micro-optimize osq_unlock()
In the unlock function of the cancellable MCS spinlock, the first
thing we do is to retrive the current CPU's osq node. However, due to
the changes made in the previous patch, in the common case where the
lock is not contended, we wouldn't need to access the current CPU's
osq node anymore.

This patch optimizes this by only retriving this CPU's osq node
after we attempt the initial cmpxchg to unlock the osq and found
that its contended.

Signed-off-by: Jason Low <jason.low2@hp.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Scott Norton <scott.norton@hp.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Waiman Long <waiman.long@hp.com>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1405358872-3732-5-git-send-email-jason.low2@hp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-16 13:28:06 +02:00
Jason Low
4d9d951e6b locking/spinlocks/mcs: Introduce and use init macro and function for osq locks
Currently, we initialize the osq lock by directly setting the lock's values. It
would be preferable if we use an init macro to do the initialization like we do
with other locks.

This patch introduces and uses a macro and function for initializing the osq lock.

Signed-off-by: Jason Low <jason.low2@hp.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Scott Norton <scott.norton@hp.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Waiman Long <waiman.long@hp.com>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <jbacik@fusionio.com>
Link: http://lkml.kernel.org/r/1405358872-3732-4-git-send-email-jason.low2@hp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-16 13:28:05 +02:00
Jason Low
90631822c5 locking/spinlocks/mcs: Convert osq lock to atomic_t to reduce overhead
The cancellable MCS spinlock is currently used to queue threads that are
doing optimistic spinning. It uses per-cpu nodes, where a thread obtaining
the lock would access and queue the local node corresponding to the CPU that
it's running on. Currently, the cancellable MCS lock is implemented by using
pointers to these nodes.

In this patch, instead of operating on pointers to the per-cpu nodes, we
store the CPU numbers in which the per-cpu nodes correspond to in atomic_t.
A similar concept is used with the qspinlock.

By operating on the CPU # of the nodes using atomic_t instead of pointers
to those nodes, this can reduce the overhead of the cancellable MCS spinlock
by 32 bits (on 64 bit systems).

Signed-off-by: Jason Low <jason.low2@hp.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Scott Norton <scott.norton@hp.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Waiman Long <waiman.long@hp.com>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Chris Mason <clm@fb.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Josef Bacik <jbacik@fusionio.com>
Link: http://lkml.kernel.org/r/1405358872-3732-3-git-send-email-jason.low2@hp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-16 13:28:04 +02:00
Jason Low
046a619d8e locking/spinlocks/mcs: Rename optimistic_spin_queue() to optimistic_spin_node()
Currently, the per-cpu nodes structure for the cancellable MCS spinlock is
named "optimistic_spin_queue". However, in a follow up patch in the series
we will be introducing a new structure that serves as the new "handle" for
the lock. It would make more sense if that structure is named
"optimistic_spin_queue". Additionally, since the current use of the
"optimistic_spin_queue" structure are  "nodes", it might be better if we
rename them to "node" anyway.

This preparatory patch renames all current "optimistic_spin_queue"
to "optimistic_spin_node".

Signed-off-by: Jason Low <jason.low2@hp.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Scott Norton <scott.norton@hp.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Waiman Long <waiman.long@hp.com>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Chris Mason <clm@fb.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Josef Bacik <jbacik@fusionio.com>
Link: http://lkml.kernel.org/r/1405358872-3732-2-git-send-email-jason.low2@hp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-16 13:28:03 +02:00
Jason Low
37e9562453 locking/rwsem: Allow conservative optimistic spinning when readers have lock
Commit 4fc828e24c ("locking/rwsem: Support optimistic spinning")
introduced a major performance regression for workloads such as
xfs_repair which mix read and write locking of the mmap_sem across
many threads. The result was xfs_repair ran 5x slower on 3.16-rc2
than on 3.15 and using 20x more system CPU time.

Perf profiles indicate in some workloads that significant time can
be spent spinning on !owner. This is because we don't set the lock
owner when readers(s) obtain the rwsem.

In this patch, we'll modify rwsem_can_spin_on_owner() such that we'll
return false if there is no lock owner. The rationale is that if we
just entered the slowpath, yet there is no lock owner, then there is
a possibility that a reader has the lock. To be conservative, we'll
avoid spinning in these situations.

This patch reduced the total run time of the xfs_repair workload from
about 4 minutes 24 seconds down to approximately 1 minute 26 seconds,
back to close to the same performance as on 3.15.

Retesting of AIM7, which were some of the workloads used to test the
original optimistic spinning code, confirmed that we still get big
performance gains with optimistic spinning, even with this additional
regression fix. Davidlohr found that while the 'custom' workload took
a performance hit of ~-14% to throughput for >300 users with this
additional patch, the overall gain with optimistic spinning is
still ~+45%. The 'disk' workload even improved by ~+15% at >1000 users.

Tested-by: Dave Chinner <dchinner@redhat.com>
Acked-by: Davidlohr Bueso <davidlohr@hp.com>
Signed-off-by: Jason Low <jason.low2@hp.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1404532172.2572.30.camel@j-VirtualBox
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-16 13:28:02 +02:00
Peter Zijlstra
4a1c0f262f perf: Fix lockdep warning on process exit
Sasha Levin reported:

> While fuzzing with trinity inside a KVM tools guest running the latest -next
> kernel I've stumbled on the following spew:
>
> ======================================================
> [ INFO: possible circular locking dependency detected ]
> 3.15.0-next-20140613-sasha-00026-g6dd125d-dirty #654 Not tainted
> -------------------------------------------------------
> trinity-c578/9725 is trying to acquire lock:
> (&(&pool->lock)->rlock){-.-...}, at: __queue_work (kernel/workqueue.c:1346)
>
> but task is already holding lock:
> (&ctx->lock){-.....}, at: perf_event_exit_task (kernel/events/core.c:7471 kernel/events/core.c:7533)
>
> which lock already depends on the new lock.

> 1 lock held by trinity-c578/9725:
> #0: (&ctx->lock){-.....}, at: perf_event_exit_task (kernel/events/core.c:7471 kernel/events/core.c:7533)
>
>  Call Trace:
>  dump_stack (lib/dump_stack.c:52)
>  print_circular_bug (kernel/locking/lockdep.c:1216)
>  __lock_acquire (kernel/locking/lockdep.c:1840 kernel/locking/lockdep.c:1945 kernel/locking/lockdep.c:2131 kernel/locking/lockdep.c:3182)
>  lock_acquire (./arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602)
>  _raw_spin_lock (include/linux/spinlock_api_smp.h:143 kernel/locking/spinlock.c:151)
>  __queue_work (kernel/workqueue.c:1346)
>  queue_work_on (kernel/workqueue.c:1424)
>  free_object (lib/debugobjects.c:209)
>  __debug_check_no_obj_freed (lib/debugobjects.c:715)
>  debug_check_no_obj_freed (lib/debugobjects.c:727)
>  kmem_cache_free (mm/slub.c:2683 mm/slub.c:2711)
>  free_task (kernel/fork.c:221)
>  __put_task_struct (kernel/fork.c:250)
>  put_ctx (include/linux/sched.h:1855 kernel/events/core.c:898)
>  perf_event_exit_task (kernel/events/core.c:907 kernel/events/core.c:7478 kernel/events/core.c:7533)
>  do_exit (kernel/exit.c:766)
>  do_group_exit (kernel/exit.c:884)
>  get_signal_to_deliver (kernel/signal.c:2347)
>  do_signal (arch/x86/kernel/signal.c:698)
>  do_notify_resume (arch/x86/kernel/signal.c:751)
>  int_signal (arch/x86/kernel/entry_64.S:600)

Urgh.. so the only way I can make that happen is through:

  perf_event_exit_task_context()
    raw_spin_lock(&child_ctx->lock);
    unclone_ctx(child_ctx)
      put_ctx(ctx->parent_ctx);
    raw_spin_unlock_irqrestore(&child_ctx->lock);

And we can avoid this by doing the change below.

I can't immediately see how this changed recently, but given that you
say it's easy to reproduce, lets fix this.

Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Dave Jones <davej@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140623141242.GB19860@laptop.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-16 13:18:42 +02:00
Peter Zijlstra
1903d50cba perf: Revert ("perf: Always destroy groups on exit")
Vince reported that commit 15a2d4de0e ("perf: Always destroy groups
on exit") causes a regression with grouped events. In particular his
read_group_attached.c test fails.

  https://github.com/deater/perf_event_tests/blob/master/tests/bugs/read_group_attached.c

Because of the context switch optimization in
perf_event_context_sched_out() the 'original' event may end up in the
child process and when that exits the change in the patch in question
destroys the actual grouping.

Therefore revert that change and only destroy inherited groups.

Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/n/tip-zedy3uktcp753q8fw8dagx7a@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-16 13:18:39 +02:00
Martin Lau
97b8ee8453 ring-buffer: Fix polling on trace_pipe
ring_buffer_poll_wait() should always put the poll_table to its wait_queue
even there is immediate data available.  Otherwise, the following epoll and
read sequence will eventually hang forever:

1. Put some data to make the trace_pipe ring_buffer read ready first
2. epoll_ctl(efd, EPOLL_CTL_ADD, trace_pipe_fd, ee)
3. epoll_wait()
4. read(trace_pipe_fd) till EAGAIN
5. Add some more data to the trace_pipe ring_buffer
6. epoll_wait() -> this epoll_wait() will block forever

~ During the epoll_ctl(efd, EPOLL_CTL_ADD,...) call in step 2,
  ring_buffer_poll_wait() returns immediately without adding poll_table,
  which has poll_table->_qproc pointing to ep_poll_callback(), to its
  wait_queue.
~ During the epoll_wait() call in step 3 and step 6,
  ring_buffer_poll_wait() cannot add ep_poll_callback() to its wait_queue
  because the poll_table->_qproc is NULL and it is how epoll works.
~ When there is new data available in step 6, ring_buffer does not know
  it has to call ep_poll_callback() because it is not in its wait queue.
  Hence, block forever.

Other poll implementation seems to call poll_wait() unconditionally as the very
first thing to do.  For example, tcp_poll() in tcp.c.

Link: http://lkml.kernel.org/p/20140610060637.GA14045@devbig242.prn2.facebook.com

Cc: stable@vger.kernel.org # 2.6.27
Fixes: 2a2cc8f7c4 "ftrace: allow the event pipe to be polled"
Reviewed-by: Chris Mason <clm@fb.com>
Signed-off-by: Martin Lau <kafai@fb.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-07-15 17:07:47 -04:00
zhangwei(Jovi)
f0160a5a29 tracing: Add TRACE_ITER_PRINTK flag check in __trace_puts/__trace_bputs
The TRACE_ITER_PRINTK check in __trace_puts/__trace_bputs is missing,
so add it, to be consistent with __trace_printk/__trace_bprintk.
Those functions are all called by the same function: trace_printk().

Link: http://lkml.kernel.org/p/51E7A7D6.8090900@huawei.com

Cc: stable@vger.kernel.org # 3.11+
Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-07-15 11:58:40 -04:00
Steven Rostedt (Red Hat)
5f8bf2d263 tracing: Fix graph tracer with stack tracer on other archs
Running my ftrace tests on PowerPC, it failed the test that checks
if function_graph tracer is affected by the stack tracer. It was.
Looking into this, I found that the update_function_graph_func()
must be called even if the trampoline function is not changed.
This is because archs like PowerPC do not support ftrace_ops being
passed by assembly and instead uses a helper function (what the
trampoline function points to). Since this function is not changed
even when multiple ftrace_ops are added to the code, the test that
falls out before calling update_function_graph_func() will miss that
the update must still be done.

Call update_function_graph_function() for all calls to
update_ftrace_function()

Cc: stable@vger.kernel.org # 3.3+
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-07-15 11:10:25 -04:00
zhangwei(Jovi)
8abfb8727f tracing: Add ftrace_trace_stack into __trace_puts/__trace_bputs
Currently trace option stacktrace is not applicable for
trace_printk with constant string argument, the reason is
in __trace_puts/__trace_bputs ftrace_trace_stack is missing.

In contrast, when using trace_printk with non constant string
argument(will call into __trace_printk/__trace_bprintk), then
trace option stacktrace is workable, this inconstant result
will confuses users a lot.

Link: http://lkml.kernel.org/p/51E7A7C9.9040401@huawei.com

Cc: stable@vger.kernel.org # 3.10+
Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-07-15 11:04:40 -04:00
Zhang Rui
653a3538f8 PM / sleep: fix freeze_ops NULL pointer dereferences
This patch fixes a NULL pointer dereference issue introduced by
commit 1f0b63866f (ACPI / PM: Hold ACPI scan lock over the "freeze"
sleep state).

Fixes: 1f0b63866f (ACPI / PM: Hold ACPI scan lock over the "freeze" sleep state)
Link: http://marc.info/?l=linux-pm&m=140541182017443&w=2
Reported-and-tested-by: Alexander Stein <alexander.stein@systec-electronic.com>
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-07-15 14:27:30 +02:00
Takashi Iwai
4320f6b1d9 PM / sleep: Fix request_firmware() error at resume
The commit [247bc037: PM / Sleep: Mitigate race between the freezer
and request_firmware()] introduced the finer state control, but it
also leads to a new bug; for example, a bug report regarding the
firmware loading of intel BT device at suspend/resume:
  https://bugzilla.novell.com/show_bug.cgi?id=873790

The root cause seems to be a small window between the process resume
and the clear of usermodehelper lock.  The request_firmware() function
checks the UMH lock and gives up when it's in UMH_DISABLE state.  This
is for avoiding the invalid  f/w loading during suspend/resume phase.
The problem is, however, that usermodehelper_enable() is called at the
end of thaw_processes().  Thus, a thawed process in between can kick
off the f/w loader code path (in this case, via btusb_setup_intel())
even before the call of usermodehelper_enable().  Then
usermodehelper_read_trylock() returns an error and request_firmware()
spews WARN_ON() in the end.

This oneliner patch fixes the issue just by setting to UMH_FREEZING
state again before restarting tasks, so that the call of
request_firmware() will be blocked until the end of this function
instead of returning an error.

Fixes: 247bc03742 (PM / Sleep: Mitigate race between the freezer and request_firmware())
Link: https://bugzilla.novell.com/show_bug.cgi?id=873790
Cc: 3.4+ <stable@vger.kernel.org> # 3.4+
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-07-15 14:27:29 +02:00
Oleg Nesterov
2448e3493c tracing: instance_rmdir() leaks ftrace_event_file->filter
instance_rmdir() path destroys the event files but forgets to free
file->filter. Change remove_event_file_dir() to free_event_filter().

Link: http://lkml.kernel.org/p/20140711190638.GA19517@redhat.com

Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
Cc: "zhangwei(Jovi)" <jovi.zhangwei@huawei.com>
Cc: stable@vger.kernel.org # 3.11+
Fixes: f6a84bdc75 "tracing: Introduce remove_event_file_dir()"
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-07-14 14:41:13 -04:00
Linus Torvalds
40f6123737 Merge branch 'for-3.16-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
 "Mostly fixes for the fallouts from the recent cgroup core changes.

  The decoupled nature of cgroup dynamic hierarchy management
  (hierarchies are created dynamically on mount but may or may not be
  reused once unmounted depending on remaining usages) led to more
  ugliness being added to kernfs.

  Hopefully, this is the last of it"

* 'for-3.16-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cpuset: break kernfs active protection in cpuset_write_resmask()
  cgroup: fix a race between cgroup_mount() and cgroup_kill_sb()
  kernfs: introduce kernfs_pin_sb()
  cgroup: fix mount failure in a corner case
  cpuset,mempolicy: fix sleeping function called from invalid context
  cgroup: fix broken css_has_online_children()
2014-07-10 11:38:23 -07:00
Linus Torvalds
a805cbf4c4 Merge branch 'for-3.16-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue fixes from Tejun Heo:
 "Two workqueue fixes.  Both are one liners.  One fixes missing uevent
  for workqueue files on sysfs.  The other one fixes missing zeroing of
  NUMA cpu masks which can lead to oopses among other things"

* 'for-3.16-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: zero cpumask of wq_numa_possible_cpumask on init
  workqueue: fix dev_set_uevent_suppress() imbalance
2014-07-10 11:37:41 -07:00
John Stultz
16927776ae alarmtimer: Fix bug where relative alarm timers were treated as absolute
Sharvil noticed with the posix timer_settime interface, using the
CLOCK_REALTIME_ALARM or CLOCK_BOOTTIME_ALARM clockid, if the users
tried to specify a relative time timer, it would incorrectly be
treated as absolute regardless of the state of the flags argument.

This patch corrects this, properly checking the absolute/relative flag,
as well as adds further error checking that no invalid flag bits are set.

Reported-by: Sharvil Nanavati <sharvil@google.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Sharvil Nanavati <sharvil@google.com>
Cc: stable <stable@vger.kernel.org> #3.0+
Link: http://lkml.kernel.org/r/1404767171-6902-1-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-07-08 10:49:36 +02:00
Yasuaki Ishimatsu
5a6024f160 workqueue: zero cpumask of wq_numa_possible_cpumask on init
When hot-adding and onlining CPU, kernel panic occurs, showing following
call trace.

  BUG: unable to handle kernel paging request at 0000000000001d08
  IP: [<ffffffff8114acfd>] __alloc_pages_nodemask+0x9d/0xb10
  PGD 0
  Oops: 0000 [#1] SMP
  ...
  Call Trace:
   [<ffffffff812b8745>] ? cpumask_next_and+0x35/0x50
   [<ffffffff810a3283>] ? find_busiest_group+0x113/0x8f0
   [<ffffffff81193bc9>] ? deactivate_slab+0x349/0x3c0
   [<ffffffff811926f1>] new_slab+0x91/0x300
   [<ffffffff815de95a>] __slab_alloc+0x2bb/0x482
   [<ffffffff8105bc1c>] ? copy_process.part.25+0xfc/0x14c0
   [<ffffffff810a3c78>] ? load_balance+0x218/0x890
   [<ffffffff8101a679>] ? sched_clock+0x9/0x10
   [<ffffffff81105ba9>] ? trace_clock_local+0x9/0x10
   [<ffffffff81193d1c>] kmem_cache_alloc_node+0x8c/0x200
   [<ffffffff8105bc1c>] copy_process.part.25+0xfc/0x14c0
   [<ffffffff81114d0d>] ? trace_buffer_unlock_commit+0x4d/0x60
   [<ffffffff81085a80>] ? kthread_create_on_node+0x140/0x140
   [<ffffffff8105d0ec>] do_fork+0xbc/0x360
   [<ffffffff8105d3b6>] kernel_thread+0x26/0x30
   [<ffffffff81086652>] kthreadd+0x2c2/0x300
   [<ffffffff81086390>] ? kthread_create_on_cpu+0x60/0x60
   [<ffffffff815f20ec>] ret_from_fork+0x7c/0xb0
   [<ffffffff81086390>] ? kthread_create_on_cpu+0x60/0x60

In my investigation, I found the root cause is wq_numa_possible_cpumask.
All entries of wq_numa_possible_cpumask is allocated by
alloc_cpumask_var_node(). And these entries are used without initializing.
So these entries have wrong value.

When hot-adding and onlining CPU, wq_update_unbound_numa() is called.
wq_update_unbound_numa() calls alloc_unbound_pwq(). And alloc_unbound_pwq()
calls get_unbound_pool(). In get_unbound_pool(), worker_pool->node is set
as follow:

3592         /* if cpumask is contained inside a NUMA node, we belong to that node */
3593         if (wq_numa_enabled) {
3594                 for_each_node(node) {
3595                         if (cpumask_subset(pool->attrs->cpumask,
3596                                            wq_numa_possible_cpumask[node])) {
3597                                 pool->node = node;
3598                                 break;
3599                         }
3600                 }
3601         }

But wq_numa_possible_cpumask[node] does not have correct cpumask. So, wrong
node is selected. As a result, kernel panic occurs.

By this patch, all entries of wq_numa_possible_cpumask are allocated by
zalloc_cpumask_var_node to initialize them. And the panic disappeared.

Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org
Fixes: bce903809a ("workqueue: add wq_numa_tbl_len and wq_numa_possible_cpumask[]")
2014-07-07 09:56:48 -04:00
Linus Torvalds
549f11c9f0 Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Thomas Gleixner:
 "A few minor fixlets in ARM SoC irq drivers and a fix for a memory leak
  which I introduced in the last round of cleanups :("

* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  genirq: Fix memory leak when calling irq_free_hwirqs()
  irqchip: spear_shirq: Fix interrupt offset
  irqchip: brcmstb-l2: Level-2 interrupts are edge sensitive
  irqchip: armada-370-xp: Mask all interrupts during initialization.
2014-07-05 16:56:14 -07:00
Keith Busch
8844aad89e genirq: Fix memory leak when calling irq_free_hwirqs()
irq_free_hwirqs() always calls irq_free_descs() with a cnt == 0
which makes it a no-op since the interrupt count to free is
decremented in itself.

Fixes: 7b6ef12625

Signed-off-by: Keith Busch <keith.busch@intel.com>
Acked-by: David Rientjes <rientjes@google.com>
Link: http://lkml.kernel.org/r/1404167084-8070-1-git-send-email-keith.busch@intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-07-05 21:42:08 +02:00
Kirill Tkhai
b728ca0602 sched: Rework check_for_tasks()
1) Iterate thru all of threads in the system.
   Check for all threads, not only for group leaders.

2) Check for p->on_rq instead of p->state and cputime.
   Preempted task in !TASK_RUNNING state  OR just
   created task may be queued, that we want to be
   reported too.

3) Use read_lock() instead of write_lock().
   This function does not change any structures, and
   read_lock() is enough.

Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Fabian Frederick <fabf@skynet.be>
Cc: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Cc: Konstantin Khorenko <khorenko@parallels.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael wang <wangyun@linux.vnet.ibm.com>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Paul Turner <pjt@google.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Todd E Brandt <todd.e.brandt@linux.intel.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1403684395.3462.44.camel@tkhai
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-05 11:17:45 +02:00
Kirill Tkhai
99b625670f sched/rt: Enqueue just unthrottled rt_rq back on the stack in __disable_runtime()
Make rt_rq available for pick_next_task(). Otherwise, their tasks
stay prisoned long time till dead cpu becomes alive again.

Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
CC: Konstantin Khorenko <khorenko@parallels.com>
CC: Ben Segall <bsegall@google.com>
CC: Paul Turner <pjt@google.com>
CC: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1403684388.3462.43.camel@tkhai
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-05 11:17:44 +02:00