Commit graph

20350 commits

Author SHA1 Message Date
Nicholas Mc Guire
58ac93e4f2 sched: Fix function declaration return type mismatch
static code checking was unhappy with:

  ./kernel/sched/fair.c:162 WARNING: return of wrong type
                int != unsigned int

get_update_sysctl_factor() is declared to return int but is
currently  returning an unsigned int. The first few preprocessed
lines are:

 static int get_update_sysctl_factor(void)
 {
 unsigned int cpus = ({ int __min1 = (cpumask_weight(cpu_online_mask));
 int __min2 = (8); __min1 < __min2 ? __min1: __min2; });
 unsigned int factor;

The type used by min_t() should be 'unsigned int' and the return type
of get_update_sysctl_factor() should also be 'unsigned int' as its
call-site update_sysctl() is expecting 'unsigned int' and the values
utilizing:

  'factor'
  'sysctl_sched_min_granularity'
  'sched_nr_latency'
  'sysctl_sched_wakeup_granularity'

... are also all 'unsigned int', plus cpumask_weight() is also
returning 'unsigned int'.

So the natural type to use around here is 'unsigned int'.

( Patch was compile tested with x86_64_defconfig +
  CONFIG_SCHED_DEBUG=y and the changed sections in
  kernel/sched/fair.i were reviewed. )

Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org>
[ Improved the changelog a bit. ]
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1431716742-11077-1-git-send-email-hofrat@osadl.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-17 06:47:46 +02:00
Linus Torvalds
14db1e8dc0 Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
 "Two fixes: a suspend/resume related regression fix, and an RT priority
  boosting fix"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/core: Fix regression in cpuset_cpu_inactive() for suspend
  sched: Handle priority boosted tasks proper in setscheduler()
2015-05-15 12:42:33 -07:00
Linus Torvalds
ef4a293a44 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "Mostly tooling fixes, but also a lockdep annotation fix, a PMU event
  list fix and a new model addition"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  tools/liblockdep: Fix compilation error
  tools/liblockdep: Fix linker error in case of cross compile
  perf tools: Use getconf to determine number of online CPUs
  tools: Fix tools/vm build
  perf/x86/rapl: Enable Broadwell-U RAPL support
  perf/x86/intel: Fix SLM cache event list
  perf: Annotate inherited event ctx->mutex recursion
2015-05-15 12:38:21 -07:00
John Stultz
f7bcb70eba ktime: Fix ktime_divns to do signed division
It was noted that the 32bit implementation of ktime_divns()
was doing unsigned division and didn't properly handle
negative values.

And when a ktime helper was changed to utilize
ktime_divns, it caused a regression on some IR blasters.
See the following bugzilla for details:
  https://bugzilla.redhat.com/show_bug.cgi?id=1200353

This patch fixes the problem in ktime_divns by checking
and preserving the sign bit, and then reapplying it if
appropriate after the division, it also changes the return
type to a s64 to make it more obvious this is expected.

Nicolas also pointed out that negative dividers would
cause infinite loops on 32bit systems, negative dividers
is unlikely for users of this function, but out of caution
this patch adds checks for negative dividers for both
32-bit (BUG_ON) and 64-bit(WARN_ON) versions to make sure
no such use cases creep in.

[ tglx: Hand an u64 to do_div() to avoid the compiler warning ]

Fixes: 166afb6451 'ktime: Sanitize ktime_to_us/ms conversion'
Reported-and-tested-by: Trevor Cordes <trevor@tecnopolis.ca>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Acked-by: Nicolas Pitre <nicolas.pitre@linaro.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Josh Boyer <jwboyer@redhat.com>
Cc: One Thousand Gnomes <gnomes@lxorguk.ukuu.org.uk>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/1431118043-23452-1-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-05-13 10:19:35 +02:00
Linus Torvalds
9d88f22a81 Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq updates from Thomas Gleixner:
 "Two patches from the irq departement:

   - a simple fix to make dummy_irq_chip usable for wakeup scenarios

   - removal of the gic arch_extn hackery.  Now that all users are
     converted we really want to get rid of the interface so people wont
     come up with new use cases"

* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irqchip: gic: Drop support for gic_arch_extn
  genirq: Set IRQCHIP_SKIP_SET_WAKE flag for dummy_irq_chip
2015-05-09 14:59:05 -07:00
Linus Torvalds
95f3b1f4b1 Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fix from Thomas Gleixner:
 "A simple fix to actually shut down a detached device instead of
  keeping it active"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  clockevents: Shutdown detached clockevent device
2015-05-09 14:57:49 -07:00
Linus Torvalds
26b293e854 The newly added ftrace_print_array_seq() function had a bug in it. Luckily,
the only user of it didn't make the 4.1 merge window. But the helper
 function should be fixed before 4.2 when the users start coming in.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJVTBNUAAoJEEjnJuOKh9ld0VQIAJWPLivGbGJyjSqFd1NXLidS
 ytcbM0dquYjvQ94EDxoA+uBm34hk1JbvcI+FgiOihEeyGh7wrhdibEVGT40TzE2I
 XrfTVwPfN5/k2D5MeZzzRkeoTDufc33MgqTURymRQSzkmHf5GttPXxZ/ckO9Hz9A
 XqzXaHcmnauZSmUY12q8rMtbKYP/dN5hUdmR6p44bMgDJehQkmTzJkxbe6t98b+t
 8y3YAcK5HclYITC2lBVHSw5z8e9F/B7UmrNxvNkcV5kqdYg3NnVnA292kSMft5zo
 WRk1nH4eVARq2dmGQ289QpneHqtMx22RU42m/t8M/v0OUANhlPaDb/RHlyDWJF4=
 =4JGY
 -----END PGP SIGNATURE-----

Merge tag 'trace-fixes-v4.1-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fix from Steven Rostedt:
 "The newly added ftrace_print_array_seq() function had a bug in it.
  Luckily, the only user of it didn't make the 4.1 merge window.

  But the helper function should be fixed before 4.2 when the users
  start coming in"

* tag 'trace-fixes-v4.1-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Make ftrace_print_array_seq compute buf_len
2015-05-08 18:22:05 -07:00
Steven Rostedt
37815bf866 module: Call module notifier on failure after complete_formation()
The module notifier call chain for MODULE_STATE_COMING was moved up before
the parsing of args, into the complete_formation() call. But if the module failed
to load after that, the notifier call chain for MODULE_STATE_GOING was
never called and that prevented the users of those call chains from
cleaning up anything that was allocated.

Link: http://lkml.kernel.org/r/554C52B9.9060700@gmail.com

Reported-by: Pontus Fuchs <pontus.fuchs@gmail.com>
Fixes: 4982223e51 "module: set nx before marking module MODULE_STATE_COMING"
Cc: stable@vger.kernel.org # 3.16+
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2015-05-09 03:29:24 +09:30
Davidlohr Bueso
1d0dcb3ad9 futex: Implement lockless wakeups
Given the overall futex architecture, any chance of reducing
hb->lock contention is welcome. In this particular case, using
wake-queues to enable lockless wakeups addresses very much real
world performance concerns, even cases of soft-lockups in cases
of large amounts of blocked tasks (which is not hard to find in
large boxes, using but just a handful of futex).

At the lowest level, this patch can reduce latency of a single thread
attempting to acquire hb->lock in highly contended scenarios by a
up to 2x. At lower counts of nr_wake there are no regressions,
confirming, of course, that the wake_q handling overhead is practically
non existent. For instance, while a fair amount of variation,
the extended pef-bench wakeup benchmark shows for a 20 core machine
the following avg per-thread time to wakeup its share of tasks:

	nr_thr	ms-before	ms-after
	16 	0.0590		0.0215
	32 	0.0396		0.0220
	48 	0.0417		0.0182
	64 	0.0536		0.0236
	80 	0.0414		0.0097
	96 	0.0672		0.0152

Naturally, this can cause spurious wakeups. However there is no core code
that cannot handle them afaict, and furthermore tglx does have the point
that other events can already trigger them anyway.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Chris Mason <clm@fb.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: George Spelvin <linux@horizon.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/1430494072-30283-3-git-send-email-dave@stgolabs.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-08 12:21:40 +02:00
Peter Zijlstra
7675104990 sched: Implement lockless wake-queues
This is useful for locking primitives that can effect multiple
wakeups per operation and want to avoid lock internal lock contention
by delaying the wakeups until we've released the lock internal locks.

Alternatively it can be used to avoid issuing multiple wakeups, and
thus save a few cycles, in packet processing. Queue all target tasks
and wakeup once you've processed all packets. That way you avoid
waking the target task multiple times if there were multiple packets
for the same task.

Properties of a wake_q are:
- Lockless, as queue head must reside on the stack.
- Being a queue, maintains wakeup order passed by the callers. This can
  be important for otherwise, in scenarios where highly contended locks
  could affect any reliance on lock fairness.
- A queued task cannot be added again until it is woken up.

This patch adds the needed infrastructure into the scheduler code
and uses the new wake_list to delay the futex wakeups until
after we've released the hash bucket locks.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
[tweaks, adjustments, comments, etc.]
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Chris Mason <clm@fb.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: George Spelvin <linux@horizon.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/1430494072-30283-2-git-send-email-dave@stgolabs.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-08 12:20:45 +02:00
Jason Low
7110744516 sched, timer: Use the atomic task_cputime in thread_group_cputimer
Recent optimizations were made to thread_group_cputimer to improve its
scalability by keeping track of cputime stats without a lock. However,
the values were open coded to the structure, causing them to be at
a different abstraction level from the regular task_cputime structure.
Furthermore, any subsequent similar optimizations would not be able to
share the new code, since they are specific to thread_group_cputimer.

This patch adds the new task_cputime_atomic data structure (introduced in
the previous patch in the series) to thread_group_cputimer for keeping
track of the cputime atomically, which also helps generalize the code.

Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Jason Low <jason.low2@hp.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Scott J Norton <scott.norton@hp.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Waiman Long <Waiman.Long@hp.com>
Link: http://lkml.kernel.org/r/1430251224-5764-6-git-send-email-jason.low2@hp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-08 12:17:46 +02:00
Jason Low
1018016c70 sched, timer: Replace spinlocks with atomics in thread_group_cputimer(), to improve scalability
While running a database workload, we found a scalability issue with itimers.

Much of the problem was caused by the thread_group_cputimer spinlock.
Each time we account for group system/user time, we need to obtain a
thread_group_cputimer's spinlock to update the timers. On larger systems
(such as a 16 socket machine), this caused more than 30% of total time
spent trying to obtain this kernel lock to update these group timer stats.

This patch converts the timers to 64-bit atomic variables and use
atomic add to update them without a lock. With this patch, the percent
of total time spent updating thread group cputimer timers was reduced
from 30% down to less than 1%.

Note: On 32-bit systems using the generic 64-bit atomics, this causes
sample_group_cputimer() to take locks 3 times instead of just 1 time.
However, we tested this patch on a 32-bit system ARM system using the
generic atomics and did not find the overhead to be much of an issue.
An explanation for why this isn't an issue is that 32-bit systems usually
have small numbers of CPUs, and cacheline contention from extra spinlocks
called periodically is not really apparent on smaller systems.

Signed-off-by: Jason Low <jason.low2@hp.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Scott J Norton <scott.norton@hp.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Waiman Long <Waiman.Long@hp.com>
Link: http://lkml.kernel.org/r/1430251224-5764-4-git-send-email-jason.low2@hp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-08 12:15:31 +02:00
Jason Low
7e5a2c1729 sched/numa: Document usages of mm->numa_scan_seq
The p->mm->numa_scan_seq is accessed using READ_ONCE/WRITE_ONCE
and modified without exclusive access. It is not clear why it is
accessed this way. This patch provides some documentation on that.

Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Jason Low <jason.low2@hp.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Scott J Norton <scott.norton@hp.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Waiman Long <waiman.long@hp.com>
Link: http://lkml.kernel.org/r/1430440094.2475.61.camel@j-VirtualBox
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-08 12:13:13 +02:00
Jason Low
316c1608d1 sched, timer: Convert usages of ACCESS_ONCE() in the scheduler to READ_ONCE()/WRITE_ONCE()
ACCESS_ONCE doesn't work reliably on non-scalar types. This patch removes
the rest of the existing usages of ACCESS_ONCE() in the scheduler, and use
the new READ_ONCE() and WRITE_ONCE() APIs as appropriate.

Signed-off-by: Jason Low <jason.low2@hp.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Waiman Long <Waiman.Long@hp.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Scott J Norton <scott.norton@hp.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/1430251224-5764-2-git-send-email-jason.low2@hp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-08 12:11:32 +02:00
Nicholas Mc Guire
ce2f5fe463 sched/core: Remove unnecessary down/up conversion
'rt_period_us' is automatically type converted from u64 to long and then cast
back to u64 - this down/up conversion is unnecessary and can be removed to
improve readability.

This will also help us not truncate 'rt_period_us' to 32 bits on 32-bit kernels,
should we ever have so large values. (unlikely, not the least due to procfs.)

Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1430643116-24049-1-git-send-email-hofrat@osadl.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-08 12:10:07 +02:00
Palmer Dabbelt
b76808e680 signals, sched: Change all uses of JOBCTL_* from 'int' to 'long'
c56fb6564dcd ("Fix a misaligned load inside ptrace_attach()") makes
jobctl an "unsigned long".  It makes sense to have the masks applied
to it match that type.  This is currently just a cosmetic change, but
it will prevent the mask from being unexpectedly truncated if we ever
end up with masks with more bits.

One instance of "signr" is an int, but I left this alone because the
mask ensures that it will never overflow.

Signed-off-by: Palmer Dabbelt <palmer@dabbelt.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bobby.prani@gmail.com
Cc: oleg@redhat.com
Cc: paulmck@linux.vnet.ibm.com
Cc: richard@nod.at
Cc: vdavydov@parallels.com
Link: http://lkml.kernel.org/r/1430453997-32459-4-git-send-email-palmer@dabbelt.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-08 12:04:36 +02:00
Peter Zijlstra
3289bdb429 sched: Move the loadavg code to a more obvious location
I could not find the loadavg code.. turns out it was hidden in a file
called proc.c. It further got mingled up with the cruft per rq load
indexes (which we really want to get rid of).

Move the per rq load indexes into the fair.c load-balance code (that's
the only thing that uses them) and rename proc.c to loadavg.c so we
can find it again.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
[ Did minor cleanups to the code. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-08 12:04:12 +02:00
Ingo Molnar
bb2ebf0886 Merge branch 'sched/urgent' into sched/core, before applying new patches
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-08 11:59:57 +02:00
Peter Zijlstra
8b10c5e2b5 perf: Annotate inherited event ctx->mutex recursion
While fuzzing Sasha tripped over another ctx->mutex recursion lockdep
splat. Annotate this.

Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-08 11:59:40 +02:00
Paul Gortmaker
6a82b60da2 sched/core: Remove __cpuinit section tag that crept back in
We removed __cpuinit support (leaving no-op stubs) quite some time
ago.  However this one crept back in as of commit a803f0261b
("sched: Initialize rq->age_stamp on processor start")

Since we want to clobber the stubs too, get this removed now.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Corey Minyard <cminyard@mvista.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1430174880-27958-2-git-send-email-paul.gortmaker@windriver.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-08 11:54:43 +02:00
Omar Sandoval
533445c6e5 sched/core: Fix regression in cpuset_cpu_inactive() for suspend
Commit 3c18d447b3 ("sched/core: Check for available DL bandwidth in
cpuset_cpu_inactive()"), a SCHED_DEADLINE bugfix, had a logic error that
caused a regression in setting a CPU inactive during suspend. I ran into
this when a program was failing pthread_setaffinity_np() with EINVAL after
a suspend+wake up.

A simple reproducer:

	$ ./a.out
	sched_setaffinity: Success
	$ systemctl suspend
	$ ./a.out
	sched_setaffinity: Invalid argument

... where ./a.out is:

	#define _GNU_SOURCE
	#include <errno.h>
	#include <sched.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <unistd.h>

	int main(void)
	{
		long num_cores;
		cpu_set_t cpu_set;
		int ret;

		num_cores = sysconf(_SC_NPROCESSORS_ONLN);
		CPU_ZERO(&cpu_set);
		CPU_SET(num_cores - 1, &cpu_set);
		errno = 0;
		ret = sched_setaffinity(getpid(), sizeof(cpu_set), &cpu_set);
		perror("sched_setaffinity");
		return ret ? EXIT_FAILURE : EXIT_SUCCESS;
	}

The mistake is that suspend is handled in the action ==
CPU_DOWN_PREPARE_FROZEN case of the switch statement in
cpuset_cpu_inactive().

However, the commit in question masked out CPU_TASKS_FROZEN
from the action, making this case dead.

The fix is straightforward.

Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 3c18d447b3 ("sched/core: Check for available DL bandwidth in cpuset_cpu_inactive()")
Link: http://lkml.kernel.org/r/1cb5ecb3d6543c38cce5790387f336f54ec8e2bc.1430733960.git.osandov@osandov.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-08 11:53:56 +02:00
Thomas Gleixner
0782e63bc6 sched: Handle priority boosted tasks proper in setscheduler()
Ronny reported that the following scenario is not handled correctly:

	T1 (prio = 10)
	   lock(rtmutex);

	T2 (prio = 20)
	   lock(rtmutex)
	      boost T1

	T1 (prio = 20)
	   sys_set_scheduler(prio = 30)
	   T1 prio = 30
	   ....
	   sys_set_scheduler(prio = 10)
	   T1 prio = 30

The last step is wrong as T1 should now be back at prio 20.

Commit c365c292d0 ("sched: Consider pi boosting in setscheduler()")
only handles the case where a boosted tasks tries to lower its
priority.

Fix it by taking the new effective priority into account for the
decision whether a change of the priority is required.

Reported-by: Ronny Meeus <ronny.meeus@gmail.com>
Tested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: <stable@vger.kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Fixes: c365c292d0 ("sched: Consider pi boosting in setscheduler()")
Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1505051806060.4225@nanos
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-08 11:53:55 +02:00
Ingo Molnar
ed7b40c90e Merge branch 'sched/urgent' into sched/core
So this isn't really a fix but a cleanup that can wait for v4.2.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-08 11:50:24 +02:00
Thomas Gleixner
6b442bc813 nohz: Fix !HIGH_RES_TIMERS hang
Simon Horman reported this crash on a system with
high-res timers disabled but nohz enabled:

  > ------------[ cut here ]------------
  > kernel BUG at kernel/irq_work.c:135!

    BUG_ON(!irqs_disabled());

So something enabled interrupts in the periodic tick handling machinery,
and that code path indeed has a local_irq_disable()/enable pair in
tick_nohz_switch_to_nohz() which causes havoc. Fix it.

This patch also fixes a +nohz -hrtimers hang reported by Ingo Molnar.

Reported-by: Simon Horman <horms@verge.net.au>
Reported-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: LAK <linux-arm-kernel@lists.infradead.org>
Cc: Magnus Damm <magnus.damm@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1505071425520.4225@nanos
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-07 16:15:50 +02:00
Alex Bennée
ac01ce1410 tracing: Make ftrace_print_array_seq compute buf_len
The only caller to this function (__print_array) was getting it wrong by
passing the array length instead of buffer length. As the element size
was already being passed for other reasons it seems reasonable to push
the calculation of buffer length into the function.

Link: http://lkml.kernel.org/r/1430320727-14582-1-git-send-email-alex.bennee@linaro.org

Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-05-06 23:03:23 -04:00
Linus Torvalds
02f0f5721e Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU fix from Ingo Molnar:
 "An RCU Kconfig fix that eliminates an annoying interactive kconfig
  question for CONFIG_RCU_TORTURE_TEST_SLOW_INIT"

* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  rcu: Control grace-period delays directly from value
2015-05-06 10:26:37 -07:00
Andreas Sandberg
38d23a6cc1 tick: hrtimer-broadcast: Prevent endless restarting when broadcast device is unused
The hrtimer callback in the hrtimer's tick broadcast code sometimes
incorrectly ends up scheduling events at the current tick causing the
kernel to hang servicing the same hrtimer forever. This typically
happens when a device is swapped out by
tick_install_broadcast_device(), which replaces the event handler with
clock_events_handle_noop() and sets the device mode to
CLOCK_EVT_MODE_UNUSED. If the timer is scheduled when this happens,
the next_event field will not be updated and the hrtimer ends up being
restarted at the current tick. To prevent this from happening, only
try to restart the hrtimer if the broadcast clock event device is in
one of the active modes and try to cancel the timer when entering the
CLOCK_EVT_MODE_UNUSED mode.

Signed-off-by: Andreas Sandberg <andreas.sandberg@arm.com>
Tested-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/1429880765-5558-1-git-send-email-andreas.sandberg@arm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-05-05 15:34:21 +02:00
Joonwoo Park
781978e6e1 timer: Use timer->base for flag checks
At present, internal_add_timer() examines flags with 'base' which doesn't
contain flags.  Examine with 'timer->base' to avoid unnecessary waking up
of nohz CPU when timer base has TIMER_DEFERRABLE set.

Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
Cc: sboyd@codeaurora.org
Cc: skannan@codeaurora.org
Cc: John Stultz <john.stultz@linaro.org>
Link: http://lkml.kernel.org/r/1430187709-21087-1-git-send-email-joonwoop@codeaurora.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-05-05 10:40:43 +02:00
Preeti U Murthy
1ef09cd713 tick-broadcast: Fix the printing of broadcast masks
Today the number of bits of the broadcast masks that is output into
/proc/timer_list is sizeof(unsigned long). This means that on machines
with a larger number of CPUs, the bitmasks of CPUs beyond this range do
not appear.

Fix this by using bitmap printing through "%*pb" instead, so as to
output the broadcast masks for the range of nr_cpu_ids into
/proc/timer_list.

Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: peterz@infradead.org
Cc: linuxppc-dev@ozlabs.org
Cc: john.stultz@linaro.org
Link: http://lkml.kernel.org/r/20150428084520.3314.62668.stgit@preeti.in.ibm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-05-05 10:35:58 +02:00
Thomas Gleixner
298dbd1c5c tick: broadcast: Simplify oneshot logic and shorten lock region
Simplify the oneshot logic by avoiding the reprogramming loops. That
also allows to call the cpu local handler outside of the
broadcast_lock held region.

Tested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-05-05 10:25:23 +02:00
Thomas Gleixner
2951d5c031 tick: broadcast: Prevent livelock from event handler
With the removal of the hrtimer softirq the switch to highres/nohz
mode happens in the tick interrupt. That leads to a livelock when the
per cpu event handler is directly called from the broadcast handler
under broadcast lock because broadcast lock needs to be taken for the
highres/nohz switch as well.

Solve this by calling the cpu local handler outside the broadcast_lock
held region.

Fixes: c6eb3f70d4 "hrtimer: Get rid of hrtimer softirq"
Reported-and-tested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-05-05 10:25:23 +02:00
Thomas Gleixner
30fbd59057 perf: Remove unused function perf_mux_hrtimer_cancel()
Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-05-04 13:51:40 +02:00
Linus Torvalds
6c3c1eb3c3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Receive packet length needs to be adjust by 2 on RX to accomodate
    the two padding bytes in altera_tse driver.  From Vlastimil Setka.

 2) If rx frame is dropped due to out of memory in macb driver, we leave
    the receive ring descriptors in an undefined state.  From Punnaiah
    Choudary Kalluri

 3) Some netlink subsystems erroneously signal NLM_F_MULTI.  That is
    only for dumps.  Fix from Nicolas Dichtel.

 4) Fix mis-use of raw rt->rt_pmtu value in ipv4, one must always go via
    the ipv4_mtu() helper.  From Herbert Xu.

 5) Fix null deref in bridge netfilter, and miscalculated lengths in
    jump/goto nf_tables verdicts.  From Florian Westphal.

 6) Unhash ping sockets properly.

 7) Software implementation of BPF divide did 64/32 rather than 64/64
    bit divide.  The JITs got it right.  Fix from Alexei Starovoitov.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (30 commits)
  ipv4: Missing sk_nulls_node_init() in ping_unhash().
  net: fec: Fix RGMII-ID mode
  net/mlx4_en: Schedule napi when RX buffers allocation fails
  netxen_nic: use spin_[un]lock_bh around tx_clean_lock
  net/mlx4_core: Fix unaligned accesses
  mlx4_en: Use correct loop cursor in error path.
  cxgb4: Fix MC1 memory offset calculation
  bnx2x: Delay during kdump load
  net: Fix Kernel Panic in bonding driver debugfs file: rlb_hash_table
  net: dsa: Fix scope of eeprom-length property
  net: macb: Fix race condition in driver when Rx frame is dropped
  hv_netvsc: Fix a bug in netvsc_start_xmit()
  altera_tse: Correct rx packet length
  mlx4: Fix tx ring affinity_mask creation
  tipc: fix problem with parallel link synchronization mechanism
  tipc: remove wrong use of NLM_F_MULTI
  bridge/nl: remove wrong use of NLM_F_MULTI
  bridge/mdb: remove wrong use of NLM_F_MULTI
  net: sched: act_connmark: don't zap skb->nfct
  trivial: net: systemport: bcmsysport.h: fix 0x0x prefix
  ...
2015-05-01 20:51:04 -07:00
Linus Torvalds
4a152c3913 Power management and ACPI fixes for v4.1-rc2
- Fix for a regression in the cpuidle core introduced by one of
    the recent commits in the clockevents_notify() removal series
    that put a call to a function which had to be executed with
    disabled interrupts into a code path running with enabled
    interrupts (Rafael J Wysocki).
 
  - Fix for a build problem in ACPICA (with GCC 4.5) introduced by one
    of the recent ACPICA tools commits that added a duplicate typedef
    to one of the ACPICA's header files by mistake (Olaf Hering).
 
  - Fix for a regression in the ACPI SBS (Smart Battery Subsystem)
    driver introduced during the 3.18 development cycle causing the
    smart battery manager to be marked as not present when it should
    be marked as present (Chris Bainbridge).
 
 /
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABCAAGBQJVQoeSAAoJEILEb/54YlRxuscP/3Vz98JY0P7sMLnxVU8RPBxW
 cBN7G7F1gZup66/yV3MzvYHbgOFwNdQ2aMVxN+qV7rVMw27uwHljVr3svB+/GS15
 TeyKxvFWZYwhPKdsNAoHdkGBsptAK4DdyA+N2wH3MZ4dd8HepxpBG6QvzaEt4s43
 wFhhFrzEP3HDjrPoF/7TwKbsSrFuU5/U5PTd3dIuukj6JAl0PWnjczuVsg1PRyCW
 9kvefs3U56s8GbWOhDC+QyQ3eYJ2y35j/XnCKFWwljcdnis6HpsKj+oxSG2o0K4o
 9LLL3MgnTQuFbenuFIJUsMdFFdFrFkEeiV1EAdKCzQcXdlipxvi2cPttdNWVd6eZ
 JobXyhq7iYUQK5E5Dp6TCokxhzOyjvTo7VmEkkkqLVmfeO0FnggUAduRMAIacYXS
 ZML45m4UtvCJRoQXXxyzwYQXFeRD/nJQlOavC95jGGqtAij0Xk4xqTauQKIiqiaE
 zobOJ5ZxJMP+njFfzyDxjm68LjobYm1fUJTpWzpUSExjEKImRcw/QMzw23MyHHUL
 IJ7hP8FcELglkZWk/0ZwoNhbwM5Q2qGX5WMtqyunNFcf83aBsCr8bSjOrm4zmuL7
 NWjjeXuYwzPuC7SnFqf2K/CCfdkrdnU1frKLTpOm8t8W0P4mr1ckYrPemNemr8sN
 e5GozK5m13T2gTT4+q6M
 =v78a
 -----END PGP SIGNATURE-----

Merge tag 'pm+acpi-4.1-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management and ACPI fixes from Rafael Wysocki:
 "Three regression fixes this time, one for a recent regression in the
  cpuidle core affecting multiple systems, one for an inadvertently
  added duplicate typedef in ACPICA that breaks compilation with GCC 4.5
  and one for an ACPI Smart Battery Subsystem driver regression
  introduced during the 3.18 cycle (stable-candidate).

  Specifics:

   - Fix for a regression in the cpuidle core introduced by one of the
     recent commits in the clockevents_notify() removal series that put
     a call to a function which had to be executed with disabled
     interrupts into a code path running with enabled interrupts (Rafael
     J Wysocki)

   - Fix for a build problem in ACPICA (with GCC 4.5) introduced by one
     of the recent ACPICA tools commits that added a duplicate typedef
     to one of the ACPICA's header files by mistake (Olaf Hering)

   - Fix for a regression in the ACPI SBS (Smart Battery Subsystem)
     driver introduced during the 3.18 development cycle causing the
     smart battery manager to be marked as not present when it should be
     marked as present (Chris Bainbridge)"

* tag 'pm+acpi-4.1-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  cpuidle: Run tick_broadcast_exit() with disabled interrupts
  ACPI / SBS: Enable battery manager when present
  ACPICA: remove duplicate u8 typedef
2015-04-30 14:23:31 -07:00
Linus Torvalds
9dbbe3cfc3 Remove from guest code the handling of task migration during a
pvclock read; instead use the correct protocol in KVM.
 
 This removes the need for task migration notifiers in core
 scheduler code.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQEcBAABAgAGBQJVQkUWAAoJEL/70l94x66DhfcH/A8RTHUOELtoy+v2weahn21m
 FFWEnEUlCWzYgmiddgFdlr6+ub386W3ryFsXKPqjrn/8LVv3yS7tK1NJF8d03LQw
 n7HtIsrF01E9UI8CIWO4S/mUxWQev6vEJ9NXtNrsJcRmhSeLaIZkPjTH8Zqyx4i9
 ZvG4731WHXmxvbJ03bfJU9Y8OwHXe55GMi614aTxPndVBGdvIRu2Oj6aTfQTeab/
 7tEujub0MKWp74a7eyNU4GItcvIAXZCQt2wMc5dN1VK3ma5FTOnHIOuhAb8mACFF
 qEeGhtxAnOf7W+s9J8i7zVBdA5MOS0vUKng361ZOVGDb0OLqcVADW7GpuTZfRAM=
 =2A7v
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm changes from Paolo Bonzini:
 "Remove from guest code the handling of task migration during a pvclock
  read; instead use the correct protocol in KVM.

  This removes the need for task migration notifiers in core scheduler
  code"

[ The scheduler people really hated the migration notifiers, so this was
  kind of required  - Linus ]

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  x86: pvclock: Really remove the sched notifier for cross-cpu migrations
  kvm: x86: fix kvmclock update protocol
2015-04-30 09:44:04 -07:00
David Howells
9c4249c8e0 modsign: change default key details
Change default key details to be more obviously unspecified.

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: James Morris <james.l.morris@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-30 09:35:41 -07:00
Rafael J. Wysocki
df8d9eeadd cpuidle: Run tick_broadcast_exit() with disabled interrupts
Commit 335f49196f (sched/idle: Use explicit broadcast oneshot
control function) replaced clockevents_notify() invocations in
cpuidle_idle_call() with direct calls to tick_broadcast_enter()
and tick_broadcast_exit(), but it overlooked the fact that
interrupts were already enabled before calling the latter which
led to functional breakage on systems using idle states with the
CPUIDLE_FLAG_TIMER_STOP flag set.

Fix that by moving the invocations of tick_broadcast_enter()
and tick_broadcast_exit() down into cpuidle_enter_state() where
interrupts are still disabled when tick_broadcast_exit() is
called.  Also ensure that interrupts will be disabled before
running tick_broadcast_exit() even if they have been enabled by
the idle state's ->enter callback.  Trigger a WARN_ON_ONCE() in
that case, as we generally don't want that to happen for states
with CPUIDLE_FLAG_TIMER_STOP set.

Fixes: 335f49196f (sched/idle: Use explicit broadcast oneshot control function)
Reported-and-tested-by: Linus Walleij <linus.walleij@linaro.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Reported-and-tested-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2015-04-29 15:19:21 +02:00
Linus Torvalds
14bc84ce0b Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 updates from Martin Schwidefsky:
 "One additional new feature for 4.1, a new PRNG based on SHA-512 for
  the zcrypt driver.

  Two memory management related changes, the page table reallocation for
  KVM is removed, and with file ptes gone the encoding of page table
  entries is improved.

  And three bug fixes"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
  s390/zcrypt: Introduce new SHA-512 based Pseudo Random Generator.
  s390/mm: change swap pte encoding and pgtable cleanup
  s390/mm: correct transfer of dirty & young bits in __pmd_to_pte
  s390/bpf: add dependency to z196 features
  s390/3215: free memory in error path
  s390/kvm: remove delayed reallocation of page tables for KVM
  kexec: allocate the kexec control page with KEXEC_CONTROL_MEMORY_GFP
2015-04-28 09:58:46 -07:00
Alexei Starovoitov
876a7ae65b bpf: fix 64-bit divide
ALU64_DIV instruction should be dividing 64-bit by 64-bit,
whereas do_div() does 64-bit by 32-bit divide.
x64 and arm64 JITs correctly implement 64 by 64 unsigned divide.
llvm BPF backend emits code assuming that ALU64_DIV does 64 by 64.

Fixes: 89aa075832 ("net: sock: allow eBPF programs to be attached to sockets")
Reported-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-27 23:11:49 -04:00
Paolo Bonzini
73459e2a1a x86: pvclock: Really remove the sched notifier for cross-cpu migrations
This reverts commits 0a4e6be9ca
and 80f7fdb1c7.

The task migration notifier was originally introduced in order to support
the pvclock vsyscall with non-synchronized TSC, but KVM only supports it
with synchronized TSC.  Hence, on KVM the race condition is only needed
due to a bad implementation on the host side, and even then it's so rare
that it's mostly theoretical.

As far as KVM is concerned it's possible to fix the host, avoiding the
additional complexity in the vDSO and the (re)introduction of the task
migration notifier.

Xen, on the other hand, hasn't yet implemented vsyscall support at
all, so we do not care about its plans for non-synchronized TSC.

Reported-by: Peter Zijlstra <peterz@infradead.org>
Suggested-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-04-27 15:49:30 +02:00
Linus Torvalds
9ec3a646fe Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull fourth vfs update from Al Viro:
 "d_inode() annotations from David Howells (sat in for-next since before
  the beginning of merge window) + four assorted fixes"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  RCU pathwalk breakage when running into a symlink overmounting something
  fix I_DIO_WAKEUP definition
  direct-io: only inc/dec inode->i_dio_count for file systems
  fs/9p: fix readdir()
  VFS: assorted d_backing_inode() annotations
  VFS: fs/inode.c helpers: d_inode() annotations
  VFS: fs/cachefiles: d_backing_inode() annotations
  VFS: fs library helpers: d_inode() annotations
  VFS: assorted weird filesystems: d_inode() annotations
  VFS: normal filesystems (and lustre): d_inode() annotations
  VFS: security/: d_inode() annotations
  VFS: security/: d_backing_inode() annotations
  VFS: net/: d_inode() annotations
  VFS: net/unix: d_backing_inode() annotations
  VFS: kernel/: d_inode() annotations
  VFS: audit: d_backing_inode() annotations
  VFS: Fix up some ->d_inode accesses in the chelsio driver
  VFS: Cachefiles should perform fs modifications on the top layer only
  VFS: AF_UNIX sockets should call mknod on the top layer only
2015-04-26 17:22:07 -07:00
Viresh Kumar
149aabcc44 clockevents: Shutdown detached clockevent device
A clockevent device is marked DETACHED when it is replaced by another
clockevent device.

The device is shutdown properly for drivers that implement legacy
->set_mode() callback, as we call ->set_mode() for CLOCK_EVT_MODE_UNUSED
as well.

But for the new per-state callback interface, we skip shutting down the
device, as we thought its an internal state change. That wasn't correct.

The effect is that the device is left programmed in oneshot or periodic
mode.

Fall-back to 'case CLOCK_EVT_STATE_SHUTDOWN', to shutdown the device.

Fixes: bd624d75db "clockevents: Introduce mode specific callbacks"
Reported-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: linaro-kernel@lists.linaro.org
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/eef0a91c51b74d4e52c8e5a95eca27b5a0563f07.1428650683.git.viresh.kumar@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-04-24 21:01:48 +02:00
Roger Quadros
10a50f1ab5 genirq: Set IRQCHIP_SKIP_SET_WAKE flag for dummy_irq_chip
Without this system suspend is broken on systems that have
drivers calling enable/disable_irq_wake() for interrupts based off
the dummy irq hook. (e.g. drivers/gpio/gpio-pcf857x.c)

Signed-off-by: Roger Quadros <rogerq@ti.com>
Cc: <cw00.choi@samsung.com>
Cc: <balbi@ti.com>
Cc: <tony@atomide.com>
Cc: Gregory Clement <gregory.clement@free-electrons.com>
Link: http://lkml.kernel.org/r/552E1DD3.4040106@ti.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-04-24 20:57:06 +02:00
Martin Schwidefsky
7e01b5acd8 kexec: allocate the kexec control page with KEXEC_CONTROL_MEMORY_GFP
Introduce KEXEC_CONTROL_MEMORY_GFP to allow the architecture code
to override the gfp flags of the allocation for the kexec control
page. The loop in kimage_alloc_normal_control_pages allocates pages
with GFP_KERNEL until a page is found that happens to have an
address smaller than the KEXEC_CONTROL_MEMORY_LIMIT. On systems
with a large memory size but a small KEXEC_CONTROL_MEMORY_LIMIT
the loop will keep allocating memory until the oom killer steps in.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2015-04-23 16:52:01 +02:00
Thomas Gleixner
b484403b9a sched: debug: Remove the cfs bandwidth timer_active printout
The struct member is gone.

Reported-by: fengguang.wu@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
2015-04-23 13:58:09 +02:00
Linus Torvalds
27cf3a16b2 Merge branch 'upstream' of git://git.infradead.org/users/pcmoore/audit
Pull audit fixes from Paul Moore:
 "Seven audit patches for v4.1, all bug fixes.

  The largest, and perhaps most significant commit helps resolve some
  memory pressure issues related to the inode cache and audit, there are
  also a few small commits which help resolve some timing issues with
  the audit log queue, and the rest fall into the always popular "code
  clean-up" category.

  In general, nothing really substantial, just a nice set of maintenance
  patches"

* 'upstream' of git://git.infradead.org/users/pcmoore/audit:
  audit: Remove condition which always evaluates to false
  audit: reduce mmap_sem hold for mm->exe_file
  audit: consolidate handling of mm->exe_file
  audit: code clean up
  audit: don't reset working wait time accidentally with auditd
  audit: don't lose set wait time on first successful call to audit_log_start()
  audit: move the tree pruning to a dedicated thread
2015-04-22 14:49:23 -07:00
kbuild test robot
9183034879 perf: perf_mux_hrtimer_cancel() can be static
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Cc: kbuild-all@01.org
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20150422200000.GA122603@lkp-sb04
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-04-22 22:06:34 +02:00
Linus Torvalds
4f2112351b This adds three fixes for the tracing code.
The first is a bug when ftrace_dump_on_oops is triggered in atomic context
 and function graph tracer is the tracer that is being reported.
 
 The second fix is bad parsing of the trace_events from the kernel
 command line, where it would ignore specific events if the system
 name is used with defining the event(it enables all events within the
 system).
 
 The last one is a fix to the TRACE_DEFINE_ENUM(), where a check was missing
 to see if the ptr was incremented to the end of the string, but the loop
 increments it again and can miss the nul delimiter to stop processing.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJVN7g7AAoJEEjnJuOKh9ldLCkIAJj6wsj59wR8aIxtxnD5bJZZ
 Y9dKkas6BOaUCGp0MVBCkpm3uMgh0uPO10jOuthgc3LgQy0piwKJrIzGkI5gcRuA
 JvZw2X08/jCSNu8BHydes2A0XMkZobMuWFAeWz3CSzNBfbI3sDDqgnRQ9eyMM66R
 +sPV7ZELQLK/ZFs93gFoaZe/OKGYJcamhMtG2v7p9h1qBJZUDakRFo478bAwu5SB
 Kh6xpXA76WnmkeQekAAsWGdqIQzoNy3IoVePmdhVpZvoiKLUrf1JWVYFoNkO39Ik
 kC1gqJro0EmHQFOo5rt8yQmqXjvSU0sS/sAW3HZ0Szl5OtiUNnag+Wt3ku7lr8U=
 =VORa
 -----END PGP SIGNATURE-----

Merge tag 'trace-v4.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:
 "This adds three fixes for the tracing code.

  The first is a bug when ftrace_dump_on_oops is triggered in atomic
  context and function graph tracer is the tracer that is being
  reported.

  The second fix is bad parsing of the trace_events from the kernel
  command line, where it would ignore specific events if the system name
  is used with defining the event(it enables all events within the
  system).

  The last one is a fix to the TRACE_DEFINE_ENUM(), where a check was
  missing to see if the ptr was incremented to the end of the string,
  but the loop increments it again and can miss the nul delimiter to
  stop processing"

* tag 'trace-v4.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Fix possible out of bounds memory access when parsing enums
  tracing: Fix incorrect enabling of trace events by boot cmdline
  tracing: Handle ftrace_dump() atomic context in graph_trace_open()
2015-04-22 11:27:36 -07:00
Linus Torvalds
15ce2658dd Quentin opened a can of worms by adding extable entry checking to modpost,
but most architectures seem fixed now.  Thanks to all involved.
 
 Last minute rebase because I noticed a "[PATCH]" had snuck into a commit
 message somehow.
 
 Cheers,
 Rusty.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJVN1X3AAoJENkgDmzRrbjxmBEP/1Kv1wCrUGg07TfqNTJgr5Pe
 R81C+8P2wSEsA94lBhLNbn2DsN7uWsygcSXZMQ+5CrmbKAG45kRlMUzSlg6IdGXo
 usA9rOlH4hR+Dqn3zyx543xMJyp0a0LoSCLWGMmr/U71AXNp5kehvbIv64gJHrwg
 4CALI9qCwHH4FXF7WBYdLknlfpmD7KzCjIVckkfUfwf2A8mZXKdqBFQ+MPqbChEm
 vpFWkgp1oJgK4OcvvBeiNT2PwFZy1vmJtQWc2lieJ6HCjbqvwgM0g3EYTIQmccIm
 O40mEFMNkUuliMtrCZgOJBP2bmWy79JydiUq5Mu5Cb4bL6jGl2r6Z+w+iQ7Sggv9
 DTqTrk0/Cn02krJXDXAHVOXXgTqggf1xev6wr+3dbZn5sytQ8ddImuuVai8KALoC
 aEXAsnAK+3GnkAn9qqKVFhnrnCcI69I6srUAn5DKVWu36+Ye05FsZboIFYcxcK1F
 NZ+31xA17xA+yoo62ly/Auz7DXMa2QZfJcoQLMx2fMgL/+wMqzfGbcrZ5ZaVIj4w
 ePig1SbozAweAAfX7+W+QCQ3wqNN/nJgMO8vPldlcq7eaXH0/Dqu72cC2d82dK+H
 xl/1pEj83W0AkUkfNS8j3Z4zSzg31g8bK5SmPPRID00mx8roCfsP/Cfx5hRiO8U2
 ECtbKDxSyUWwF9vnOLVR
 =EYrW
 -----END PGP SIGNATURE-----

Merge tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux

Pull module updates from Rusty Russell:
 "Quentin opened a can of worms by adding extable entry checking to
  modpost, but most architectures seem fixed now.  Thanks to all
  involved.

  Last minute rebase because I noticed a "[PATCH]" had snuck into a
  commit message somehow"

* tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
  modpost: don't emit section mismatch warnings for compiler optimizations
  modpost: expand pattern matching to support substring matches
  modpost: do not try to match the SHT_NUL section.
  modpost: fix extable entry size calculation.
  modpost: fix inverted logic in is_extable_fault_address().
  modpost: handle -ffunction-sections
  modpost: Whitelist .text.fixup and .exception.text
  params: handle quotes properly for values not of form foo="bar".
  modpost: document the use of struct section_check.
  modpost: handle relocations mismatch in __ex_table.
  scripts: add check_extable.sh script.
  modpost: mismatch_handler: retrieve tosym information only when needed.
  modpost: factorize symbol pretty print in get_pretty_name().
  modpost: add handler function pointer to sectioncheck.
  modpost: add .sched.text and .kprobes.text to the TEXT_SECTIONS list.
  modpost: add strict white-listing when referencing sections.
  module: do not print allocation-fail warning on bogus user buffer size
  kernel/module.c: fix typos in message about unused symbols
2015-04-22 09:49:24 -07:00
Peter Zijlstra
272325c482 perf: Fix mux_interval hrtimer wreckage
Thomas stumbled over the hrtimer_forward_now() in
perf_event_mux_interval_ms_store() and noticed its broken-ness.

You cannot just change the expiry time of an active timer, it will
destroy the red-black tree order and cause havoc.

Change it to (re)start the timer instead, (re)starting a timer will
dequeue and enqueue a timer and therefore preserve rb-tree order.

Since we cannot enqueue remotely, wrap the thing in
cpu_function_call(), this however mandates that we restrict ourselves
to online cpus. Also serialize the entire setting so we don't get
multiple concurrent threads trying to update to different values.

Also fix a problem in perf_mux_hrtimer_restart(), checking against
hrtimer_active() can actually loose us the timer when timer->state ==
HRTIMER_STATE_CALLBACK and the callback has already decided NORESTART.

Furthermore it doesn't make any sense to test
hrtimer_callback_running() when we already tested hrtimer_active(),
but with the above change, we explicitly must call it when
callback_running.

Lastly, rename a few functions:

  s/perf_cpu_hrtimer_/perf_mux_hrtimer_/ -- because I could not find
                                            the mux timer function

  s/\<hr\>/timer/ -- because that's the normal way of calling things.

Fixes: 62b8563979 ("perf: Add sysfs entry to adjust multiplexing interval per PMU")
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20150415095011.863052571@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-04-22 17:12:22 +02:00