234 commits

Author SHA1 Message Date
Prateek Sood
d914882c93 locking/osq_lock: Fix osq_lock queue corruption
commit 50972fe78f24f1cd0b9d7bbf1f87d2be9e4f412e upstream.

Fix ordering of link creation between node->prev and prev->next in
osq_lock(). Consider a case in which the optimistic spin queue is
CPU6->CPU2 and CPU6 has acquired the lock.

        tail
          v
  ,-. <- ,-.
  |6|    |2|
  `-' -> `-'

At this point if CPU0 comes in to acquire osq_lock, it will update the
tail count.

  CPU2			CPU0
  ----------------------------------

				       tail
				         v
			  ,-. <- ,-.    ,-.
			  |6|    |2|    |0|
			  `-' -> `-'    `-'

After tail count update if CPU2 starts to unqueue itself from
optimistic spin queue, it will find an updated tail count with CPU0 and
update CPU2 node->next to NULL in osq_wait_next().

  unqueue-A

	       tail
	         v
  ,-. <- ,-.    ,-.
  |6|    |2|    |0|
  `-'    `-'    `-'

  unqueue-B

  ->tail != curr && !node->next

If the following stores are reordered, then prev->next (where prev is
CPU2) would be updated to point to the CPU0 node:

				       tail
				         v
			  ,-. <- ,-.    ,-.
			  |6|    |2|    |0|
			  `-'    `-' -> `-'

  osq_wait_next()
    node->next <- 0
    xchg(node->next, NULL)

	       tail
	         v
  ,-. <- ,-.    ,-.
  |6|    |2|    |0|
  `-'    `-'    `-'

  unqueue-C

At this point if next instruction
	WRITE_ONCE(next->prev, prev);
in CPU2 path is committed before the update of CPU0 node->prev = prev then
CPU0 node->prev will point to CPU6 node.

	       tail
    v----------. v
  ,-. <- ,-.    ,-.
  |6|    |2|    |0|
  `-'    `-'    `-'
     `----------^

If CPU0 path's node->prev = prev is then committed, CPU0's prev changes
back to the CPU2 node. CPU2 node->next is currently NULL,

				       tail
			                 v
			  ,-. <- ,-. <- ,-.
			  |6|    |2|    |0|
			  `-'    `-'    `-'
			     `----------^

so if CPU0 gets into the unqueue path of osq_lock it will keep spinning
in an infinite loop, as the condition prev->next == node will never be true.

Signed-off-by: Prateek Sood <prsood@codeaurora.org>
[ Added pictures, rewrote comments. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: sramana@codeaurora.org
Link: http://lkml.kernel.org/r/1500040076-27626-1-git-send-email-prsood@codeaurora.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-19 22:48:56 +02:00
Prateek Sood
70cc08c44f locking/rwsem-xadd: Fix missed wakeup due to reordering of load
commit 9c29c31830a4eca724e137a9339137204bbb31be upstream.

If a spinner is present, there is a chance that the load of
rwsem_has_spinner() in rwsem_wake() can be reordered with
respect to decrement of rwsem count in __up_write() leading
to wakeup being missed:

 spinning writer                  up_write caller
 ---------------                  -----------------------
 [S] osq_unlock()                 [L] osq
  spin_lock(wait_lock)
  sem->count=0xFFFFFFFF00000001
            +0xFFFFFFFF00000000
  count=sem->count
  MB
                                   sem->count=0xFFFFFFFE00000001
                                             -0xFFFFFFFF00000001
                                   spin_trylock(wait_lock)
                                   return
 rwsem_try_write_lock(count)
 spin_unlock(wait_lock)
 schedule()

Reordering of atomic_long_sub_return_release() in __up_write()
and rwsem_has_spinner() in rwsem_wake() can cause a missed
wakeup in up_write() context. In the spinning writer, sem->count
and the local variable count are 0xFFFFFFFE00000001. This results
in rwsem_try_write_lock() failing to acquire the rwsem and the spinning
writer going to sleep in rwsem_down_write_failed().

The smp_rmb() will make sure that the spinner state is
consulted after sem->count is updated in up_write context.

Signed-off-by: Prateek Sood <prsood@codeaurora.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dave@stgolabs.net
Cc: longman@redhat.com
Cc: parri.andrea@gmail.com
Cc: sramana@codeaurora.org
Link: http://lkml.kernel.org/r/1504794658-15397-1-git-send-email-prsood@codeaurora.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-19 22:48:56 +02:00
Steven Rostedt (VMware)
c40dc96f7f locking/lockdep: Do not record IRQ state within lockdep code
[ Upstream commit fcc784be837714a9173b372ff9fb9b514590dad9 ]

While debugging where things were going wrong with mapping
enabling/disabling interrupts with the lockdep state and actual real
enabling and disabling interrupts, I had to silence the IRQ
disabling/enabling in debug_check_no_locks_freed() because it was
always showing up as it was called before the splat was.

Use raw_local_irq_save/restore() for not only debug_check_no_locks_freed()
but for all internal lockdep functions, as they hide useful information
about where interrupts were used incorrectly last.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Link: https://lkml.kernel.org/lkml/20180404140630.3f4f4c7a@gandalf.local.home
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-08-24 13:26:55 +02:00
Will Deacon
abd9138a1b locking/qspinlock: Ensure node->count is updated before initialising node
[ Upstream commit 11dc13224c975efcec96647a4768a6f1bb7a19a8 ]

When queuing on the qspinlock, the count field for the current CPU's head
node is incremented. This needn't be atomic because locking in e.g. IRQ
context is balanced and so an IRQ will return with node->count as it
found it.

However, the compiler could in theory reorder the initialisation of
node[idx] before the increment of the head node->count, causing an
IRQ to overwrite the initialised node and potentially corrupt the lock
state.

Avoid the potential for this harmful compiler reordering by placing a
barrier() between the increment of the head node->count and the subsequent
node initialisation.
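
The ordering described above can be modelled in user space. This is a minimal sketch and not the kernel code: the struct layout, the array size and the names claim_node(), release_node() and compiler_barrier() are assumptions made for illustration, and the barrier is written with the GCC/Clang inline-asm idiom.

```c
#include <assert.h>
#include <stddef.h>

/* Compiler-only barrier (GCC/Clang syntax): emits no instruction, but
 * the compiler may not move memory accesses across it. */
#define compiler_barrier() __asm__ __volatile__("" ::: "memory")

#define MAX_NODES 4

/* Hypothetical stand-in for the per-CPU queue nodes. */
struct model_node {
	struct model_node *next;
	int locked;
	int count;	/* only meaningful in nodes[0], the head node */
};

static struct model_node nodes[MAX_NODES];

/* Reserve the next free slot, then initialise it. */
static struct model_node *claim_node(void)
{
	int idx = nodes[0].count++;	/* reserve slot idx first ... */

	compiler_barrier();		/* ... and keep the stores below after it */

	nodes[idx].locked = 0;
	nodes[idx].next = NULL;
	return &nodes[idx];
}

/* Give back the most recently claimed slot. */
static void release_node(void)
{
	nodes[0].count--;
}
```

Without the barrier, the compiler would be free to emit the two initialising stores before the increment. An interrupt arriving in between would then reserve the same slot and overwrite it.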

Signed-off-by: Will Deacon <will.deacon@arm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1518528177-19169-3-git-send-email-will.deacon@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30 07:48:57 +02:00
Davidlohr Bueso
bd44e3f19d locking/mutex: Allow next waiter lockless wakeup
commit 1329ce6fbbe4536592dfcfc8d64d61bfeb598fe6 upstream.

Make use of wake-queues and enable the wakeup to occur after releasing the
wait_lock. This is similar to what we do with rtmutex top waiter,
slightly shortening the critical region and allowing other waiters to
acquire the wait_lock sooner. In low contention cases it can also help
the recently woken waiter to find the wait_lock available (fastpath)
when it continues execution.

Reviewed-by: Waiman Long <Waiman.Long@hpe.com>
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ding Tianhong <dingtianhong@huawei.com>
Cc: Jason Low <jason.low2@hp.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@us.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Waiman Long <waiman.long@hpe.com>
Cc: Will Deacon <Will.Deacon@arm.com>
Link: http://lkml.kernel.org/r/20160125022343.GA3322@linux-uzut.site
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-17 09:35:27 +01:00
Peter Zijlstra
28eab3db72 locking/lockdep: Add nest_lock integrity test
[ Upstream commit 7fb4a2cea6b18dab56d609530d077f168169ed6b ]

Boqun reported that hlock->references can overflow. Add a debug test
for that to generate a clear error when this happens.

Without this, lockdep is likely to report a mysterious failure on
unlock.
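
A debug test of this kind might look like the sketch below. It assumes the counter is a 12-bit bitfield; the struct and the name take_nested_ref() are hypothetical and only model the idea of refusing an increment that would wrap.

```c
#include <assert.h>
#include <stdbool.h>

#define REF_BITS 12			/* assumed width of the counter */
#define REF_MAX  ((1u << REF_BITS) - 1)

/* Hypothetical stand-in for a held-lock record. */
struct model_held_lock {
	unsigned int references : REF_BITS;
};

/* Take one more nested reference; fail loudly instead of wrapping to 0. */
static bool take_nested_ref(struct model_held_lock *h)
{
	if (h->references == REF_MAX)
		return false;		/* would overflow: report it here */
	h->references++;
	return true;
}
```

Rejecting the increment at the point of overflow gives a clear error at acquisition time, rather than a confusing failure later when the wrapped count no longer matches the unlocks.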

Reported-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nicolai Hähnle <Nicolai.Haehnle@amd.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-21 17:09:03 +02:00
Yang Shi
10863607c2 locktorture: Fix potential memory leak with rw lock test
commit f4dbba591945dc301c302672adefba9e2ec08dc5 upstream.

When running locktorture module with the below commands with kmemleak enabled:

$ modprobe locktorture torture_type=rw_lock_irq
$ rmmod locktorture

The below kmemleak got caught:

root@10:~# echo scan > /sys/kernel/debug/kmemleak
[  323.197029] kmemleak: 2 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
root@10:~# cat /sys/kernel/debug/kmemleak
unreferenced object 0xffffffc07592d500 (size 128):
  comm "modprobe", pid 368, jiffies 4294924118 (age 205.824s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 c3 7b 02 00 00 00 00 00  .........{......
    00 00 00 00 00 00 00 00 d7 9b 02 00 00 00 00 00  ................
  backtrace:
    [<ffffff80081e5a88>] create_object+0x110/0x288
    [<ffffff80086c6078>] kmemleak_alloc+0x58/0xa0
    [<ffffff80081d5acc>] __kmalloc+0x234/0x318
    [<ffffff80006fa130>] 0xffffff80006fa130
    [<ffffff8008083ae4>] do_one_initcall+0x44/0x138
    [<ffffff800817e28c>] do_init_module+0x68/0x1cc
    [<ffffff800811c848>] load_module+0x1a68/0x22e0
    [<ffffff800811d340>] SyS_finit_module+0xe0/0xf0
    [<ffffff80080836f0>] el0_svc_naked+0x24/0x28
    [<ffffffffffffffff>] 0xffffffffffffffff
unreferenced object 0xffffffc07592d480 (size 128):
  comm "modprobe", pid 368, jiffies 4294924118 (age 205.824s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 3b 6f 01 00 00 00 00 00  ........;o......
    00 00 00 00 00 00 00 00 23 6a 01 00 00 00 00 00  ........#j......
  backtrace:
    [<ffffff80081e5a88>] create_object+0x110/0x288
    [<ffffff80086c6078>] kmemleak_alloc+0x58/0xa0
    [<ffffff80081d5acc>] __kmalloc+0x234/0x318
    [<ffffff80006fa22c>] 0xffffff80006fa22c
    [<ffffff8008083ae4>] do_one_initcall+0x44/0x138
    [<ffffff800817e28c>] do_init_module+0x68/0x1cc
    [<ffffff800811c848>] load_module+0x1a68/0x22e0
    [<ffffff800811d340>] SyS_finit_module+0xe0/0xf0
    [<ffffff80080836f0>] el0_svc_naked+0x24/0x28
    [<ffffffffffffffff>] 0xffffffffffffffff

This is because cxt.lwsa and cxt.lrsa don't get freed in module_exit, so free
them in lock_torture_cleanup(), and free writer_tasks if the memory allocation
for reader_tasks fails.
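
The general pattern (release earlier allocations when a later one fails, and free everything on cleanup) can be sketched in user space as follows. The struct, the field names and the fail_readers test hook are assumptions for illustration and do not reproduce locktorture itself.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical context holding two separately allocated arrays. */
struct model_ctx {
	long *writer_stats;
	long *reader_stats;
};

/* Allocate both arrays; on failure leave nothing allocated.
 * fail_readers simulates an allocation failure for the second array. */
static int model_init(struct model_ctx *c, size_t nw, size_t nr, int fail_readers)
{
	c->reader_stats = NULL;
	c->writer_stats = calloc(nw, sizeof(*c->writer_stats));
	if (!c->writer_stats)
		return -1;

	if (!fail_readers)
		c->reader_stats = calloc(nr, sizeof(*c->reader_stats));
	if (!c->reader_stats) {
		free(c->writer_stats);	/* undo the first allocation */
		c->writer_stats = NULL;
		return -1;
	}
	return 0;
}

/* Counterpart of init: free both arrays (free(NULL) is a no-op). */
static void model_cleanup(struct model_ctx *c)
{
	free(c->reader_stats);
	free(c->writer_stats);
	c->reader_stats = NULL;
	c->writer_stats = NULL;
}
```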

Signed-off-by: Yang Shi <yang.shi@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Cc: 石洋 <yang.s@alibaba-inc.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-09-13 14:09:46 -07:00
Thomas Gleixner
c6a5bf4cda locking/rtmutex: Use READ_ONCE() in rt_mutex_owner()
commit 1be5d4fa0af34fb7bafa205aeb59f5c7cc7a089d upstream.

While debugging the rtmutex unlock vs. dequeue race Will suggested to use
READ_ONCE() in rt_mutex_owner() as it might race against the
cmpxchg_release() in unlock_rt_mutex_safe().

Will: "It's a minor thing which will most likely not matter in practice"

Careful search did not unearth an actual problem in today's code, but it's
better to be safe than surprised.

Suggested-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: David Daney <ddaney@caviumnetworks.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20161130210030.431379999@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-12-15 08:49:22 -08:00
Thomas Gleixner
b27d9147f2 locking/rtmutex: Prevent dequeue vs. unlock race
commit dbb26055defd03d59f678cb5f2c992abe05b064a upstream.

David reported a futex/rtmutex state corruption. It's caused by the
following problem:

CPU0		CPU1		CPU2

l->owner=T1
		rt_mutex_lock(l)
		lock(l->wait_lock)
		l->owner = T1 | HAS_WAITERS;
		enqueue(T2)
		boost()
		  unlock(l->wait_lock)
		schedule()

				rt_mutex_lock(l)
				lock(l->wait_lock)
				l->owner = T1 | HAS_WAITERS;
				enqueue(T3)
				boost()
				  unlock(l->wait_lock)
				schedule()
		signal(->T2)	signal(->T3)
		lock(l->wait_lock)
		dequeue(T2)
		deboost()
		  unlock(l->wait_lock)
				lock(l->wait_lock)
				dequeue(T3)
				  ===> wait list is now empty
				deboost()
				 unlock(l->wait_lock)
		lock(l->wait_lock)
		fixup_rt_mutex_waiters()
		  if (wait_list_empty(l)) {
		    owner = l->owner & ~HAS_WAITERS;
		    l->owner = owner
		     ==> l->owner = T1
		  }

				lock(l->wait_lock)
rt_mutex_unlock(l)		fixup_rt_mutex_waiters()
				  if (wait_list_empty(l)) {
				    owner = l->owner & ~HAS_WAITERS;
cmpxchg(l->owner, T1, NULL)
 ===> Success (l->owner = NULL)
				    l->owner = owner
				     ==> l->owner = T1
				  }

That means the problem is caused by fixup_rt_mutex_waiters() which does the
RMW to clear the waiters bit unconditionally when there are no waiters in
the rtmutex's rbtree.

This can be fatal: A concurrent unlock can release the rtmutex in the
fastpath because the waiters bit is not set. If the cmpxchg() gets in the
middle of the RMW operation then the previous owner, which just unlocked
the rtmutex, is set as the owner again when the write takes place after the
successful cmpxchg().

The solution is rather trivial: verify that the owner member of the rtmutex
has the waiters bit set before clearing it. This does not require a
cmpxchg() or other atomic operations because the waiters bit can only be
set and cleared with the rtmutex wait_lock held. It's also safe against the
fast path unlock attempt. The unlock attempt via cmpxchg() will either see
the bit set and take the slowpath or see the bit cleared and release it
atomically in the fastpath.
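
A user-space model of the conditional clear is sketched below. The struct, the boolean standing in for the waiter tree and the name fixup_waiters() are assumptions for illustration; the caller is assumed to hold wait_lock, which the model does not represent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define HAS_WAITERS ((uintptr_t)1)	/* bit 0 of the owner word */

/* Hypothetical stand-in for the rtmutex. */
struct model_rtmutex {
	uintptr_t owner;	/* task pointer, with HAS_WAITERS in bit 0 */
	bool wait_list_empty;	/* stands in for the waiter rbtree */
};

/* Clear the waiters bit only if it is actually set. If the bit is
 * already clear, no store is issued, so a concurrent fast-path unlock
 * that has set owner to 0 cannot be overwritten with a stale owner. */
static void fixup_waiters(struct model_rtmutex *l)
{
	uintptr_t owner;

	if (!l->wait_list_empty)
		return;

	owner = l->owner;
	if (owner & HAS_WAITERS)
		l->owner = owner & ~HAS_WAITERS;
}
```

The key difference from the unconditional read-modify-write is the absence of any store when the bit is clear, which is the only state in which the fast-path cmpxchg() can succeed.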

It's remarkable that the test program provided by David triggers on ARM64
and MIPS64 really quickly, but it refuses to reproduce on x86-64, while the
problem exists there as well. That refusal might explain why this was not
discovered earlier despite the bug existing from day one of the rtmutex
implementation more than 10 years ago.

Thanks to David for meticulously instrumenting the code and providing the
information which allowed us to decode this subtle problem.

Reported-by: David Daney <ddaney@caviumnetworks.com>
Tested-by: David Daney <david.daney@cavium.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Fixes: 23f78d4a03 ("[PATCH] pi-futex: rt mutex core")
Link: http://lkml.kernel.org/r/20161130210030.351136722@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-12-15 08:49:22 -08:00
Peter Zijlstra
a39e660a55 locking/qspinlock: Fix spin_unlock_wait() some more
commit 2c610022711675ee908b903d242f0b90e1db661f upstream.

While this prior commit:

  54cf809b9512 ("locking,qspinlock: Fix spin_is_locked() and spin_unlock_wait()")

... fixes spin_is_locked() and spin_unlock_wait() for the usage
in ipc/sem and netfilter, it does not in fact work right for the
usage in task_work and futex.

So while the 2 locks crossed problem:

	spin_lock(A)		spin_lock(B)
	if (!spin_is_locked(B)) spin_unlock_wait(A)
	  foo()			foo();

... works with the smp_mb() injected by both spin_is_locked() and
spin_unlock_wait(), this is not sufficient for:

	flag = 1;
	smp_mb();		spin_lock()
	spin_unlock_wait()	if (!flag)
				  // add to lockless list
	// iterate lockless list

... because in this scenario, the store from spin_lock() can be delayed
past the load of flag, uncrossing the variables and losing the
guarantee.

This patch reworks spin_is_locked() and spin_unlock_wait() to work in
both cases by exploiting the observation that while the lock byte
store can be delayed, the contender must have registered itself
visibly in other state contained in the word.

It also allows for architectures to override both functions, as PPC
and ARM64 have an additional issue for which we currently have no
generic solution.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Giovanni Gherdovich <ggherdovich@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <waiman.long@hpe.com>
Cc: Will Deacon <will.deacon@arm.com>
Fixes: 54cf809b9512 ("locking,qspinlock: Fix spin_is_locked() and spin_unlock_wait()")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-07-27 09:47:29 -07:00
Chris Wilson
c7f47e59c3 locking/ww_mutex: Report recursive ww_mutex locking early
commit 0422e83d84ae24b933e4b0d4c1e0f0b4ae8a0a3b upstream.

Recursive locking for ww_mutexes was originally conceived as an
exception. However, it is heavily used by the DRM atomic modesetting
code. Currently, the recursive deadlock is checked after we have queued
up for a busy-spin and as we never release the lock, we spin until
kicked, whereupon the deadlock is discovered and reported.

A simple solution for the now common problem is to move the recursive
deadlock discovery to the first action when taking the ww_mutex.

Suggested-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1464293297-19777-1-git-send-email-chris@chris-wilson.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-07-27 09:47:29 -07:00
Peter Zijlstra
23a67ddd46 locking/mcs: Fix mcs_spin_lock() ordering
commit 920c720aa5aa3900a7f1689228fdfc2580a91e7e upstream.

Similar to commit b4b29f9485 ("locking/osq: Fix ordering of node
initialisation in osq_lock") the use of xchg_acquire() is
fundamentally broken with MCS like constructs.

Furthermore, it turns out we rely on the global transitivity of this
operation because the unlock path observes the pointer with a
READ_ONCE(), not an smp_load_acquire().

This is non-critical because the MCS code isn't actually used and
mostly serves as documentation, a stepping stone to the more complex
things we've built on top of the idea.

Reported-by: Andrea Parri <parri.andrea@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Fixes: 3552a07a9c ("locking/mcs: Use acquire/release semantics")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-05-04 14:48:50 -07:00
Will Deacon
b4b29f9485 locking/osq: Fix ordering of node initialisation in osq_lock
The Cavium guys reported a soft lockup on their arm64 machine, caused by
commit c55a6ffa62 ("locking/osq: Relax atomic semantics"):

    mutex_optimistic_spin+0x9c/0x1d0
    __mutex_lock_slowpath+0x44/0x158
    mutex_lock+0x54/0x58
    kernfs_iop_permission+0x38/0x70
    __inode_permission+0x88/0xd8
    inode_permission+0x30/0x6c
    link_path_walk+0x68/0x4d4
    path_openat+0xb4/0x2bc
    do_filp_open+0x74/0xd0
    do_sys_open+0x14c/0x228
    SyS_openat+0x3c/0x48
    el0_svc_naked+0x24/0x28

This is because in osq_lock we initialise the node for the current CPU:

    node->locked = 0;
    node->next = NULL;
    node->cpu = curr;

and then publish the current CPU in the lock tail:

    old = atomic_xchg_acquire(&lock->tail, curr);

Once the update to lock->tail is visible to another CPU, the node is
then live and can be both read and updated by concurrent lockers.

Unfortunately, the ACQUIRE semantics of the xchg operation mean that
there is no guarantee the contents of the node will be visible before
lock tail is updated.  This can lead to lock corruption when, for
example, a concurrent locker races to set the next field.
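
The publish step can be modelled with C11 atomics. This is a sketch under assumptions: the struct names and publish_node() are hypothetical, and a sequentially consistent exchange is used here as one way to guarantee that the node initialisation is visible before the tail update, since an acquire-only exchange gives no such guarantee for the preceding stores.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define OSQ_UNLOCKED 0

/* Hypothetical stand-ins for the queue node and the lock. */
struct model_osq_node {
	struct model_osq_node *next;
	int locked;
	int cpu;
};

struct model_osq {
	atomic_int tail;	/* encoded CPU number of the last queued node */
};

/* Initialise the node, then publish it through the tail.
 * Returns the previous tail value. */
static int publish_node(struct model_osq *lock, struct model_osq_node *node, int curr)
{
	node->locked = 0;
	node->next = NULL;
	node->cpu = curr;

	/* seq_cst includes release ordering: the three stores above
	 * cannot become visible after the new tail value. */
	return atomic_exchange_explicit(&lock->tail, curr, memory_order_seq_cst);
}
```

With memory_order_acquire in place of memory_order_seq_cst, only accesses after the exchange would be ordered, which is the weakness the commit message describes.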

Fixes: c55a6ffa62 ("locking/osq: Relax atomic semantics")
Reported-by: David Daney <ddaney@caviumnetworks.com>
Reported-by: Andrew Pinski <andrew.pinski@caviumnetworks.com>
Tested-by: Andrew Pinski <andrew.pinski@caviumnetworks.com>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1449856001-21177-1-git-send-email-will.deacon@arm.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-12-17 11:40:29 -08:00
Peter Zijlstra
90eec103b9 treewide: Remove old email address
There were still a number of references to my old Red Hat email
address in the kernel source. Remove these while keeping the
Red Hat copyright notices intact.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-11-23 09:44:58 +01:00
Mel Gorman
d0164adc89 mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd
__GFP_WAIT has been used to identify atomic context in callers that hold
spinlocks or are in interrupts.  They are expected to be high priority and
have access to one of two watermarks lower than "min" which can be referred
to as the "atomic reserve".  __GFP_HIGH users get access to the first
lower watermark and can be called the "high priority reserve".

Over time, callers had a requirement to not block when fallback options
were available.  Some have abused __GFP_WAIT leading to a situation where
an optimistic allocation with a fallback option can access atomic
reserves.

This patch uses __GFP_ATOMIC to identify callers that are truly atomic,
cannot sleep and have no alternative.  High priority users continue to use
__GFP_HIGH.  __GFP_DIRECT_RECLAIM identifies callers that can sleep and
are willing to enter direct reclaim.  __GFP_KSWAPD_RECLAIM identifies
callers that want to wake kswapd for background reclaim.  __GFP_WAIT is
redefined as a caller that is willing to enter direct reclaim and wake
kswapd for background reclaim.

This patch then converts a number of sites:

o __GFP_ATOMIC is used by callers that are high priority and have memory
  pools for those requests. GFP_ATOMIC uses this flag.

o Callers that have a limited mempool to guarantee forward progress clear
  __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
  into this category where kswapd will still be woken but atomic reserves
  are not used as there is a one-entry mempool to guarantee progress.

o Callers that are checking if they are non-blocking should use the
  helper gfpflags_allow_blocking() where possible. This is because
  checking for __GFP_WAIT as was done historically now can trigger false
  positives. Some exceptions like dm-crypt.c exist where the code intent
  is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
  flag manipulations.

o Callers that built their own GFP flags instead of starting with GFP_KERNEL
  and friends now also need to specify __GFP_KSWAPD_RECLAIM.
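
The flag relationships in the list above can be sketched as follows. The bit values and the MODEL_ names are placeholders chosen for illustration and are not the kernel's definitions; model_allow_blocking() mirrors the idea behind gfpflags_allow_blocking(), which tests for the direct-reclaim bit.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int model_gfp_t;

/* Placeholder bit values for illustration only. */
#define MODEL_GFP_HIGH			0x01u
#define MODEL_GFP_ATOMIC		0x02u
#define MODEL_GFP_DIRECT_RECLAIM	0x04u
#define MODEL_GFP_KSWAPD_RECLAIM	0x08u

/* The redefined "wait": may enter direct reclaim and may wake kswapd. */
#define MODEL_GFP_WAIT (MODEL_GFP_DIRECT_RECLAIM | MODEL_GFP_KSWAPD_RECLAIM)

/* A caller may block only if it is willing to enter direct reclaim. */
static bool model_allow_blocking(model_gfp_t flags)
{
	return (flags & MODEL_GFP_DIRECT_RECLAIM) != 0;
}

/* Mempool-style caller: never block, but still let kswapd be woken. */
static model_gfp_t model_mempool_flags(model_gfp_t flags)
{
	return (flags & ~MODEL_GFP_DIRECT_RECLAIM) | MODEL_GFP_KSWAPD_RECLAIM;
}
```

Testing the combined wait mask with a plain AND would report "can block" for a caller that kept only the kswapd bit, which is the false positive the helper avoids.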

The first key hazard to watch out for is callers that removed __GFP_WAIT
and were depending on access to atomic reserves for inconspicuous reasons.
In some cases it may be appropriate for them to use __GFP_HIGH.

The second key hazard is callers that assembled their own combination of
GFP flags instead of starting with something like GFP_KERNEL.  They may
now wish to specify __GFP_KSWAPD_RECLAIM.  It's almost certainly harmless
if it's missed in most cases as other activity will wake kswapd.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06 17:50:42 -08:00
Linus Torvalds
53528695ff Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler changes from Ingo Molnar:
 "The main changes in this cycle were:

   - sched/fair load tracking fixes and cleanups (Byungchul Park)

   - Make load tracking frequency scale invariant (Dietmar Eggemann)

   - sched/deadline updates (Juri Lelli)

   - stop machine fixes, cleanups and enhancements for bugs triggered by
     CPU hotplug stress testing (Oleg Nesterov)

   - scheduler preemption code rework: remove PREEMPT_ACTIVE and related
     cleanups (Peter Zijlstra)

   - Rework the sched_info::run_delay code to fix races (Peter Zijlstra)

   - Optimize per entity utilization tracking (Peter Zijlstra)

   - ... misc other fixes, cleanups and smaller updates"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (57 commits)
  sched: Don't scan all-offline ->cpus_allowed twice if !CONFIG_CPUSETS
  sched: Move cpu_active() tests from stop_two_cpus() into migrate_swap_stop()
  sched: Start stopper early
  stop_machine: Kill cpu_stop_threads->setup() and cpu_stop_unpark()
  stop_machine: Kill smp_hotplug_thread->pre_unpark, introduce stop_machine_unpark()
  stop_machine: Change cpu_stop_queue_two_works() to rely on stopper->enabled
  stop_machine: Introduce __cpu_stop_queue_work() and cpu_stop_queue_two_works()
  stop_machine: Ensure that a queued callback will be called before cpu_stop_park()
  sched/x86: Fix typo in __switch_to() comments
  sched/core: Remove a parameter in the migrate_task_rq() function
  sched/core: Drop unlikely behind BUG_ON()
  sched/core: Fix task and run queue sched_info::run_delay inconsistencies
  sched/numa: Fix task_tick_fair() from disabling numa_balancing
  sched/core: Add preempt_count invariant check
  sched/core: More notrace annotations
  sched/core: Kill PREEMPT_ACTIVE
  sched/core, sched/x86: Kill thread_info::saved_preempt_count
  sched/core: Simplify preempt_count tests
  sched/core: Robustify preemption leak checks
  sched/core: Stop setting PREEMPT_ACTIVE
  ...
2015-11-03 18:03:50 -08:00
Linus Torvalds
d63a978865 Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking changes from Ingo Molnar:
 "The main changes in this cycle were:

   - More gradual enhancements to atomic ops: new atomic*_read_ctrl()
     ops, synchronize atomic_{read,set}() ordering requirements between
     architectures, add atomic_long_t bitops.  (Peter Zijlstra)

   - Add _{relaxed|acquire|release}() variants for inc/dec atomics and
     use them in various locking primitives: mutex, rtmutex, mcs, rwsem.
     This enables weakly ordered architectures (such as arm64) to make
     use of more locking related optimizations.  (Davidlohr Bueso)

   - Implement atomic[64]_{inc,dec}_relaxed() on ARM.  (Will Deacon)

   - Futex kernel data cache footprint micro-optimization.  (Rasmus
     Villemoes)

   - pvqspinlock runtime overhead micro-optimization.  (Waiman Long)

   - misc smaller fixlets"

* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  ARM, locking/atomics: Implement _relaxed variants of atomic[64]_{inc,dec}
  locking/rwsem: Use acquire/release semantics
  locking/mcs: Use acquire/release semantics
  locking/rtmutex: Use acquire/release semantics
  locking/mutex: Use acquire/release semantics
  locking/asm-generic: Add _{relaxed|acquire|release}() variants for inc/dec atomics
  atomic: Implement atomic_read_ctrl()
  atomic, arch: Audit atomic_{read,set}()
  atomic: Add atomic_long_t bitops
  futex: Force hot variables into a single cache line
  locking/pvqspinlock: Kick the PV CPU unconditionally when _Q_SLOW_VAL
  locking/osq: Relax atomic semantics
  locking/qrwlock: Rename ->lock to ->wait_lock
  locking/Documentation/lockstat: Fix typo - lokcing -> locking
  locking/atomics, cmpxchg: Privatize the inclusion of asm/cmpxchg.h
2015-11-03 16:10:43 -08:00
Ingo Molnar
c13dc31adb Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull RCU updates from Paul E. McKenney:

  - Miscellaneous fixes. (Paul E. McKenney, Boqun Feng, Oleg Nesterov, Patrick Marlier)

  - Improvements to expedited grace periods. (Paul E. McKenney)

  - Performance improvements to and locktorture tests for percpu-rwsem.
    (Oleg Nesterov, Paul E. McKenney)

  - Torture-test changes. (Paul E. McKenney, Davidlohr Bueso)

  - Documentation updates. (Paul E. McKenney)

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-19 10:09:54 +02:00
Paul E. McKenney
39cd2dd39a Merge branches 'doc.2015.10.06a', 'percpu-rwsem.2015.10.06a' and 'torture.2015.10.06a' into HEAD
doc.2015.10.06a:  Documentation updates.
percpu-rwsem.2015.10.06a:  Optimization of per-CPU reader-writer semaphores.
torture.2015.10.06a:  Torture-test updates.
2015-10-07 16:06:25 -07:00
Paul E. McKenney
a36a99618b locktorture: Fix module unwind when bad torture_type specified
The locktorture module has a list of torture types, and specifying
a type not on this list is supposed to cleanly fail the module load.
Unfortunately, the "fail" happens without the "cleanly".  This commit
therefore adds the needed clean-up after an incorrect torture_type.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2015-10-06 11:28:44 -07:00
Oleg Nesterov
cc5f730b41 locking/percpu-rwsem: Clean up the lockdep annotations in percpu_down_read()
Based on Peter Zijlstra's earlier patch.

Change percpu_down_read() to use __down_read(); this way we can
do rwsem_acquire_read() unconditionally at the start to make this
code more symmetric and clean.

Originally-From: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2015-10-06 11:25:40 -07:00
Oleg Nesterov
f324a76324 locking/percpu-rwsem: Fix the comments outdated by rcu_sync
Update the comments broken by the previous change.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2015-10-06 11:25:36 -07:00
Oleg Nesterov
001dac627f locking/percpu-rwsem: Make use of the rcu_sync infrastructure
Currently down_write/up_write calls synchronize_sched_expedited()
twice, which is evil.  Change this code to rely on rcu-sync primitives.
This avoids the _expedited "big hammer", and this can be faster in
the contended case or even in the case when a single thread does
down_write/up_write in a loop.

Of course, a single down_write() will take more time, but on the other
hand it will be much more friendly to the whole system.

To simplify the review this patch doesn't update the comments, fixed
by the next change.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2015-10-06 11:25:31 -07:00
Oleg Nesterov
95b19f684c locking/percpu-rwsem: Make percpu_free_rwsem() after kzalloc() safe
This is a temporary ugly hack that will be reverted later. We only
need it to ensure that the next patch will not break "change sb_writers
to use percpu_rw_semaphore" patches routed via the VFS tree.

The alloc_super()->destroy_super() error path assumes that it is safe
to call percpu_free_rwsem() after kzalloc() without percpu_init_rwsem(),
so let's not disappoint it.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2015-10-06 11:25:26 -07:00
Paul E. McKenney
617783dd99 locktorture: Add torture tests for percpu_rwsem
This commit adds percpu_rwsem tests based on the earlier rwsem tests.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2015-10-06 11:24:56 -07:00
Paul E. McKenney
302707fd7c locking/percpu-rwsem: Export symbols for locktorture
This commit exports percpu_down_read(), percpu_down_write(),
__percpu_init_rwsem(), percpu_up_read(), and percpu_up_write() to allow
locktorture to test them when built as a module.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2015-10-06 11:24:51 -07:00
Davidlohr Bueso
095777c417 locktorture: Support rtmutex torturing
Real-time mutexes are one of the few general primitives
that we do not have in locktorture. Address this -- a few
considerations:

o To spice things up, enable competing thread(s) to become
rt, such that we can stress different prio boosting paths
in the rtmutex code. Introduce a ->task_boost callback,
only used by rtmutex-torturer. Tasks will boost/deboost
around every 50k (arbitrarily) lock/unlock operations.

o Hold times are similar to what we have for other locks:
only occasionally having longer hold times (per ~200k ops).
So we roughly do two full rt boost+deboosting ops with
short hold times.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2015-10-06 11:24:40 -07:00
Davidlohr Bueso
00eb4bab69 locking/rwsem: Use acquire/release semantics
As of 654672d4ba (locking/atomics: Add _{acquire|release|relaxed}()
variants of some atomic operations) and 6d79ef2d30 (locking, asm-generic:
Add _{relaxed|acquire|release}() variants for 'atomic_long_t'), weakly
ordered archs can benefit from more relaxed use of barriers when locking
and unlocking, instead of regular full barrier semantics. While currently
only arm64 supports such optimizations, updating corresponding locking
primitives serves for other archs to immediately benefit as well, once the
necessary machinery is implemented of course.
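
As a rough illustration of the acquire/release pairing described above, the
following userspace sketch uses C11 atomics as a stand-in for the kernel's
_acquire()/_release() variants. The type and function names (toy_lock,
toy_trylock, toy_unlock) are hypothetical and do not appear in the kernel:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical toy lock, not a kernel type: 0 = unlocked, 1 = locked. */
struct toy_lock {
	atomic_int val;
};

/*
 * Acquire ordering on success: accesses in the critical section cannot be
 * reordered before the lock is taken. A failed attempt needs no ordering.
 */
static bool toy_trylock(struct toy_lock *l)
{
	int expected = 0;

	return atomic_compare_exchange_strong_explicit(&l->val, &expected, 1,
						       memory_order_acquire,
						       memory_order_relaxed);
}

/*
 * Release ordering: accesses in the critical section cannot be reordered
 * after the store that drops the lock.
 */
static void toy_unlock(struct toy_lock *l)
{
	atomic_store_explicit(&l->val, 0, memory_order_release);
}
```

Neither operation is a full barrier, which is where weakly ordered
architectures can save work compared with fully ordered atomics.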

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/1443643395-17016-6-git-send-email-dave@stgolabs.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-06 17:28:24 +02:00
Davidlohr Bueso
3552a07a9c locking/mcs: Use acquire/release semantics
As of 654672d4ba (locking/atomics: Add _{acquire|release|relaxed}()
variants of some atomic operations) and 6d79ef2d30 (locking, asm-generic:
Add _{relaxed|acquire|release}() variants for 'atomic_long_t'), weakly
ordered archs can benefit from more relaxed use of barriers when locking
and unlocking, instead of regular full barrier semantics. While currently
only arm64 supports such optimizations, updating corresponding locking
primitives serves for other archs to immediately benefit as well, once the
necessary machinery is implemented of course.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/1443643395-17016-5-git-send-email-dave@stgolabs.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-06 17:28:23 +02:00
Davidlohr Bueso
700318d1d7 locking/rtmutex: Use acquire/release semantics
As of 654672d4ba (locking/atomics: Add _{acquire|release|relaxed}()
variants of some atomic operations) and 6d79ef2d30 (locking, asm-generic:
Add _{relaxed|acquire|release}() variants for 'atomic_long_t'), weakly
ordered archs can benefit from more relaxed use of barriers when locking
and unlocking, instead of regular full barrier semantics. While currently
only arm64 supports such optimizations, updating corresponding locking
primitives serves for other archs to immediately benefit as well, once the
necessary machinery is implemented of course.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/1443643395-17016-4-git-send-email-dave@stgolabs.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-06 17:28:22 +02:00
Davidlohr Bueso
81a43adae3 locking/mutex: Use acquire/release semantics
As of 654672d4ba (locking/atomics: Add _{acquire|release|relaxed}()
variants of some atomic operations) and 6d79ef2d30 (locking, asm-generic:
Add _{relaxed|acquire|release}() variants for 'atomic_long_t'), weakly
ordered archs can benefit from more relaxed use of barriers when locking
and unlocking, instead of regular full barrier semantics. While currently
only arm64 supports such optimizations, updating corresponding locking
primitives serves for other archs to immediately benefit as well, once the
necessary machinery is implemented of course.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/1443643395-17016-3-git-send-email-dave@stgolabs.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-06 17:28:20 +02:00
Ingo Molnar
fe19159225 Merge branch 'sched/urgent' into sched/core, to pick up fixes before applying new changes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-06 17:05:36 +02:00
Ingo Molnar
4bbffe718f Merge branch 'locking/urgent' into locking/core, to pick up fixes before applying new changes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-23 09:52:03 +02:00
Juri Lelli
f52405757e sched/deadline, locking/rtmutex: Fix open coded check in rt_mutex_waiter_less()
rt_mutex_waiter_less() check of task deadlines is open coded. Since this
is subject to wraparound bugs, make it use the correct helper.
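
The commit message does not name the helper, so the sketch below is only a
hypothetical stand-in (time_before_u64 and time_before_naive are made-up
names). It shows why an open-coded "<" on unsigned 64-bit timestamps breaks
near wraparound while a signed-difference comparison does not:

```c
#include <stdbool.h>
#include <stdint.h>

/* Open-coded check: gives the wrong answer once a timestamp has wrapped. */
static inline bool time_before_naive(uint64_t a, uint64_t b)
{
	return a < b;
}

/*
 * Wraparound-safe check: the unsigned subtraction wraps modulo 2^64, and
 * the sign of the difference tells which value is earlier, as long as the
 * two timestamps are less than 2^63 apart. The cast relies on the usual
 * two's-complement conversion, as provided by GCC and Clang.
 */
static inline bool time_before_u64(uint64_t a, uint64_t b)
{
	return (int64_t)(a - b) < 0;
}
```

With a = UINT64_MAX - 5 and b = 10 (b is 16 ticks later, after wrapping),
the naive check reports that a is not before b, while the signed-difference
check reports that it is.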

Reported-by: Luca Abeni <luca.abeni@unitn.it>
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1441188096-23021-4-git-send-email-juri.lelli@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-23 09:51:25 +02:00
Peter Zijlstra
21199f27b4 locking/lockdep: Fix hlock->pin_count reset on lock stack rebuilds
Various people reported hitting the "unpinning an unpinned lock"
warning. As it turns out there are 2 places where we take a lock out
of the middle of a stack, and in those cases it would fail to preserve
the pin_count when rebuilding the lock stack.

Reported-by: Sasha Levin <sasha.levin@oracle.com>
Reported-by: Tim Spriggs <tspriggs@apple.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: davej@codemonkey.org.uk
Link: http://lkml.kernel.org/r/20150916141040.GA11639@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-23 09:48:53 +02:00
Waiman Long
93edc8bd77 locking/pvqspinlock: Kick the PV CPU unconditionally when _Q_SLOW_VAL
If _Q_SLOW_VAL has been set, the vCPU state must have been vcpu_hashed.
The extra check at the end of __pv_queued_spin_unlock() is unnecessary
and can be removed.

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Douglas Hatch <doug.hatch@hp.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott J Norton <scott.norton@hp.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1441996658-62854-3-git-send-email-Waiman.Long@hpe.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-18 09:27:29 +02:00
Davidlohr Bueso
c55a6ffa62 locking/osq: Relax atomic semantics
... by using acquire/release for ops around the lock->tail. As such,
weakly ordered archs can benefit from more relaxed use of barriers
when issuing atomics.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <Waiman.Long@hpe.com>
Link: http://lkml.kernel.org/r/1442216244-4409-3-git-send-email-dave@stgolabs.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-18 09:27:29 +02:00
Davidlohr Bueso
6e1e519697 locking/qrwlock: Rename ->lock to ->wait_lock
... trivial, but reads a little nicer when we name our
actual primitive 'lock'.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <Waiman.Long@hpe.com>
Link: http://lkml.kernel.org/r/1442216244-4409-1-git-send-email-dave@stgolabs.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-18 09:27:29 +02:00
Linus Torvalds
9786cff38a Merge branch 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fixes from Ingo Molnar:
 "Spinlock performance regression fix, plus documentation fixes"

* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  locking/static_keys: Fix up the static keys documentation
  locking/qspinlock/x86: Only emit the test-and-set fallback when building guest support
  locking/qspinlock/x86: Fix performance regression under unaccelerated VMs
  locking/static_keys: Fix a silly typo
2015-09-17 08:45:23 -07:00
Ingo Molnar
c7ef92cea9 Linux 4.3-rc1

Merge tag 'v4.3-rc1' into locking/core, to refresh the tree

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-13 10:01:24 +02:00
Peter Zijlstra
43b3f02899 locking/qspinlock/x86: Fix performance regression under unaccelerated VMs
Dave ran into horrible performance on a VM without PARAVIRT_SPINLOCKS
set and Linus noted that the test-and-set implementation was retarded.

One should spin on the variable with a load, not an RMW.

While there, remove 'queued' from the name, as the lock isn't queued
at all, but a simple test-and-set.
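
The load-before-RMW idea can be sketched in userspace C11 as a
test-and-test-and-set lock. This is an illustration only, not the code
from this commit; tas_lock, tas_lock_acquire and tas_lock_release are
hypothetical names:

```c
#include <stdatomic.h>

/* Hypothetical toy lock: 0 = unlocked, 1 = locked. */
struct tas_lock {
	atomic_int val;
};

static void tas_lock_acquire(struct tas_lock *l)
{
	for (;;) {
		int expected = 0;

		/*
		 * Spin with plain loads while the lock is held. Waiters only
		 * read the cache line, so it can stay shared between CPUs.
		 */
		while (atomic_load_explicit(&l->val, memory_order_relaxed) != 0)
			;

		/* Attempt the read-modify-write only once the lock looks free. */
		if (atomic_compare_exchange_weak_explicit(&l->val, &expected, 1,
							  memory_order_acquire,
							  memory_order_relaxed))
			return;
	}
}

static void tas_lock_release(struct tas_lock *l)
{
	atomic_store_explicit(&l->val, 0, memory_order_release);
}
```

Spinning directly on the RMW instead would pull the cache line exclusive on
every iteration, which is the kind of traffic the message above warns about.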

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reported-by: Dave Chinner <david@fromorbit.com>
Tested-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <Waiman.Long@hp.com>
Cc: stable@vger.kernel.org # v4.2+
Link: http://lkml.kernel.org/r/20150904152523.GR18673@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-11 07:49:42 +02:00
Linus Torvalds
7d9071a095 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs updates from Al Viro:
 "In this one:

   - d_move fixes (Eric Biederman)

   - UFS fixes (me; locking is mostly sane now, a bunch of bugs in error
     handling ought to be fixed)

   - switch of sb_writers to percpu rwsem (Oleg Nesterov)

   - superblock scalability (Josef Bacik and Dave Chinner)

   - swapon(2) race fix (Hugh Dickins)"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (65 commits)
  vfs: Test for and handle paths that are unreachable from their mnt_root
  dcache: Reduce the scope of i_lock in d_splice_alias
  dcache: Handle escaped paths in prepend_path
  mm: fix potential data race in SyS_swapon
  inode: don't softlockup when evicting inodes
  inode: rename i_wb_list to i_io_list
  sync: serialise per-superblock sync operations
  inode: convert inode_sb_list_lock to per-sb
  inode: add hlist_fake to avoid the inode hash lock in evict
  writeback: plug writeback at a high level
  change sb_writers to use percpu_rw_semaphore
  shift percpu_counter_destroy() into destroy_super_work()
  percpu-rwsem: kill CONFIG_PERCPU_RWSEM
  percpu-rwsem: introduce percpu_rwsem_release() and percpu_rwsem_acquire()
  percpu-rwsem: introduce percpu_down_read_trylock()
  document rwsem_release() in sb_wait_write()
  fix the broken lockdep logic in __sb_start_write()
  introduce __sb_writers_{acquired,release}() helpers
  ufs_inode_get{frag,block}(): get rid of 'phys' argument
  ufs_getfrag_block(): tidy up a bit
  ...
2015-09-05 20:34:28 -07:00
Oleg Nesterov
bf3eac84c4 percpu-rwsem: kill CONFIG_PERCPU_RWSEM
Remove CONFIG_PERCPU_RWSEM; the next patch adds an unconditional
user of percpu_rw_semaphore.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
2015-08-15 13:52:11 +02:00
Oleg Nesterov
9287f6925a percpu-rwsem: introduce percpu_down_read_trylock()
Add percpu_down_read_trylock(); it will have a user soon.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
2015-08-15 13:52:10 +02:00
Will Deacon
77e430e3e4 locking/qrwlock: Make use of _{acquire|release|relaxed}() atomics
The qrwlock implementation is slightly heavy in its use of memory
barriers, mainly through the use of _cmpxchg() and _return() atomics, which
imply full barrier semantics.

This patch modifies the qrwlock code to use the more relaxed atomic
routines so that we can reduce the unnecessary barrier overhead on
weakly-ordered architectures.

Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman.Long@hp.com
Cc: paulmck@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/1438880084-18856-7-git-send-email-will.deacon@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-12 11:59:06 +02:00
Waiman Long
75d2270280 locking/pvqspinlock: Only kick CPU at unlock time
For an over-committed guest with more vCPUs than physical CPUs
available, it is possible that a vCPU may be kicked twice before
getting the lock - once before it becomes queue head and once again
before it gets the lock. All this CPU kicking and halting (VMEXIT)
can be expensive and slow down system performance.

This patch adds a new vCPU state (vcpu_hashed) which enables the code
to delay CPU kicking until unlock time. Once this state is set,
the new lock holder will set _Q_SLOW_VAL and fill in the hash table
on behalf of the halted queue head vCPU. The original vcpu_halted
state will be used by pv_wait_node() only to differentiate other
queue nodes from the queue head.

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Douglas Hatch <doug.hatch@hp.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott J Norton <scott.norton@hp.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1436647018-49734-2-git-send-email-Waiman.Long@hp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-03 10:57:11 +02:00
Waiman Long
ffffeaf318 locking/qrwlock: Reduce reader/writer to reader lock transfer latency
Currently, a reader will check first to make sure that the writer mode
byte is cleared before incrementing the reader count. That waiting is
not really necessary. It increases the latency in the reader/writer
to reader transition and reduces reader performance.

This patch eliminates that waiting. It also has the side effect
of reducing the chance of writer lock stealing and improving the
fairness of the lock. Using a locking microbenchmark, a 10-thread 5M
locking loop of mostly readers (RW ratio = 10,000:1) has the following
performance numbers on a Haswell-EX box:

        Kernel          Locking Rate (Kops/s)
        ------          ---------------------
        4.1.1               15,063,081
        4.1.1+patch         17,241,552  (+14.4%)
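
A heavily simplified sketch of the reader-side change, assuming a single
counter word whose low byte holds the writer state and whose upper bits count
readers. The names W_MASK, R_BIAS and toy_read_lock_queued are hypothetical,
and the queueing on the wait lock that surrounds this step is omitted:

```c
#include <stdatomic.h>

#define W_MASK 0xffu	/* assumed: low byte holds the writer state */
#define R_BIAS 0x100u	/* assumed: one reader in the upper bits */

/*
 * Reader path for a reader already at the head of the wait queue
 * (queueing itself is not shown).
 */
static void toy_read_lock_queued(atomic_uint *cnts)
{
	/*
	 * Announce the reader immediately instead of first waiting for the
	 * writer byte to clear.
	 */
	atomic_fetch_add_explicit(cnts, R_BIAS, memory_order_relaxed);

	/* Then wait for any current writer to leave before proceeding. */
	while (atomic_load_explicit(cnts, memory_order_acquire) & W_MASK)
		;
}
```

Because the reader count is already non-zero while the reader waits, a writer
arriving later sees the reader and has less opportunity to steal the lock.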

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Douglas Hatch <doug.hatch@hp.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott J Norton <scott.norton@hp.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Link: http://lkml.kernel.org/r/1436459543-29126-2-git-send-email-Waiman.Long@hp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-03 10:57:10 +02:00
Will Deacon
3b3fdf10a8 locking/pvqspinlock: Order pv_unhash() after cmpxchg() on unlock slowpath
When we unlock in __pv_queued_spin_unlock(), a failed cmpxchg() on the lock
value indicates that we need to take the slow-path and unhash the
corresponding node blocked on the lock.

Since a failed cmpxchg() does not provide any memory-ordering guarantees,
it is possible that the node data could be read before the cmpxchg() on
weakly-ordered architectures and therefore return a stale value, leading
to hash corruption and/or a BUG().

This patch adds an smp_rmb() following the failed cmpxchg operation, so
that the unhashing is ordered after the lock has been checked.
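
The ordering problem can be approximated in userspace C11, with an acquire
fence standing in for the read barrier. Everything named below (toy_pv_lock,
toy_pv_unlock, LOCKED_VAL, SLOW_VAL, node_data) is a hypothetical
simplification, not the kernel's data structures:

```c
#include <stdatomic.h>
#include <stdbool.h>

#define LOCKED_VAL 1
#define SLOW_VAL   3

struct toy_pv_lock {
	atomic_int locked;
	int node_data;	/* stands in for the hashed node, written before SLOW_VAL is set */
};

/* Returns true on the fast path; on the slow path, copies node_data to *out. */
static bool toy_pv_unlock(struct toy_pv_lock *l, int *out)
{
	int expected = LOCKED_VAL;

	if (atomic_compare_exchange_strong_explicit(&l->locked, &expected, 0,
						    memory_order_release,
						    memory_order_relaxed))
		return true;

	/*
	 * The failed compare-exchange above is only a relaxed load. Without
	 * this fence the read of node_data could be satisfied before it,
	 * returning a stale value.
	 */
	atomic_thread_fence(memory_order_acquire);
	*out = l->node_data;

	atomic_store_explicit(&l->locked, 0, memory_order_release);
	return false;
}
```

The fence only helps if the side that sets SLOW_VAL publishes node_data with
release ordering first; that half of the pairing is assumed here.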

Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
[ Added more comments]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by:  Waiman Long <Waiman.Long@hp.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Steve Capper <Steve.Capper@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20150713155830.GL2632@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-03 10:57:09 +02:00
Peter Zijlstra
0b792bf519 locking: Clean up pvqspinlock warning
 - Rename the on-stack variable to match the data structure variable,

 - place the cmpxchg back under the comment that explains it,

 - clean up the WARN() statement to avoid superfluous conditionals
   and line-breaks.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <Waiman.Long@hp.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-03 10:57:08 +02:00
Ingo Molnar
3a7651e683 Linux 4.2-rc5

Merge branch 'locking/urgent', tag 'v4.2-rc5' into locking/core, to pick up fixes before applying new changes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-03 10:52:25 +02:00