-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAljuA8EACgkQONu9yGCS
aT5smg//fcD0laNCo+dhbbadB2utsxnDRD0diRusmvJfmRYXysW0amxbdvxRI5+t
bVhGRRaSr+XIpmUYC3p7QHbJ3/ct1Ikee3aK1yyTNwyd8/EGhl++1F7nnQ7FU5nb
iGV09kDvddsX9SbZqkPyB1yosXfzQbSu5G5eQX+lqHsXU9gCLdmaq73NQBygSUq8
EVQivUvLlvRz8zQGKA5hUqz71G8V1mLmc2b1s9r6e5mUuPXBM+UdxbvlLA+iOFRT
WuPTU8xNlFj55CckaGGwLTXSfIYmPl8UCgSdvOTo/TPbBEE2TIaQGn/0jvuqVns7
sDs9s9c3rNWVMc0KMZPJ6b7WIuGBgiDjSFGu2hqqNvG+X33s6qCvmnq2ZqLSVxs/
iXqKr8eC1YP9Sr6okhdMbUcS8jqqD99YDvH94ulvfC3nx9WvMS/2JY7SBbdh4nyN
Jb4j3BeS4C4TXRtWuPo7ks3PbRj8mvrpKdAJ74zoKZNcjXd8PvtZem2P9UzYM5K9
9PS4T0Ne5eYHbOehWMC4t95Ijl/mYSKYCygltl2Fer29gEMGCJ4dGt3evfyaFfFZ
2l43A+WSeYdzQRsuPnFN/oMr/Q4o1U1+ZC5HCe/1Qx/FyfSonw5/hagVWzR6IxyJ
LsbwmxQrZrZRy3vT4gBnoEe7xdwUgenuIoeGMJfjgpLaQiC0osU=
=00n+
-----END PGP SIGNATURE-----
Merge 4.4.61 into android-4.4
Changes in 4.4.61:
drm/vmwgfx: Type-check lookups of fence objects
drm/vmwgfx: NULL pointer dereference in vmw_surface_define_ioctl()
drm/vmwgfx: avoid calling vzalloc with a 0 size in vmw_get_cap_3d_ioctl()
drm/ttm, drm/vmwgfx: Relax permission checking when opening surfaces
drm/vmwgfx: Remove getparam error message
drm/vmwgfx: fix integer overflow in vmw_surface_define_ioctl()
sysfs: be careful of error returns from ops->show()
staging: android: ashmem: lseek failed due to no FMODE_LSEEK.
arm/arm64: KVM: Take mmap_sem in stage2_unmap_vm
arm/arm64: KVM: Take mmap_sem in kvm_arch_prepare_memory_region
iio: bmg160: reset chip when probing
Reset TreeId to zero on SMB2 TREE_CONNECT
ptrace: fix PTRACE_LISTEN race corrupting task->state
ring-buffer: Fix return value check in test_ringbuffer()
metag/usercopy: Drop unused macros
metag/usercopy: Fix alignment error checking
metag/usercopy: Add early abort to copy_to_user
metag/usercopy: Zero rest of buffer from copy_from_user
metag/usercopy: Set flags before ADDZ
metag/usercopy: Fix src fixup in from user rapf loops
metag/usercopy: Add missing fixups
powerpc/mm: Add missing global TLB invalidate if cxl is active
powerpc: Don't try to fix up misaligned load-with-reservation instructions
nios2: reserve boot memory for device tree
s390/decompressor: fix initrd corruption caused by bss clear
s390/uaccess: get_user() should zero on failure (again)
MIPS: Force o32 fp64 support on 32bit MIPS64r6 kernels
MIPS: ralink: Fix typos in rt3883 pinctrl
MIPS: End spinlocks with .insn
MIPS: Lantiq: fix missing xbar kernel panic
MIPS: Flush wrong invalid FTLB entry for huge page
mm/mempolicy.c: fix error handling in set_mempolicy and mbind.
Linux 4.4.61
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit 62277de758b155dc04b78f195a1cb5208c37b2df upstream.
In case of error, the function kthread_run() returns ERR_PTR()
and never returns NULL. The NULL test in the return value check
should be replaced with IS_ERR().
Link: http://lkml.kernel.org/r/1466184839-14927-1-git-send-email-weiyj_lk@163.com
Fixes: 6c43e554a ("ring-buffer: Add ring buffer startup selftest")
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 5402e97af667e35e54177af8f6575518bf251d51 upstream.
In PT_SEIZED + LISTEN mode STOP/CONT signals cause a wakeup against
__TASK_TRACED. If this races with the ptrace_unfreeze_traced at the end
of a PTRACE_LISTEN, this can wake the task /after/ the check against
__TASK_TRACED, but before the reset of state to TASK_TRACED. This
causes it to instead clobber TASK_WAKING, allowing a subsequent wakeup
against TRACED while the task is still on the rq wake_list, corrupting
it.
Oleg said:
"The kernel can crash or this can lead to other hard-to-debug problems.
In short, "task->state = TASK_TRACED" in ptrace_unfreeze_traced()
assumes that nobody else can wake it up, but PTRACE_LISTEN breaks the
contract. Obviusly it is very wrong to manipulate task->state if this
task is already running, or WAKING, or it sleeps again"
[akpm@linux-foundation.org: coding-style fixes]
Fixes: 9899d11f ("ptrace: ensure arch_ptrace/ptrace_request can never race with SIGKILL")
Link: http://lkml.kernel.org/r/xm26y3vfhmkp.fsf_-_@bsegall-linux.mtv.corp.google.com
Signed-off-by: Ben Segall <bsegall@google.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Changes in 4.4.60:
libceph: force GFP_NOIO for socket allocations
xen/setup: Don't relocate p2m over existing one
scsi: mpt3sas: fix hang on ata passthrough commands
scsi: sg: check length passed to SG_NEXT_CMD_LEN
scsi: libsas: fix ata xfer length
ALSA: seq: Fix race during FIFO resize
ALSA: hda - fix a problem for lineout on a Dell AIO machine
ASoC: atmel-classd: fix audio clock rate
ACPI: Fix incompatibility with mcount-based function graph tracing
ACPI: Do not create a platform_device for IOAPIC/IOxAPIC
tty/serial: atmel: fix race condition (TX+DMA)
tty/serial: atmel: fix TX path in atmel_console_write()
USB: fix linked-list corruption in rh_call_control()
KVM: x86: clear bus pointer when destroyed
drm/radeon: Override fpfn for all VRAM placements in radeon_evict_flags
mm, hugetlb: use pte_present() instead of pmd_present() in follow_huge_pmd()
MIPS: Lantiq: Fix cascaded IRQ setup
rtc: s35390a: fix reading out alarm
rtc: s35390a: make sure all members in the output are set
rtc: s35390a: implement reset routine as suggested by the reference
rtc: s35390a: improve irq handling
KVM: kvm_io_bus_unregister_dev() should never fail
power: reset: at91-poweroff: timely shutdown LPDDR memories
blk: improve order of bio handling in generic_make_request()
blk: Ensure users for current->bio_list can see the full list.
padata: avoid race in reordering
Linux 4.4.60
Change-Id: I705c78ccae62ca59f922164085e7ca03ad4ecc6b
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
(cherry picked from commit ea00f4f4f00cc2bc3b63ad512a4e6df3b20832b9)
This makes pm notifier PREPARE/POST symmetrical: if PREPARE
fails, we will only undo what ever happened on PREPARE.
It fixes the unbalanced CPU hotplug enable in CPU PM notifier.
Change-Id: I01dce3cc95c5d6b8913b7b6be301f2909258c745
Signed-off-by: Lianwei Wang <lianwei.wang@gmail.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
commit de5540d088fe97ad583cc7d396586437b32149a5 upstream.
Under extremely heavy uses of padata, crashes occur, and with list
debugging turned on, this happens instead:
[87487.298728] WARNING: CPU: 1 PID: 882 at lib/list_debug.c:33
__list_add+0xae/0x130
[87487.301868] list_add corruption. prev->next should be next
(ffffb17abfc043d0), but was ffff8dba70872c80. (prev=ffff8dba70872b00).
[87487.339011] [<ffffffff9a53d075>] dump_stack+0x68/0xa3
[87487.342198] [<ffffffff99e119a1>] ? console_unlock+0x281/0x6d0
[87487.345364] [<ffffffff99d6b91f>] __warn+0xff/0x140
[87487.348513] [<ffffffff99d6b9aa>] warn_slowpath_fmt+0x4a/0x50
[87487.351659] [<ffffffff9a58b5de>] __list_add+0xae/0x130
[87487.354772] [<ffffffff9add5094>] ? _raw_spin_lock+0x64/0x70
[87487.357915] [<ffffffff99eefd66>] padata_reorder+0x1e6/0x420
[87487.361084] [<ffffffff99ef0055>] padata_do_serial+0xa5/0x120
padata_reorder calls list_add_tail with the list to which its adding
locked, which seems correct:
spin_lock(&squeue->serial.lock);
list_add_tail(&padata->list, &squeue->serial.list);
spin_unlock(&squeue->serial.lock);
This therefore leaves only place where such inconsistency could occur:
if padata->list is added at the same time on two different threads.
This pdata pointer comes from the function call to
padata_get_next(pd), which has in it the following block:
next_queue = per_cpu_ptr(pd->pqueue, cpu);
padata = NULL;
reorder = &next_queue->reorder;
if (!list_empty(&reorder->list)) {
padata = list_entry(reorder->list.next,
struct padata_priv, list);
spin_lock(&reorder->lock);
list_del_init(&padata->list);
atomic_dec(&pd->reorder_objects);
spin_unlock(&reorder->lock);
pd->processed++;
goto out;
}
out:
return padata;
I strongly suspect that the problem here is that two threads can race
on reorder list. Even though the deletion is locked, call to
list_entry is not locked, which means it's feasible that two threads
pick up the same padata object and subsequently call list_add_tail on
them at the same time. The fix is thus be hoist that lock outside of
that block.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Changes in 4.4.59:
xfrm: policy: init locks early
xfrm_user: validate XFRM_MSG_NEWAE XFRMA_REPLAY_ESN_VAL replay_window
xfrm_user: validate XFRM_MSG_NEWAE incoming ESN size harder
virtio_balloon: init 1st buffer in stats vq
pinctrl: qcom: Don't clear status bit on irq_unmask
c6x/ptrace: Remove useless PTRACE_SETREGSET implementation
h8300/ptrace: Fix incorrect register transfer count
mips/ptrace: Preserve previous registers for short regset write
sparc/ptrace: Preserve previous registers for short regset write
metag/ptrace: Preserve previous registers for short regset write
metag/ptrace: Provide default TXSTATUS for short NT_PRSTATUS
metag/ptrace: Reject partial NT_METAG_RPIPE writes
fscrypt: remove broken support for detecting keyring key revocation
sched/rt: Add a missing rescheduling point
Linux 4.4.59
Change-Id: Ifa35307b133cbf29d0a0084bb78a7b0436182b53
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit 619bd4a71874a8fd78eb6ccf9f272c5e98bcc7b7 upstream.
Since the change in commit:
fd7a4bed18 ("sched, rt: Convert switched_{from, to}_rt() / prio_changed_rt() to balance callbacks")
... we don't reschedule a task under certain circumstances:
Lets say task-A, SCHED_OTHER, is running on CPU0 (and it may run only on
CPU0) and holds a PI lock. This task is removed from the CPU because it
used up its time slice and another SCHED_OTHER task is running. Task-B on
CPU1 runs at RT priority and asks for the lock owned by task-A. This
results in a priority boost for task-A. Task-B goes to sleep until the
lock has been made available. Task-A is already runnable (but not active),
so it receives no wake up.
The reality now is that task-A gets on the CPU once the scheduler decides
to remove the current task despite the fact that a high priority task is
enqueued and waiting. This may take a long time.
The desired behaviour is that CPU0 immediately reschedules after the
priority boost which made task-A the task with the lowest priority.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: fd7a4bed18 ("sched, rt: Convert switched_{from, to}_rt() prio_changed_rt() to balance callbacks")
Link: http://lkml.kernel.org/r/20170124144006.29821-1-bigeasy@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAljXlGkACgkQONu9yGCS
aT6/mw/9G7QpBoLEwnQbw2NVeboOiM0E9iejUkwsZQzlWspREh43qW0x5Nwk9rxl
y+OAgiYzF6z2hxV6hHNaswEYdIzOBkSjMq2Xbjmjrbj3H8sv5GWT8yD9Cxmaoerx
oBJ21Pe7tMK5IQnThOLRef8ZVtCLKPlr789ifCzg7iuRUnzCdV2eyrthzgkfmt4y
rSHjoSGji1RaC9O7/7DmBvQAosfzr/eSopHz0cbLWLS17OfJ+Xa7+6xb42uzENq6
3mZUCyT0kg8Abz3e9E2wAmKyODkGnX7fPl97Mop5vwflrZTajWMqeCTi75SMIOgj
TONSTi5NIASjS9AKB/UTphXrGEmQV/tU+GaUB3eYqsJQygFQQgllL2S+nLaSQ2u4
LguWDltAfz0mY3/zv5bmf3C7LmpkBxJceaEAMYhsLmJsENsbPO1rRt3plSu9dNGv
f1g3p4xktE2BZMbsKbMZ78CsCe5gYitx/nEzCqpQsqNasw/C99N/I24nAF7g5OOa
Kwo9mY+hjamiqPdiII5rYiPnta/358xITLoLzemLbgjtfuLC5NGO3SppUZvW5DXW
bmn1MwChSqdNRGLeOpdlQ7lrE4DFUtIzA78WHdj7jsJgUpJGFKyZSbhAhXPX3ryV
Jqcngw/eSRtrkU6P7ZpZzFVUun98eLpIfbKgR/UMROjZIGmCrlA=
=sriX
-----END PGP SIGNATURE-----
Merge 4.4.57 to android-4.4
Changes in 4.4.57:
usb: core: hub: hub_port_init lock controller instead of bus
USB: don't free bandwidth_mutex too early
crypto: ghash-clmulni - Fix load failure
crypto: cryptd - Assign statesize properly
crypto: mcryptd - Fix load failure
cxlflash: Increase cmd_per_lun for better throughput
ACPI / video: skip evaluating _DOD when it does not exist
pinctrl: cherryview: Do not mask all interrupts in probe
Drivers: hv: balloon: don't crash when memory is added in non-sorted order
Drivers: hv: avoid vfree() on crash
xen/qspinlock: Don't kick CPU if IRQ is not initialized
KVM: PPC: Book3S PR: Fix illegal opcode emulation
s390/pci: fix use after free in dma_init
drm/amdgpu: add missing irq.h include
tpm_tis: Use devm_free_irq not free_irq
hv_netvsc: use skb_get_hash() instead of a homegrown implementation
kernek/fork.c: allocate idle task for a CPU always on its local node
give up on gcc ilog2() constant optimizations
perf/core: Fix event inheritance on fork()
cpufreq: Fix and clean up show_cpuinfo_cur_freq()
powerpc/boot: Fix zImage TOC alignment
md/raid1/10: fix potential deadlock
target/pscsi: Fix TYPE_TAPE + TYPE_MEDIMUM_CHANGER export
scsi: lpfc: Add shutdown method for kexec
scsi: libiscsi: add lock around task lists to fix list corruption regression
target: Fix VERIFY_16 handling in sbc_parse_cdb
isdn/gigaset: fix NULL-deref at probe
gfs2: Avoid alignment hole in struct lm_lockname
percpu: acquire pcpu_lock when updating pcpu_nr_empty_pop_pages
ext4: fix fencepost in s_first_meta_bg validation
Linux 4.4.57
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit e7cc4865f0f31698ef2f7aac01a50e78968985b7 upstream.
While hunting for clues to a use-after-free, Oleg spotted that
perf_event_init_context() can loose an error value with the result
that fork() can succeed even though we did not fully inherit the perf
event context.
Spotted-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: oleg@redhat.com
Fixes: 889ff01506 ("perf/core: Split context's event group list into pinned and non-pinned lists")
Link: http://lkml.kernel.org/r/20170316125823.190342547@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 725fc629ff2545b061407305ae51016c9f928fce upstream.
Linux preallocates the task structs of the idle tasks for all possible
CPUs. This currently means they all end up on node 0. This also
implies that the cache line of MWAIT, which is around the flags field in
the task struct, are all located in node 0.
We see a noticeable performance improvement on Knights Landing CPUs when
the cache lines used for MWAIT are located in the local nodes of the
CPUs using them. I would expect this to give a (likely slight)
improvement on other systems too.
The patch implements placing the idle task in the node of its CPUs, by
passing the right target node to copy_process()
[akpm@linux-foundation.org: use NUMA_NO_NODE, not a bare -1]
Link: http://lkml.kernel.org/r/1463492694-15833-1-git-send-email-andi@firstfloor.org
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit c236c8e95a3d395b0494e7108f0d41cf36ec107c upstream.
While working on the futex code, I stumbled over this potential
use-after-free scenario. Dmitry triggered it later with syzkaller.
pi_mutex is a pointer into pi_state, which we drop the reference on in
unqueue_me_pi(). So any access to that pointer after that is bad.
Since other sites already do rt_mutex_unlock() with hb->lock held, see
for example futex_lock_pi(), simply move the unlock before
unqueue_me_pi().
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Darren Hart <dvhart@linux.intel.com>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: dvhart@infradead.org
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170304093558.801744246@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
EAS uses "const struct sched_group_energy * const" fairly consistently.
But a couple of places swap the "*" and second "const", making the
pointer mutable.
In the case of struct sched_group, "* const" would have been an error,
since init_sched_energy() writes to sd->groups->sge.
Change-Id: Ic6a8fcf99e65c0f25d9cc55c32625ef3ca5c9aca
Signed-off-by: Greg Hackmann <ghackmann@google.com>
commit 907565337ebf998a68cb5c5b2174ce5e5da065eb upstream.
Userspace applications should be allowed to expect the membarrier system
call with MEMBARRIER_CMD_SHARED command to issue memory barriers on
nohz_full CPUs, but synchronize_sched() does not take those into
account.
Given that we do not want unrelated processes to be able to affect
real-time sensitive nohz_full CPUs, simply return ENOSYS when membarrier
is invoked on a kernel with enabled nohz_full CPUs.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Josh Triplett <josh@joshtriplett.org>
CC: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Rik van Riel <riel@redhat.com>
Acked-by: Lai Jiangshan <jiangshanlai@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit fc98c3c8c9dcafd67adcce69e6ce3191d5306c9c upstream.
Use rcuidle console tracepoint because, apparently, it may be issued
from an idle CPU:
hw-breakpoint: Failed to enable monitor mode on CPU 0.
hw-breakpoint: CPU 0 failed to disable vector catch
===============================
[ ERR: suspicious RCU usage. ]
4.10.0-rc8-next-20170215+ #119 Not tainted
-------------------------------
./include/trace/events/printk.h:32 suspicious rcu_dereference_check() usage!
other info that might help us debug this:
RCU used illegally from idle CPU!
rcu_scheduler_active = 2, debug_locks = 0
RCU used illegally from extended quiescent state!
2 locks held by swapper/0/0:
#0: (cpu_pm_notifier_lock){......}, at: [<c0237e2c>] cpu_pm_exit+0x10/0x54
#1: (console_lock){+.+.+.}, at: [<c01ab350>] vprintk_emit+0x264/0x474
stack backtrace:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.10.0-rc8-next-20170215+ #119
Hardware name: Generic OMAP4 (Flattened Device Tree)
console_unlock
vprintk_emit
vprintk_default
printk
reset_ctrl_regs
dbg_cpu_pm_notify
notifier_call_chain
cpu_pm_exit
omap_enter_idle_coupled
cpuidle_enter_state
cpuidle_enter_state_coupled
do_idle
cpu_startup_entry
start_kernel
This RCU warning, however, is suppressed by lockdep_off() in printk().
lockdep_off() increments the ->lockdep_recursion counter and thus
disables RCU_LOCKDEP_WARN() and debug_lockdep_rcu_enabled(), which want
lockdep to be enabled "current->lockdep_recursion == 0".
Link: http://lkml.kernel.org/r/20170217015932.11898-1-sergey.senozhatsky@gmail.com
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Reported-by: Tony Lindgren <tony@atomide.com>
Tested-by: Tony Lindgren <tony@atomide.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Lindgren <tony@atomide.com>
Cc: Russell King <rmk@armlinux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 25f71d1c3e98ef0e52371746220d66458eac75bc upstream.
The UEVENT user mode helper is enabled before the initcalls are executed
and is available when the root filesystem has been mounted.
The user mode helper is triggered by device init calls and the executable
might use the futex syscall.
futex_init() is marked __initcall which maps to device_initcall, but there
is no guarantee that futex_init() is invoked _before_ the first device init
call which triggers the UEVENT user mode helper.
If the user mode helper uses the futex syscall before futex_init() then the
syscall crashes with a NULL pointer dereference because the futex subsystem
has not been initialized yet.
Move futex_init() to core_initcall so futexes are initialized before the
root filesystem is mounted and the usermode helper becomes available.
[ tglx: Rewrote changelog ]
Signed-off-by: Yang Yang <yang.yang29@zte.com.cn>
Cc: jiang.biao2@zte.com.cn
Cc: jiang.zhengxiong@zte.com.cn
Cc: zhong.weidong@zte.com.cn
Cc: deng.huali@zte.com.cn
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1483085875-6130-1-git-send-email-yang.yang29@zte.com.cn
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 0b3589be9b98994ce3d5aeca52445d1f5627c4ba upstream.
Andres reported that MMAP2 records for anonymous memory always have
their protection field 0.
Turns out, someone daft put the prot/flags generation code in the file
branch, leaving them unset for anonymous memory.
Reported-by: Andres Freund <andres@anarazel.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Don Zickus <dzickus@redhat.com
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@gmail.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@kernel.org
Cc: anton@ozlabs.org
Cc: namhyung@kernel.org
Fixes: f972eb63b1 ("perf: Pass protection and flags bits through mmap2 interface")
Link: http://lkml.kernel.org/r/20170126221508.GF6536@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit ff9f8a7cf935468a94d9927c68b00daae701667e upstream.
We perform the conversion between kernel jiffies and ms only when
exporting kernel value to user space.
We need to do the opposite operation when value is written by user.
Only matters when HZ != 1000
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit b6416e61012429e0277bd15a229222fd17afc1c1 upstream.
Modules that use static_key_deferred need a way to synchronize with
any delayed work that is still pending when the module is unloaded.
Introduce static_key_deferred_flush() which flushes any pending
jump label updates.
Signed-off-by: David Matlack <dmatlack@google.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit f931ab479dd24cf7a2c6e2df19778406892591fb upstream.
Both arch_add_memory() and arch_remove_memory() expect a single threaded
context.
For example, arch/x86/mm/init_64.c::kernel_physical_mapping_init() does
not hold any locks over this check and branch:
if (pgd_val(*pgd)) {
pud = (pud_t *)pgd_page_vaddr(*pgd);
paddr_last = phys_pud_init(pud, __pa(vaddr),
__pa(vaddr_end),
page_size_mask);
continue;
}
pud = alloc_low_page();
paddr_last = phys_pud_init(pud, __pa(vaddr), __pa(vaddr_end),
page_size_mask);
The result is that two threads calling devm_memremap_pages()
simultaneously can end up colliding on pgd initialization. This leads
to crash signatures like the following where the loser of the race
initializes the wrong pgd entry:
BUG: unable to handle kernel paging request at ffff888ebfff0000
IP: memcpy_erms+0x6/0x10
PGD 2f8e8fc067 PUD 0 /* <---- Invalid PUD */
Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
CPU: 54 PID: 3818 Comm: systemd-udevd Not tainted 4.6.7+ #13
task: ffff882fac290040 ti: ffff882f887a4000 task.ti: ffff882f887a4000
RIP: memcpy_erms+0x6/0x10
[..]
Call Trace:
? pmem_do_bvec+0x205/0x370 [nd_pmem]
? blk_queue_enter+0x3a/0x280
pmem_rw_page+0x38/0x80 [nd_pmem]
bdev_read_page+0x84/0xb0
Hold the standard memory hotplug mutex over calls to
arch_{add,remove}_memory().
Fixes: 41e94a8513 ("add devm_memremap_pages")
Link: http://lkml.kernel.org/r/148357647831.9498.12606007370121652979.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The current sched_load_avg_cpu event traces the load for any cfs_rq that is
updated. This is not representative of the CPU load - instead we should only
trace this event when the cfs_rq being updated is in the root_task_group.
Change-Id: I345c2f13f6b5718cb4a89beb247f7887ce97ed6b
Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
update_cfs_rq_load_avg is called from update_blocked_averages without triggering
the sched_load_avg_cpu event. Move the event trigger to inside
update_cfs_rq_load_avg to avoid this missing event.
Change-Id: I6c4f66f687a644e4e7f798db122d28a8f5919b7b
Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
commit c1a9eeb938b5433947e5ea22f89baff3182e7075 upstream.
When a disfunctional timer, e.g. dummy timer, is installed, the tick core
tries to setup the broadcast timer.
If no broadcast device is installed, the kernel crashes with a NULL pointer
dereference in tick_broadcast_setup_oneshot() because the function has no
sanity check.
Reported-by: Mason <slash.tmp@free.fr>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Richard Cochran <rcochran@linutronix.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>,
Cc: Sebastian Frias <sf84@laposte.net>
Cc: Thibaud Cornic <thibaud_cornic@sigmadesigns.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Link: http://lkml.kernel.org/r/1147ef90-7877-e4d2-bb2b-5c4fa8d3144b@free.fr
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[resolves a messed up backport, so no matching upstream commit]
The backport of upstream commit 777c6e0daebb ("hotplug: Make
register and unregister notifier API symmetric") to linux-4.4.y
introduced a harmless warning in 'allnoconfig' builds as spotted by
kernelci.org:
kernel/cpu.c:226:13: warning: 'cpu_notify_nofail' defined but not used [-Wunused-function]
So far, this is the only stable tree that is affected, as linux-4.6 and
higher contain commit 984581728eb4 ("cpu/hotplug: Split out cpu down functions")
that makes the function used in all configurations, while older longterm
releases so far don't seem to have a backport of 777c6e0daebb.
The fix for the warning is trivial: move the unused function back
into the #ifdef section where it was before.
Link: https://kernelci.org/build/id/586fcacb59b514049ef6c3aa/logs/
Fixes: 1c0f4e0ebb ("hotplug: Make register and unregister notifier API symmetric") in v4.4.y
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 794de08a16cf1fc1bf785dc48f66d36218cf6d88 upstream.
Both the wakeup and irqsoff tracers can use the function graph tracer when
the display-graph option is set. The problem is that they ignore the notrace
file, and record the entry of functions that would be ignored by the
function_graph tracer. This causes the trace->depth to be recorded into the
ring buffer. The set_graph_notrace uses a trick by adding a large negative
number to the trace->depth when a graph function is to be ignored.
On trace output, the graph function uses the depth to record a stack of
functions. But since the depth is negative, it accesses the array with a
negative number and causes an out of bounds access that can cause a kernel
oops or corrupt data.
Have the print functions handle cases where a tracer still records functions
even when they are in set_graph_notrace.
Also add warnings if the depth is below zero before accessing the array.
Note, the function graph logic will still prevent the return of these
functions from being recorded, which means that they will be left hanging
without a return. For example:
# echo '*spin*' > set_graph_notrace
# echo 1 > options/display-graph
# echo wakeup > current_tracer
# cat trace
[...]
_raw_spin_lock() {
preempt_count_add() {
do_raw_spin_lock() {
update_rq_clock();
Where it should look like:
_raw_spin_lock() {
preempt_count_add();
do_raw_spin_lock();
}
update_rq_clock();
Cc: Namhyung Kim <namhyung.kim@lge.com>
Fixes: 29ad23b004 ("ftrace: Add set_graph_notrace filter")
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 9c1645727b8fa90d07256fdfcc45bf831242a3ab upstream.
The clocksource delta to nanoseconds conversion is using signed math, but
the delta is unsigned. This makes the conversion space smaller than
necessary and in case of a multiplication overflow the conversion can
become negative. The conversion is done with scaled math:
s64 nsec_delta = ((s64)clkdelta * clk->mult) >> clk->shift;
Shifting a signed integer right obvioulsy preserves the sign, which has
interesting consequences:
- Time jumps backwards
- __iter_div_u64_rem() which is used in one of the calling code pathes
will take forever to piecewise calculate the seconds/nanoseconds part.
This has been reported by several people with different scenarios:
David observed that when stopping a VM with a debugger:
"It was essentially the stopped by debugger case. I forget exactly why,
but the guest was being explicitly stopped from outside, it wasn't just
scheduling lag. I think it was something in the vicinity of 10 minutes
stopped."
When lifting the stop the machine went dead.
The stopped by debugger case is not really interesting, but nevertheless it
would be a good thing not to die completely.
But this was also observed on a live system by Liav:
"When the OS is too overloaded, delta will get a high enough value for the
msb of the sum delta * tkr->mult + tkr->xtime_nsec to be set, and so
after the shift the nsec variable will gain a value similar to
0xffffffffff000000."
Unfortunately this has been reintroduced recently with commit 6bd58f09e1d8
("time: Add cycles to nanoseconds translation"). It had been fixed a year
ago already in commit 35a4933a8959 ("time: Avoid signed overflow in
timekeeping_get_ns()").
Though it's not surprising that the issue has been reintroduced because the
function itself and the whole call chain uses s64 for the result and the
propagation of it. The change in this recent commit is subtle:
s64 nsec;
- nsec = (d * m + n) >> s:
+ nsec = d * m + n;
+ nsec >>= s;
d being type of cycle_t adds another level of obfuscation.
This wouldn't have happened if the previous change to unsigned computation
would have made the 'nsec' variable u64 right away and a follow up patch
had cleaned up the whole call chain.
There have been patches submitted which basically did a revert of the above
patch leaving everything else unchanged as signed. Back to square one. This
spawned a admittedly pointless discussion about potential users which rely
on the unsigned behaviour until someone pointed out that it had been fixed
before. The changelogs of said patches added further confusion as they made
finally false claims about the consequences for eventual users which expect
signed results.
Despite delta being cycle_t, aka. u64, it's very well possible to hand in
a signed negative value and the signed computation will happily return the
correct result. But nobody actually sat down and analyzed the code which
was added as user after the propably unintended signed conversion.
Though in sensitive code like this it's better to analyze it proper and
make sure that nothing relies on this than hunting the subtle wreckage half
a year later. After analyzing all call chains it stands that no caller can
hand in a negative value (which actually would work due to the s64 cast)
and rely on the signed math to do the right thing.
Change the conversion function to unsigned math. The conversion of all call
chains is done in a follow up patch.
This solves the starvation issue, which was caused by the negative result,
but it does not solve the underlying problem. It merily procrastinates
it. When the timekeeper update is deferred long enough that the unsigned
multiplication overflows, then time going backwards is observable again.
It does neither solve the issue of clocksources with a small counter width
which will wrap around possibly several times and cause random time stamps
to be generated. But those are usually not found on systems used for
virtualization, so this is likely a non issue.
I took the liberty to claim authorship for this simply because
analyzing all callsites and writing the changelog took substantially
more time than just making the simple s/s64/u64/ change and ignore the
rest.
Fixes: 6bd58f09e1d8 ("time: Add cycles to nanoseconds translation")
Reported-by: David Gibson <david@gibson.dropbear.id.au>
Reported-by: Liav Rehana <liavr@mellanox.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Parit Bhargava <prarit@redhat.com>
Cc: Laurent Vivier <lvivier@redhat.com>
Cc: "Christopher S. Hall" <christopher.s.hall@intel.com>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: John Stultz <john.stultz@linaro.org>
Link: http://lkml.kernel.org/r/20161208204228.688545601@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 2d13bb6494c807bcf3f78af0e96c0b8615a94385 upstream.
We've got a delay loop waiting for secondary CPUs. That loop uses
loops_per_jiffy. However, loops_per_jiffy doesn't actually mean how
many tight loops make up a jiffy on all architectures. It is quite
common to see things like this in the boot log:
Calibrating delay loop (skipped), value calculated using timer
frequency.. 48.00 BogoMIPS (lpj=24000)
In my case I was seeing lots of cases where other CPUs timed out
entering the debugger only to print their stack crawls shortly after the
kdb> prompt was written.
Elsewhere in kgdb we already use udelay(), so that should be safe enough
to use to implement our timeout. We'll delay 1 ms for 1000 times, which
should give us a full second of delay (just like the old code wanted)
but allow us to notice that we're done every 1 ms.
[akpm@linux-foundation.org: simplifications, per Daniel]
Link: http://lkml.kernel.org/r/1477091361-2039-1-git-send-email-dianders@chromium.org
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Daniel Thompson <daniel.thompson@linaro.org>
Cc: Jason Wessel <jason.wessel@windriver.com>
Cc: Brian Norris <briannorris@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 4d1f0fb096aedea7bb5489af93498a82e467c480 upstream.
NMI handler doesn't call set_irq_regs(), it's set only by normal IRQ.
Thus get_irq_regs() returns NULL or stale registers snapshot with IP/SP
pointing to the code interrupted by IRQ which was interrupted by NMI.
NULL isn't a problem: in this case watchdog calls dump_stack() and
prints full stack trace including NMI. But if we're stuck in IRQ
handler then NMI watchlog will print stack trace without IRQ part at
all.
This patch uses registers snapshot passed into NMI handler as arguments:
these registers point exactly to the instruction interrupted by NMI.
Fixes: 55537871ef ("kernel/watchdog.c: perform all-CPU backtrace in case of hard lockup")
Link: http://lkml.kernel.org/r/146771764784.86724.6006627197118544150.stgit@buzz
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Ulrich Obergfell <uobergfe@redhat.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit f84df2a6f268de584a201e8911384a2d244876e3 upstream.
When the user namespace support was merged the need to prevent
ptrace from revealing the contents of an unreadable executable
was overlooked.
Correct this oversight by ensuring that the executed file
or files are in mm->user_ns, by adjusting mm->user_ns.
Use the new function privileged_wrt_inode_uidgid to see if
the executable is a member of the user namespace, and as such
if having CAP_SYS_PTRACE in the user namespace should allow
tracing the executable. If not update mm->user_ns to
the parent user namespace until an appropriate parent is found.
Reported-by: Jann Horn <jann@thejh.net>
Fixes: 9e4a36ece6 ("userns: Fail exec for suid and sgid binaries with ids outside our user namespace.")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 64b875f7ac8a5d60a4e191479299e931ee949b67 upstream.
When the flag PT_PTRACE_CAP was added the PTRACE_TRACEME path was
overlooked. This can result in incorrect behavior when an application
like strace traces an exec of a setuid executable.
Further PT_PTRACE_CAP does not have enough information for making good
security decisions as it does not report which user namespace the
capability is in. This has already allowed one mistake through
insufficient granulariy.
I found this issue when I was testing another corner case of exec and
discovered that I could not get strace to set PT_PTRACE_CAP even when
running strace as root with a full set of caps.
This change fixes the above issue with strace allowing stracing as
root a setuid executable without disabling setuid. More fundamentaly
this change allows what is allowable at all times, by using the correct
information in it's decision.
Fixes: 4214e42f96d4 ("v2.4.9.11 -> v2.4.9.12")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit bfedb589252c01fa505ac9f6f2a3d5d68d707ef4 upstream.
During exec dumpable is cleared if the file that is being executed is
not readable by the user executing the file. A bug in
ptrace_may_access allows reading the file if the executable happens to
enter into a subordinate user namespace (aka clone(CLONE_NEWUSER),
unshare(CLONE_NEWUSER), or setns(fd, CLONE_NEWUSER).
This problem is fixed with only necessary userspace breakage by adding
a user namespace owner to mm_struct, captured at the time of exec, so
it is clear in which user namespace CAP_SYS_PTRACE must be present in
to be able to safely give read permission to the executable.
The function ptrace_may_access is modified to verify that the ptracer
has CAP_SYS_ADMIN in task->mm->user_ns instead of task->cred->user_ns.
This ensures that if the task changes it's cred into a subordinate
user namespace it does not become ptraceable.
The function ptrace_attach is modified to only set PT_PTRACE_CAP when
CAP_SYS_PTRACE is held over task->mm->user_ns. The intent of
PT_PTRACE_CAP is to be a flag to note that whatever permission changes
the task might go through the tracer has sufficient permissions for
it not to be an issue. task->cred->user_ns is always the same
as or descendent of mm->user_ns. Which guarantees that having
CAP_SYS_PTRACE over mm->user_ns is the worst case for the tasks
credentials.
To prevent regressions mm->dumpable and mm->user_ns are not considered
when a task has no mm. As simply failing ptrace_may_attach causes
regressions in privileged applications attempting to read things
such as /proc/<pid>/stat
Acked-by: Kees Cook <keescook@chromium.org>
Tested-by: Cyrill Gorcunov <gorcunov@openvz.org>
Fixes: 8409cca705 ("userns: allow ptrace from non-init user namespaces")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 777c6e0daebb3fcefbbd6f620410a946b07ef6d0 upstream.
Yu Zhao has noticed that __unregister_cpu_notifier only unregisters its
notifiers when HOTPLUG_CPU=y while the registration might succeed even
when HOTPLUG_CPU=n if MODULE is enabled. This means that e.g. zswap
might keep a stale notifier on the list on the manual clean up during
the pool tear down and thus corrupt the list. Resulting in the following
[ 144.964346] BUG: unable to handle kernel paging request at ffff880658a2be78
[ 144.971337] IP: [<ffffffffa290b00b>] raw_notifier_chain_register+0x1b/0x40
<snipped>
[ 145.122628] Call Trace:
[ 145.125086] [<ffffffffa28e5cf8>] __register_cpu_notifier+0x18/0x20
[ 145.131350] [<ffffffffa2a5dd73>] zswap_pool_create+0x273/0x400
[ 145.137268] [<ffffffffa2a5e0fc>] __zswap_param_set+0x1fc/0x300
[ 145.143188] [<ffffffffa2944c1d>] ? trace_hardirqs_on+0xd/0x10
[ 145.149018] [<ffffffffa2908798>] ? kernel_param_lock+0x28/0x30
[ 145.154940] [<ffffffffa2a3e8cf>] ? __might_fault+0x4f/0xa0
[ 145.160511] [<ffffffffa2a5e237>] zswap_compressor_param_set+0x17/0x20
[ 145.167035] [<ffffffffa2908d3c>] param_attr_store+0x5c/0xb0
[ 145.172694] [<ffffffffa290848d>] module_attr_store+0x1d/0x30
[ 145.178443] [<ffffffffa2b2b41f>] sysfs_kf_write+0x4f/0x70
[ 145.183925] [<ffffffffa2b2a5b9>] kernfs_fop_write+0x149/0x180
[ 145.189761] [<ffffffffa2a99248>] __vfs_write+0x18/0x40
[ 145.194982] [<ffffffffa2a9a412>] vfs_write+0xb2/0x1a0
[ 145.200122] [<ffffffffa2a9a732>] SyS_write+0x52/0xa0
[ 145.205177] [<ffffffffa2ff4d97>] entry_SYSCALL_64_fastpath+0x12/0x17
This can be even triggered manually by changing
/sys/module/zswap/parameters/compressor multiple times.
Fix this issue by making unregister APIs symmetric to the register so
there are no surprises.
Fixes: 47e627bc8c ("[PATCH] hotplug: Allow modules to use the cpu hotplug notifiers even if !CONFIG_HOTPLUG_CPU")
Reported-and-tested-by: Yu Zhao <yuzhao@google.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: linux-mm@kvack.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Link: http://lkml.kernel.org/r/20161207135438.4310-1-mhocko@kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 1be5d4fa0af34fb7bafa205aeb59f5c7cc7a089d upstream.
While debugging the rtmutex unlock vs. dequeue race Will suggested to use
READ_ONCE() in rt_mutex_owner() as it might race against the
cmpxchg_release() in unlock_rt_mutex_safe().
Will: "It's a minor thing which will most likely not matter in practice"
Careful search did not unearth an actual problem in todays code, but it's
better to be safe than surprised.
Suggested-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: David Daney <ddaney@caviumnetworks.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20161130210030.431379999@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit dbb26055defd03d59f678cb5f2c992abe05b064a upstream.
David reported a futex/rtmutex state corruption. It's caused by the
following problem:
CPU0 CPU1 CPU2
l->owner=T1
rt_mutex_lock(l)
lock(l->wait_lock)
l->owner = T1 | HAS_WAITERS;
enqueue(T2)
boost()
unlock(l->wait_lock)
schedule()
rt_mutex_lock(l)
lock(l->wait_lock)
l->owner = T1 | HAS_WAITERS;
enqueue(T3)
boost()
unlock(l->wait_lock)
schedule()
signal(->T2) signal(->T3)
lock(l->wait_lock)
dequeue(T2)
deboost()
unlock(l->wait_lock)
lock(l->wait_lock)
dequeue(T3)
===> wait list is now empty
deboost()
unlock(l->wait_lock)
lock(l->wait_lock)
fixup_rt_mutex_waiters()
if (wait_list_empty(l)) {
owner = l->owner & ~HAS_WAITERS;
l->owner = owner
==> l->owner = T1
}
lock(l->wait_lock)
rt_mutex_unlock(l) fixup_rt_mutex_waiters()
if (wait_list_empty(l)) {
owner = l->owner & ~HAS_WAITERS;
cmpxchg(l->owner, T1, NULL)
===> Success (l->owner = NULL)
l->owner = owner
==> l->owner = T1
}
That means the problem is caused by fixup_rt_mutex_waiters() which does the
RMW to clear the waiters bit unconditionally when there are no waiters in
the rtmutexes rbtree.
This can be fatal: A concurrent unlock can release the rtmutex in the
fastpath because the waiters bit is not set. If the cmpxchg() gets in the
middle of the RMW operation then the previous owner, which just unlocked
the rtmutex is set as the owner again when the write takes place after the
successfull cmpxchg().
The solution is rather trivial: verify that the owner member of the rtmutex
has the waiters bit set before clearing it. This does not require a
cmpxchg() or other atomic operations because the waiters bit can only be
set and cleared with the rtmutex wait_lock held. It's also safe against the
fast path unlock attempt. The unlock attempt via cmpxchg() will either see
the bit set and take the slowpath or see the bit cleared and release it
atomically in the fastpath.
It's remarkable that the test program provided by David triggers on ARM64
and MIPS64 really quick, but it refuses to reproduce on x86-64, while the
problem exists there as well. That refusal might explain that this got not
discovered earlier despite the bug existing from day one of the rtmutex
implementation more than 10 years ago.
Thanks to David for meticulously instrumenting the code and providing the
information which allowed to decode this subtle problem.
Reported-by: David Daney <ddaney@caviumnetworks.com>
Tested-by: David Daney <david.daney@cavium.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Fixes: 23f78d4a03 ("[PATCH] pi-futex: rt mutex core")
Link: http://lkml.kernel.org/r/20161130210030.351136722@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>