Android expects system_server to be able to move tasks between different
cgroups/cpusets, but does not want to be running as root. Let's relax
permission check so that processes can move other tasks if they have
CAP_SYS_NICE in the affected task's user namespace.
BUG=b:31790445,chromium:647994
TEST=Boot android container, examine logcat
Change-Id: Ia919c66ab6ed6a6daf7c4cf67feb38b13b1ad09b
Signed-off-by: Dmitry Torokhov <dtor@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/394927
Reviewed-by: Ricky Zhou <rickyz@chromium.org>
The implementation is utterly broken, resulting in all processes being
allows to move tasks between sets (as long as they have access to the
"tasks" attribute), and upstream is heading towards checking only
capability anyway, so let's get rid of this code.
BUG=b:31790445,chromium:647994
TEST=Boot android container, examine logcat
Change-Id: I2f780a5992c34e52a8f2d0b3557fc9d490da2779
Signed-off-by: Dmitry Torokhov <dtor@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/394967
Reviewed-by: Ricky Zhou <rickyz@chromium.org>
Reviewed-by: John Stultz <john.stultz@linaro.org>
commit 6bd58f09e1d8cc6c50a824c00bf0d617919986a1 upstream.
The timekeeping code does not currently provide a way to translate
externally provided clocksource cycles to system time. The cycle count
is always provided by the result clocksource read() method internal to
the timekeeping code. The added function timekeeping_cycles_to_ns()
calculated a nanosecond value from a cycle count that can be added to
tk_read_base.base value yielding the current system time. This allows
clocksource cycle values external to the timekeeping code to provide a
cycle count that can be transformed to system time.
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: kevin.b.stanton@intel.com
Cc: kevin.j.clarke@intel.com
Cc: hpa@zytor.com
Cc: jeffrey.t.kirsher@intel.com
Cc: netdev@vger.kernel.org
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Christopher S. Hall <christopher.s.hall@intel.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 735f2770a770156100f534646158cb58cb8b2939 upstream.
Commit fec1d01152 ("[PATCH] Disable CLONE_CHILD_CLEARTID for abnormal
exit") has caused a subtle regression in nscd which uses
CLONE_CHILD_CLEARTID to clear the nscd_certainly_running flag in the
shared databases, so that the clients are notified when nscd is
restarted. Now, when nscd uses a non-persistent database, clients that
have it mapped keep thinking the database is being updated by nscd, when
in fact nscd has created a new (anonymous) one (for non-persistent
databases it uses an unlinked file as backend).
The original proposal for the CLONE_CHILD_CLEARTID change claimed
(https://lkml.org/lkml/2006/10/25/233):
: The NPTL library uses the CLONE_CHILD_CLEARTID flag on clone() syscalls
: on behalf of pthread_create() library calls. This feature is used to
: request that the kernel clear the thread-id in user space (at an address
: provided in the syscall) when the thread disassociates itself from the
: address space, which is done in mm_release().
:
: Unfortunately, when a multi-threaded process incurs a core dump (such as
: from a SIGSEGV), the core-dumping thread sends SIGKILL signals to all of
: the other threads, which then proceed to clear their user-space tids
: before synchronizing in exit_mm() with the start of core dumping. This
: misrepresents the state of process's address space at the time of the
: SIGSEGV and makes it more difficult for someone to debug NPTL and glibc
: problems (misleading him/her to conclude that the threads had gone away
: before the fault).
:
: The fix below is to simply avoid the CLONE_CHILD_CLEARTID action if a
: core dump has been initiated.
The resulting patch from Roland (https://lkml.org/lkml/2006/10/26/269)
seems to have a larger scope than the original patch asked for. It
seems that limitting the scope of the check to core dumping should work
for SIGSEGV issue describe above.
[Changelog partly based on Andreas' description]
Fixes: fec1d01152 ("[PATCH] Disable CLONE_CHILD_CLEARTID for abnormal exit")
Link: http://lkml.kernel.org/r/1471968749-26173-1-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Tested-by: William Preston <wpreston@suse.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@hack.frob.com>
Cc: Andreas Schwab <schwab@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit e7d316a02f683864a12389f8808570e37fb90aa3 upstream.
We have scripts which write to certain fields on 3.18 kernels but this
seems to be failing on 4.4 kernels. An entry which we write to here is
xfrm_aevent_rseqth which is u32.
echo 4294967295 > /proc/sys/net/core/xfrm_aevent_rseqth
Commit 230633d109 ("kernel/sysctl.c: detect overflows when converting
to int") prevented writing to sysctl entries when integer overflow
occurs. However, this does not apply to unsigned integers.
Heinrich suggested that we introduce a new option to handle 64 bit
limits and set min as 0 and max as UINT_MAX. This might not work as it
leads to issues similar to __do_proc_doulongvec_minmax. Alternatively,
we would need to change the datatype of the entry to 64 bit.
static int __do_proc_doulongvec_minmax(void *data, struct ctl_table
{
i = (unsigned long *) data; //This cast is causing to read beyond the size of data (u32)
vleft = table->maxlen / sizeof(unsigned long); //vleft is 0 because maxlen is sizeof(u32) which is lesser than sizeof(unsigned long) on x86_64.
Introduce a new proc handler proc_douintvec. Individual proc entries
will need to be updated to use the new handler.
[akpm@linux-foundation.org: coding-style fixes]
Fixes: 230633d109 ("kernel/sysctl.c:detect overflows when converting to int")
Link: http://lkml.kernel.org/r/1471479806-5252-1-git-send-email-subashab@codeaurora.org
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit ae6c33ba6e37eea3012fe2640b22400ef3f2d0f3 upstream.
Commit bbeddf52ad ("printk: move braille console support into separate
braille.[ch] files") moved the parsing of braille-related options into
_braille_console_setup(), changing the type of variable str from char*
to char**. In this commit, memcmp(str, "brl,", 4) was correctly updated
to memcmp(*str, "brl,", 4) but not memcmp(str, "brl=", 4).
Update the code to make "brl=" option work again and replace memcmp()
with strncmp() to make the compiler able to detect such an issue.
Fixes: bbeddf52ad ("printk: move braille console support into separate braille.[ch] files")
Link: http://lkml.kernel.org/r/20160823165700.28952-1-nicolas.iooss_linux@m4x.org
Signed-off-by: Nicolas Iooss <nicolas.iooss_linux@m4x.org>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 2c81a6477081966fe80b8c6daa68459bca896774 upstream.
The following commit:
66eb579e66 ("perf: allow for PMU-specific event filtering")
added the pmu::filter_match() callback. This was intended to
avoid HW constraints on events from resulting in extremely
pessimistic scheduling.
However, pmu::filter_match() is only called for the leader of each event
group. When the leader is a SW event, we do not filter the groups, and
may fail at pmu::add() time, and when this happens we'll give up on
scheduling any event groups later in the list until they are rotated
ahead of the failing group.
This can result in extremely sub-optimal event scheduling behaviour,
e.g. if running the following on a big.LITTLE platform:
$ taskset -c 0 ./perf stat \
-e 'a57{context-switches,armv8_cortex_a57/config=0x11/}' \
-e 'a53{context-switches,armv8_cortex_a53/config=0x11/}' \
ls
<not counted> context-switches (0.00%)
<not counted> armv8_cortex_a57/config=0x11/ (0.00%)
24 context-switches (37.36%)
57589154 armv8_cortex_a53/config=0x11/ (37.36%)
Here the 'a53' event group was always eligible to be scheduled, but
the 'a57' group never eligible to be scheduled, as the task was always
affine to a Cortex-A53 CPU. The SW (group leader) event in the 'a57'
group was eligible, but the HW event failed at pmu::add() time,
resulting in ctx_flexible_sched_in giving up on scheduling further
groups with HW events.
One way of avoiding this is to check pmu::filter_match() on siblings
as well as the group leader. If any of these fail their
pmu::filter_match() call, we must skip the entire group before
attempting to add any events.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Fixes: 66eb579e66 ("perf: allow for PMU-specific event filtering")
Link: http://lkml.kernel.org/r/1465917041-15339-1-git-send-email-mark.rutland@arm.com
[ Small readability edits. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 28b89b9e6f7b6c8fef7b3af39828722bca20cfee upstream.
A discrepancy between cpu_online_mask and cpuset's effective_cpus
mask is inevitable during hotplug since cpuset defers updating of
effective_cpus mask using a workqueue, during which time nothing
prevents the system from more hotplug operations. For that reason
guarantee_online_cpus() walks up the cpuset hierarchy until it finds
an intersection under the assumption that top cpuset's effective_cpus
mask intersects with cpu_online_mask even with such a race occurring.
However a sequence of CPU hotplugs can open a time window, during which
none of the effective CPUs in the top cpuset intersect with
cpu_online_mask.
For example when there are 4 possible CPUs 0-3 and only CPU0 is online:
======================== ===========================
cpu_online_mask top_cpuset.effective_cpus
======================== ===========================
echo 1 > cpu2/online.
CPU hotplug notifier woke up hotplug work but not yet scheduled.
[0,2] [0]
echo 0 > cpu0/online.
The workqueue is still runnable.
[2] [0]
======================== ===========================
Now there is no intersection between cpu_online_mask and
top_cpuset.effective_cpus. Thus invoking sys_sched_setaffinity() at
this moment can cause following:
Unable to handle kernel NULL pointer dereference at virtual address 000000d0
------------[ cut here ]------------
Kernel BUG at ffffffc0001389b0 [verbose debug info unavailable]
Internal error: Oops - BUG: 96000005 [#1] PREEMPT SMP
Modules linked in:
CPU: 2 PID: 1420 Comm: taskset Tainted: G W 4.4.8+ #98
task: ffffffc06a5c4880 ti: ffffffc06e124000 task.ti: ffffffc06e124000
PC is at guarantee_online_cpus+0x2c/0x58
LR is at cpuset_cpus_allowed+0x4c/0x6c
<snip>
Process taskset (pid: 1420, stack limit = 0xffffffc06e124020)
Call trace:
[<ffffffc0001389b0>] guarantee_online_cpus+0x2c/0x58
[<ffffffc00013b208>] cpuset_cpus_allowed+0x4c/0x6c
[<ffffffc0000d61f0>] sched_setaffinity+0xc0/0x1ac
[<ffffffc0000d6374>] SyS_sched_setaffinity+0x98/0xac
[<ffffffc000085cb0>] el0_svc_naked+0x24/0x28
The top cpuset's effective_cpus are guaranteed to be identical to
cpu_online_mask eventually. Hence fall back to cpu_online_mask when
there is no intersection between top cpuset's effective_cpus and
cpu_online_mask.
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: cgroups@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 924d8696751c4b9e58263bc82efdafcf875596a6 upstream.
rtree_next_node() walks the linked list of leaf nodes to find the next
block of pages in the struct memory_bitmap. If it walks off the end of
the list of nodes, it walks the list of memory zones to find the next
region of memory. If it walks off the end of the list of zones, it
returns false.
This leaves the struct bm_position's node and zone pointers pointing
at their respective struct list_heads in struct mem_zone_bm_rtree.
memory_bm_find_bit() uses struct bm_position's node and zone pointers
to avoid walking lists and trees if the next bit appears in the same
node/zone. It handles these values being stale.
Swap rtree_next_node()s 'step then test' to 'test-next then step',
this means if we reach the end of memory we return false and leave
the node and zone pointers as they were.
This fixes a panic on resume using AMD Seattle with 64K pages:
[ 6.868732] Freezing user space processes ... (elapsed 0.000 seconds) done.
[ 6.875753] Double checking all user space processes after OOM killer disable... (elapsed 0.000 seconds)
[ 6.896453] PM: Using 3 thread(s) for decompression.
[ 6.896453] PM: Loading and decompressing image data (5339 pages)...
[ 7.318890] PM: Image loading progress: 0%
[ 7.323395] Unable to handle kernel paging request at virtual address 00800040
[ 7.330611] pgd = ffff000008df0000
[ 7.334003] [00800040] *pgd=00000083fffe0003, *pud=00000083fffe0003, *pmd=00000083fffd0003, *pte=0000000000000000
[ 7.344266] Internal error: Oops: 96000005 [#1] PREEMPT SMP
[ 7.349825] Modules linked in:
[ 7.352871] CPU: 2 PID: 1 Comm: swapper/0 Tainted: G W I 4.8.0-rc1 #4737
[ 7.360512] Hardware name: AMD Overdrive/Supercharger/Default string, BIOS ROD1002C 04/08/2016
[ 7.369109] task: ffff8003c0220000 task.stack: ffff8003c0280000
[ 7.375020] PC is at set_bit+0x18/0x30
[ 7.378758] LR is at memory_bm_set_bit+0x24/0x30
[ 7.383362] pc : [<ffff00000835bbc8>] lr : [<ffff0000080faf18>] pstate: 60000045
[ 7.390743] sp : ffff8003c0283b00
[ 7.473551]
[ 7.475031] Process swapper/0 (pid: 1, stack limit = 0xffff8003c0280020)
[ 7.481718] Stack: (0xffff8003c0283b00 to 0xffff8003c0284000)
[ 7.800075] Call trace:
[ 7.887097] [<ffff00000835bbc8>] set_bit+0x18/0x30
[ 7.891876] [<ffff0000080fb038>] duplicate_memory_bitmap.constprop.38+0x54/0x70
[ 7.899172] [<ffff0000080fcc40>] snapshot_write_next+0x22c/0x47c
[ 7.905166] [<ffff0000080fe1b4>] load_image_lzo+0x754/0xa88
[ 7.910725] [<ffff0000080ff0a8>] swsusp_read+0x144/0x230
[ 7.916025] [<ffff0000080fa338>] load_image_and_restore+0x58/0x90
[ 7.922105] [<ffff0000080fa660>] software_resume+0x2f0/0x338
[ 7.927752] [<ffff000008083350>] do_one_initcall+0x38/0x11c
[ 7.933314] [<ffff000008b40cc0>] kernel_init_freeable+0x14c/0x1ec
[ 7.939395] [<ffff0000087ce564>] kernel_init+0x10/0xfc
[ 7.944520] [<ffff000008082e90>] ret_from_fork+0x10/0x40
[ 7.949820] Code: d2800022 8b400c21 f9800031 9ac32043 (c85f7c22)
[ 7.955909] ---[ end trace 0024a5986e6ff323 ]---
[ 7.960529] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
Here struct mem_zone_bm_rtree's start_pfn has been returned instead of
struct rtree_node's addr as the node/zone pointers are corrupt after
we walked off the end of the lists during mark_unsafe_pages().
This behaviour was exposed by commit 6dbecfd345a6 ("PM / hibernate:
Simplify mark_unsafe_pages()"), which caused mark_unsafe_pages() to call
duplicate_memory_bitmap(), which uses memory_bm_find_bit() after walking
off the end of the memory bitmap.
Fixes: 3a20cb1779 (PM / Hibernate: Implement position keeping in radix tree)
Signed-off-by: James Morse <james.morse@arm.com>
[ rjw: Subject ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 62822e2ec4ad091ba31f823f577ef80db52e3c2c upstream.
Restore the processor state before calling any other functions to
ensure per-CPU variables can be used with KASLR memory randomization.
Tracing functions use per-CPU variables (GS based on x86) and one was
called just before restoring the processor state fully. It resulted
in a double fault when both the tracing & the exception handler
functions tried to use a per-CPU variable.
Fixes: bb3632c610 (PM / sleep: trace events for suspend/resume)
Reported-and-tested-by: Borislav Petkov <bp@suse.de>
Reported-by: Jiri Kosina <jikos@kernel.org>
Tested-by: Rafael J. Wysocki <rafael@kernel.org>
Tested-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Thomas Garnier <thgarnie@google.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 1245800c0f96eb6ebb368593e251d66c01e61022 upstream.
The iter->seq can be reset outside the protection of the mutex. So can
reading of user data. Move the mutex up to the beginning of the function.
Fixes: d7350c3f45 ("tracing/core: make the read callbacks reentrants")
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 377ccbb483738f84400ddf5840c7dd8825716985 upstream.
With the latest gcc compilers, they give a warning if
__builtin_return_address() parameter is greater than 0. That is because if
it is used by a function called by a top level function (or in the case of
the kernel, by assembly), it can try to access stack frames outside the
stack and crash the system.
The tracing system uses __builtin_return_address() of up to 2! But it is
well aware of the dangers that it may have, and has even added precautions
to protect against it (see the thunk code in arch/x86/entry/thunk*.S)
Linus originally added KBUILD_CFLAGS that would suppress the warning for the
entire kernel, as simply adding KBUILD_CFLAGS to the tracing directory
wouldn't work. The tracing directory plays a bit with the CFLAGS and
requires a little more logic.
This adds that special logic to only suppress the warning for the tracing
directory. If it is used anywhere else outside of tracing, the warning will
still be triggered.
Link: http://lkml.kernel.org/r/20160728223043.51996267@grimm.local.home
Tested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> 2 ../kernel/cpuset.c:2101:11: warning: initialization from incompatible pointer type [-Wincompatible-pointer-types]
> 1 ../kernel/cpuset.c:2101:2: warning: initialization from incompatible pointer type
> 1 ../kernel/cpuset.c:2101:2: warning: (near initialization for 'cpuset_cgrp_subsys.fork')
This got introduced by 06ec7a1d76 ("cpuset: make sure new tasks
conform to the current config of the cpuset"). In the upstream
kernel, the function prototype was changed as of b53202e63089
("cgroup: kill cgrp_ss_priv[CGROUP_CANFORK_COUNT] and friends").
That patch is not suitable for stable kernels, and fortunately
the warning seems harmless as the prototypes only differ in the
second argument that is unused. Adding that argument gets rid
of the warning:
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Unfortunately we record PIDs in audit records using a variety of
methods despite the correct way being the use of task_tgid_nr().
This patch converts all of these callers, except for the case of
AUDIT_SET in audit_receive_msg() (see the comment in the code).
Reported-by: Jeff Vander Stoep <jeffv@google.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Bug: 28952093
(cherry picked from commit fa2bea2f5cca5b8d4a3e5520d2e8c0ede67ac108)
Signed-off-by: Jeff Vander Stoep <jeffv@google.com>
Change-Id: If6645f9de8bc58ed9755f28dc6af5fbf08d72a00
commit 4364e1a29be16b2783c0bcbc263f61236af64281 upstream.
virq is not required to be the same for all msi descs. Use the base irq number
from the desc in the debug printk.
Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 236dec051078a8691950f56949612b4b74107e48 upstream.
Using "make tinyconfig" produces a couple of annoying warnings that show
up for build test machines all the time:
.config:966:warning: override: NOHIGHMEM changes choice state
.config:965:warning: override: SLOB changes choice state
.config:963:warning: override: KERNEL_XZ changes choice state
.config:962:warning: override: CC_OPTIMIZE_FOR_SIZE changes choice state
.config:933:warning: override: SLOB changes choice state
.config:930:warning: override: CC_OPTIMIZE_FOR_SIZE changes choice state
.config:870:warning: override: SLOB changes choice state
.config:868:warning: override: KERNEL_XZ changes choice state
.config:867:warning: override: CC_OPTIMIZE_FOR_SIZE changes choice state
I've made a previous attempt at fixing them and we discussed a number of
alternatives.
I tried changing the Makefile to use "merge_config.sh -n
$(fragment-list)" but couldn't get that to work properly.
This is yet another approach, based on the observation that we do want
to see a warning for conflicting 'choice' options, and that we can
simply make them non-conflicting by listing all other options as
disabled. This is a trivial patch that we can apply independent of
plans for other changes.
Link: http://lkml.kernel.org/r/20160829214952.1334674-2-arnd@arndb.de
Link: https://storage.kernelci.org/mainline/v4.7-rc6/x86-tinyconfig/build.loghttps://patchwork.kernel.org/patch/9212749/
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Reviewed-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 135e8c9250dd5c8c9aae5984fde6f230d0cbfeaf upstream.
The origin of the issue I've seen is related to
a missing memory barrier between check for task->state and
the check for task->on_rq.
The task being woken up is already awake from a schedule()
and is doing the following:
do {
schedule()
set_current_state(TASK_(UN)INTERRUPTIBLE);
} while (!cond);
The waker, actually gets stuck doing the following in
try_to_wake_up():
while (p->on_cpu)
cpu_relax();
Analysis:
The instance I've seen involves the following race:
CPU1 CPU2
while () {
if (cond)
break;
do {
schedule();
set_current_state(TASK_UN..)
} while (!cond);
wakeup_routine()
spin_lock_irqsave(wait_lock)
raw_spin_lock_irqsave(wait_lock) wake_up_process()
} try_to_wake_up()
set_current_state(TASK_RUNNING); ..
list_del(&waiter.list);
CPU2 wakes up CPU1, but before it can get the wait_lock and set
current state to TASK_RUNNING the following occurs:
CPU3
wakeup_routine()
raw_spin_lock_irqsave(wait_lock)
if (!list_empty)
wake_up_process()
try_to_wake_up()
raw_spin_lock_irqsave(p->pi_lock)
..
if (p->on_rq && ttwu_wakeup())
..
while (p->on_cpu)
cpu_relax()
..
CPU3 tries to wake up the task on CPU1 again since it finds
it on the wait_queue, CPU1 is spinning on wait_lock, but immediately
after CPU2, CPU3 got it.
CPU3 checks the state of p on CPU1, it is TASK_UNINTERRUPTIBLE and
the task is spinning on the wait_lock. Interestingly since p->on_rq
is checked under pi_lock, I've noticed that try_to_wake_up() finds
p->on_rq to be 0. This was the most confusing bit of the analysis,
but p->on_rq is changed under runqueue lock, rq_lock, the p->on_rq
check is not reliable without this fix IMHO. The race is visible
(based on the analysis) only when ttwu_queue() does a remote wakeup
via ttwu_queue_remote. In which case the p->on_rq change is not
done uder the pi_lock.
The result is that after a while the entire system locks up on
the raw_spin_irqlock_save(wait_lock) and the holder spins infintely
Reproduction of the issue:
The issue can be reproduced after a long run on my system with 80
threads and having to tweak available memory to very low and running
memory stress-ng mmapfork test. It usually takes a long time to
reproduce. I am trying to work on a test case that can reproduce
the issue faster, but thats work in progress. I am still testing the
changes on my still in a loop and the tests seem OK thus far.
Big thanks to Benjamin and Nick for helping debug this as well.
Ben helped catch the missing barrier, Nick caught every missing
bit in my theory.
Signed-off-by: Balbir Singh <bsingharora@gmail.com>
[ Updated comment to clarify matching barriers. Many
architectures do not have a full barrier in switch_to()
so that cannot be relied upon. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Alexey Kardashevskiy <aik@ozlabs.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nicholas Piggin <nicholas.piggin@gmail.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/e02cce7b-d9ca-1ad0-7a61-ea97c7582b37@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 06f4e94898918bcad00cdd4d349313a439d6911e upstream.
A new task inherits cpus_allowed and mems_allowed masks from its parent,
but if someone changes cpuset's config by writing to cpuset.cpus/cpuset.mems
before this new task is inserted into the cgroup's task list, the new task
won't be updated accordingly.
Signed-off-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 5efc244346f9f338765da3d592f7947b0afdc4b5 upstream.
Prior to the change the function would blindly deference mm, exe_file
and exe_file->f_inode, each of which could have been NULL or freed.
Use get_task_exe_file to safely obtain stable exe_file.
Signed-off-by: Mateusz Guzik <mguzik@redhat.com>
Acked-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Acked-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit cd81a9170e69e018bbaba547c1fd85a585f5697a upstream.
For more convenient access if one has a pointer to the task.
As a minor nit take advantage of the fact that only task lock + rcu are
needed to safely grab ->exe_file. This saves mm refcount dance.
Use the helper in proc_exe_link.
Signed-off-by: Mateusz Guzik <mguzik@redhat.com>
Acked-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Acked-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 070c43eea5043e950daa423707ae3c77e2f48edb upstream.
If kexec_apply_relocations fails, kexec_load_purgatory frees pi->sechdrs
and pi->purgatory_buf. This is redundant, because in case of error
kimage_file_prepare_segments calls kimage_file_post_load_cleanup, which
will also free those buffers.
This causes two warnings like the following, one for pi->sechdrs and the
other for pi->purgatory_buf:
kexec-bzImage64: Loading purgatory failed
------------[ cut here ]------------
WARNING: CPU: 1 PID: 2119 at mm/vmalloc.c:1490 __vunmap+0xc1/0xd0
Trying to vfree() nonexistent vm area (ffffc90000e91000)
Modules linked in:
CPU: 1 PID: 2119 Comm: kexec Not tainted 4.8.0-rc3+ #5
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Call Trace:
dump_stack+0x4d/0x65
__warn+0xcb/0xf0
warn_slowpath_fmt+0x4f/0x60
? find_vmap_area+0x19/0x70
? kimage_file_post_load_cleanup+0x47/0xb0
__vunmap+0xc1/0xd0
vfree+0x2e/0x70
kimage_file_post_load_cleanup+0x5e/0xb0
SyS_kexec_file_load+0x448/0x680
? putname+0x54/0x60
? do_sys_open+0x190/0x1f0
entry_SYSCALL_64_fastpath+0x13/0x8f
---[ end trace 158bb74f5950ca2b ]---
Fix by setting pi->sechdrs an pi->purgatory_buf to NULL, since vfree
won't try to free a NULL pointer.
Link: http://lkml.kernel.org/r/1472083546-23683-1-git-send-email-bauerman@linux.vnet.ibm.com
Signed-off-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
It may be useful to debug writes to the readonly sections of memory,
so provide a cmdline "rodata=off" to allow for this. This can be
expanded in the future to support "log" and "write" modes, but that
will need to be architecture-specific.
This also makes KDB software breakpoints more usable, as read-only
mappings can now be disabled on any kernel.
Suggested-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Brown <david.brown@linaro.org>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Emese Revfy <re.emese@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathias Krause <minipli@googlemail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: PaX Team <pageexec@freemail.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kernel-hardening@lists.openwall.com
Cc: linux-arch <linux-arch@vger.kernel.org>
Link: http://lkml.kernel.org/r/1455748879-21872-3-git-send-email-keescook@chromium.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Bug: 31660652
Change-Id: I67b818ca390afdd42ab1c27cb4f8ac64bbdb3b65
(cherry picked from commit d2aa1acad22f1bdd0cfa67b3861800e392254454)
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
The ENERGY_AWARE sched feature flag cannot be set unless
CONFIG_SCHED_DEBUG is enabled.
So this patch allows the flag to default to true at build time
if the config is set.
Change-Id: I8835a571fdb7a8f8ee6a54af1e11a69f3b5ce8e6
Signed-off-by: John Stultz <john.stultz@linaro.org>
It will cause deadlock and while(1) if call printk while schedule is in
progress. The block state like as below:
cpu0(hold the console sem):
printk->console_unlock->up_sem->spin_lock(&sem->lock)->wake_up_process(cpu1)
->try_to_wake_up(cpu1)->while(p->on_cpu).
cpu1(request console sem):
console_lock->down_sem->schedule->idle_banlance->update_cpu_capacity->
printk->console_trylock->spin_lock(&sem->lock).
p->on_cpu will be 1 forever, because the task is still running on cpu1,
so cpu0 is blocked in while(p->on_cpu), but cpu1 could not get
spin_lock(&sem->lock), it is blocked too, it means the task will running
on cpu1 forever.
Signed-off-by: Caesar Wang <wxt@rock-chips.com>
On at least one platform, occasionally the timer providing the wallclock
was able to be reset/go backwards for at least some time after wakeup.
Accept that this might happen and warn the first time, but otherwise just
carry on.
Change-Id: Id3164477ba79049561af7f0889cbeebc199ead4e
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
commit 6c4687cc17a788a6dd8de3e27dbeabb7cbd3e066 upstream.
__replace_page() wronlgy calls mem_cgroup_cancel_charge() in "success" path,
it should only do this if page_check_address() fails.
This means that every enable/disable leads to unbalanced mem_cgroup_uncharge()
from put_page(old_page), it is trivial to underflow the page_counter->count
and trigger OOM.
Reported-and-tested-by: Brenden Blanco <bblanco@plumgrid.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Fixes: 00501b531c ("mm: memcontrol: rewrite charge API")
Link: http://lkml.kernel.org/r/20160817153629.GB29724@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 27727df240c7cc84f2ba6047c6f18d5addfd25ef upstream.
When I added some extra sanity checking in timekeeping_get_ns() under
CONFIG_DEBUG_TIMEKEEPING, I missed that the NMI safe __ktime_get_fast_ns()
method was using timekeeping_get_ns().
Thus the locking added to the debug checks broke the NMI-safety of
__ktime_get_fast_ns().
This patch open-codes the timekeeping_get_ns() logic for
__ktime_get_fast_ns(), so can avoid any deadlocks in NMI.
Fixes: 4ca22c2648 "timekeeping: Add warnings when overflows or underflows are observed"
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Link: http://lkml.kernel.org/r/1471993702-29148-2-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit a4f8f6667f099036c88f231dcad4cf233652c824 upstream.
It was reported that hibernation could fail on the 2nd attempt, where the
system hangs at hibernate() -> syscore_resume() -> i8237A_resume() ->
claim_dma_lock(), because the lock has already been taken.
However there is actually no other process would like to grab this lock on
that problematic platform.
Further investigation showed that the problem is triggered by setting
/sys/power/pm_trace to 1 before the 1st hibernation.
Since once pm_trace is enabled, the rtc becomes unmeaningful after suspend,
and meanwhile some BIOSes would like to adjust the 'invalid' RTC (e.g, smaller
than 1970) to the release date of that motherboard during POST stage, thus
after resumed, it may seem that the system had a significant long sleep time
which is a completely meaningless value.
Then in timekeeping_resume -> tk_debug_account_sleep_time, if the bit31 of the
sleep time happened to be set to 1, fls() returns 32 and we add 1 to
sleep_time_bin[32], which causes an out of bounds array access and therefor
memory being overwritten.
As depicted by System.map:
0xffffffff81c9d080 b sleep_time_bin
0xffffffff81c9d100 B dma_spin_lock
the dma_spin_lock.val is set to 1, which caused this problem.
This patch adds a sanity check in tk_debug_account_sleep_time()
to ensure we don't index past the sleep_time_bin array.
[jstultz: Problem diagnosed and original patch by Chen Yu, I've solved the
issue slightly differently, but borrowed his excelent explanation of the
issue here.]
Fixes: 5c83545f24 "power: Add option to log time spent in suspend"
Reported-by: Janek Kozicki <cosurgi@gmail.com>
Reported-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: linux-pm@vger.kernel.org
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Xunlei Pang <xpang@redhat.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Zhang Rui <rui.zhang@intel.com>
Link: http://lkml.kernel.org/r/1471993702-29148-3-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 568ac888215c7fb2fabe8ea739b00ec3c1f5d440 upstream.
cgroup_threadgroup_rwsem is acquired in read mode during process exit
and fork. It is also grabbed in write mode during
__cgroups_proc_write(). I've recently run into a scenario with lots
of memory pressure and OOM and I am beginning to see
systemd
__switch_to+0x1f8/0x350
__schedule+0x30c/0x990
schedule+0x48/0xc0
percpu_down_write+0x114/0x170
__cgroup_procs_write.isra.12+0xb8/0x3c0
cgroup_file_write+0x74/0x1a0
kernfs_fop_write+0x188/0x200
__vfs_write+0x6c/0xe0
vfs_write+0xc0/0x230
SyS_write+0x6c/0x110
system_call+0x38/0xb4
This thread is waiting on the reader of cgroup_threadgroup_rwsem to
exit. The reader itself is under memory pressure and has gone into
reclaim after fork. There are times the reader also ends up waiting on
oom_lock as well.
__switch_to+0x1f8/0x350
__schedule+0x30c/0x990
schedule+0x48/0xc0
jbd2_log_wait_commit+0xd4/0x180
ext4_evict_inode+0x88/0x5c0
evict+0xf8/0x2a0
dispose_list+0x50/0x80
prune_icache_sb+0x6c/0x90
super_cache_scan+0x190/0x210
shrink_slab.part.15+0x22c/0x4c0
shrink_zone+0x288/0x3c0
do_try_to_free_pages+0x1dc/0x590
try_to_free_pages+0xdc/0x260
__alloc_pages_nodemask+0x72c/0xc90
alloc_pages_current+0xb4/0x1a0
page_table_alloc+0xc0/0x170
__pte_alloc+0x58/0x1f0
copy_page_range+0x4ec/0x950
copy_process.isra.5+0x15a0/0x1870
_do_fork+0xa8/0x4b0
ppc_clone+0x8/0xc
In the meanwhile, all processes exiting/forking are blocked almost
stalling the system.
This patch moves the threadgroup_change_begin from before
cgroup_fork() to just before cgroup_canfork(). There is no nee to
worry about threadgroup changes till the task is actually added to the
threadgroup. This avoids having to call reclaim with
cgroup_threadgroup_rwsem held.
tj: Subject and description edits.
Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Acked-by: Zefan Li <lizefan@huawei.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 98f368e9e2630a3ce3e80fb10fb2e02038cf9578 upstream.
When checking the current cred for a capability in a specific user
namespace, it isn't always desirable to have the LSMs audit the check.
This patch adds a noaudit variant of ns_capable() for when those
situations arise.
The common logic between ns_capable() and the new ns_capable_noaudit()
is moved into a single, shared function to keep duplicated code to a
minimum and ease maintainability.
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Signed-off-by: James Morris <james.l.morris@oracle.com>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 5f65e5ca286126a60f62c8421b77c2018a482b8a ]
Using INVALID_[UG]ID for the LSM file creation context doesn't
make sense, so return an error if the inode passed to
set_create_file_as() has an invalid id.
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit bbf66d897adf2bb0c310db96c97e8db6369f39e1 ]
Hyper-V vmbus module registers TSC page clocksource when loaded. This is
the clocksource with the highest rating and thus it becomes the watchdog
making unloading of the vmbus module impossible.
Separate clocksource_select_watchdog() from clocksource_enqueue_watchdog()
and use it on clocksource register/rating change/unregister.
After all, lobotomized monkeys may need some love too.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Dexuan Cui <decui@microsoft.com>
Cc: K. Y. Srinivasan <kys@microsoft.com>
Link: http://lkml.kernel.org/r/1453483913-25672-1-git-send-email-vkuznets@redhat.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit dd4e17ab704269bce71402285f5e8b9ac24b1eff ]
Recently, in commit 37cf4dc3370f I forgot to check if the timeval being passed
was actually a timespec (as is signaled with ADJ_NANO).
This resulted in that patch breaking ADJ_SETOFFSET users who set
ADJ_NANO, by rejecting valid timespecs that were compared with
valid timeval ranges.
This patch addresses this by checking for the ADJ_NANO flag and
using the timepsec check instead in that case.
Reported-by: Harald Hoyer <harald@redhat.com>
Reported-by: Kay Sievers <kay@vrfy.org>
Fixes: 37cf4dc3370f "time: Verify time values in adjtimex ADJ_SETOFFSET to avoid overflow"
Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: David Herrmann <dh.herrmann@gmail.com>
Link: http://lkml.kernel.org/r/1453417415-19110-2-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 37cf4dc3370fbca0344e23bb96446eb2c3548ba7 ]
For adjtimex()'s ADJ_SETOFFSET, make sure the tv_usec value is
sane. We might multiply them later which can cause an overflow
and undefined behavior.
This patch introduces new helper functions to simplify the
checking code and adds comments to clarify
Orginally this patch was by Sasha Levin, but I've basically
rewritten it, so he should get credit for finding the issue
and I should get the blame for any mistakes made since.
Also, credit to Richard Cochran for the phrasing used in the
comment for what is considered valid here.
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 1dff76b92f69051e579bdc131e01500da9fa2a91 ]
The following message can be observed on the Ubuntu v3.13.0-65 with KASan
backported:
==================================================================
BUG: KASan: use after free in task_numa_find_cpu+0x64c/0x890 at addr ffff880dd393ecd8
Read of size 8 by task qemu-system-x86/3998900
=============================================================================
BUG kmalloc-128 (Tainted: G B ): kasan: bad access detected
-----------------------------------------------------------------------------
INFO: Allocated in task_numa_fault+0xc1b/0xed0 age=41980 cpu=18 pid=3998890
__slab_alloc+0x4f8/0x560
__kmalloc+0x1eb/0x280
task_numa_fault+0xc1b/0xed0
do_numa_page+0x192/0x200
handle_mm_fault+0x808/0x1160
__do_page_fault+0x218/0x750
do_page_fault+0x1a/0x70
page_fault+0x28/0x30
SyS_poll+0x66/0x1a0
system_call_fastpath+0x1a/0x1f
INFO: Freed in task_numa_free+0x1d2/0x200 age=62 cpu=18 pid=0
__slab_free+0x2ab/0x3f0
kfree+0x161/0x170
task_numa_free+0x1d2/0x200
finish_task_switch+0x1d2/0x210
__schedule+0x5d4/0xc60
schedule_preempt_disabled+0x40/0xc0
cpu_startup_entry+0x2da/0x340
start_secondary+0x28f/0x360
Call Trace:
[<ffffffff81a6ce35>] dump_stack+0x45/0x56
[<ffffffff81244aed>] print_trailer+0xfd/0x170
[<ffffffff8124ac36>] object_err+0x36/0x40
[<ffffffff8124cbf9>] kasan_report_error+0x1e9/0x3a0
[<ffffffff8124d260>] kasan_report+0x40/0x50
[<ffffffff810dda7c>] ? task_numa_find_cpu+0x64c/0x890
[<ffffffff8124bee9>] __asan_load8+0x69/0xa0
[<ffffffff814f5c38>] ? find_next_bit+0xd8/0x120
[<ffffffff810dda7c>] task_numa_find_cpu+0x64c/0x890
[<ffffffff810de16c>] task_numa_migrate+0x4ac/0x7b0
[<ffffffff810de523>] numa_migrate_preferred+0xb3/0xc0
[<ffffffff810e0b88>] task_numa_fault+0xb88/0xed0
[<ffffffff8120ef02>] do_numa_page+0x192/0x200
[<ffffffff81211038>] handle_mm_fault+0x808/0x1160
[<ffffffff810d7dbd>] ? sched_clock_cpu+0x10d/0x160
[<ffffffff81068c52>] ? native_load_tls+0x82/0xa0
[<ffffffff81a7bd68>] __do_page_fault+0x218/0x750
[<ffffffff810c2186>] ? hrtimer_try_to_cancel+0x76/0x160
[<ffffffff81a6f5e7>] ? schedule_hrtimeout_range_clock.part.24+0xf7/0x1c0
[<ffffffff81a7c2ba>] do_page_fault+0x1a/0x70
[<ffffffff81a772e8>] page_fault+0x28/0x30
[<ffffffff8128cbd4>] ? do_sys_poll+0x1c4/0x6d0
[<ffffffff810e64f6>] ? enqueue_task_fair+0x4b6/0xaa0
[<ffffffff810233c9>] ? sched_clock+0x9/0x10
[<ffffffff810cf70a>] ? resched_task+0x7a/0xc0
[<ffffffff810d0663>] ? check_preempt_curr+0xb3/0x130
[<ffffffff8128b5c0>] ? poll_select_copy_remaining+0x170/0x170
[<ffffffff810d3bc0>] ? wake_up_state+0x10/0x20
[<ffffffff8112a28f>] ? drop_futex_key_refs.isra.14+0x1f/0x90
[<ffffffff8112d40e>] ? futex_requeue+0x3de/0xba0
[<ffffffff8112e49e>] ? do_futex+0xbe/0x8f0
[<ffffffff81022c89>] ? read_tsc+0x9/0x20
[<ffffffff8111bd9d>] ? ktime_get_ts+0x12d/0x170
[<ffffffff8108f699>] ? timespec_add_safe+0x59/0xe0
[<ffffffff8128d1f6>] SyS_poll+0x66/0x1a0
[<ffffffff81a830dd>] system_call_fastpath+0x1a/0x1f
As commit 1effd9f193 ("sched/numa: Fix unsafe get_task_struct() in
task_numa_assign()") points out, the rcu_read_lock() cannot protect the
task_struct from being freed in the finish_task_switch(). And the bug
happens in the process of calculation of imp which requires the access of
p->numa_faults being freed in the following path:
do_exit()
current->flags |= PF_EXITING;
release_task()
~~delayed_put_task_struct()~~
schedule()
...
...
rq->curr = next;
context_switch()
finish_task_switch()
put_task_struct()
__put_task_struct()
task_numa_free()
The fix here to get_task_struct() early before end of dst_rq->lock to
protect the calculation process and also put_task_struct() in the
corresponding point if finally the dst_rq->curr somehow cannot be
assigned.
Additional credit to Liang Chen who helped fix the error logic and add the
put_task_struct() to the place it missed.
Signed-off-by: Gavin Guo <gavin.guo@canonical.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: jay.vosburgh@canonical.com
Cc: liang.chen@canonical.com
Link: http://lkml.kernel.org/r/1453264618-17645-1-git-send-email-gavin.guo@canonical.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 9006a01829a50cfd6bbd4980910ed46e895e93d7 ]
It is way too easy to take any random clockid and feed it to
the hrtimer subsystem. At best, it gets mapped to a monotonic
base, but it would be better to just catch illegal values as
early as possible.
This patch does exactly that, mapping illegal clockids to an
illegal base index, and panicing when we detect the illegal
condition.
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Cc: Tomasz Nowicki <tn@semihalf.com>
Cc: Christoffer Dall <christoffer.dall@linaro.org>
Link: http://lkml.kernel.org/r/1452879670-16133-3-git-send-email-marc.zyngier@arm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>