Commit graph

21691 commits

Author SHA1 Message Date
Rusty Russell
610dde5afb modules: fix longstanding /proc/kallsyms vs module insertion race.
commit 8244062ef1e54502ef55f54cced659913f244c3e upstream.

For CONFIG_KALLSYMS, we keep two symbol tables and two string tables.
There's one full copy, marked SHF_ALLOC and laid out at the end of the
module's init section.  There's also a cut-down version that only
contains core symbols and strings, and lives in the module's core
section.

After module init (and before we free the module memory), we switch
the mod->symtab, mod->num_symtab and mod->strtab to point to the core
versions.  We do this under the module_mutex.

However, kallsyms doesn't take the module_mutex: it uses
preempt_disable() and rcu tricks to walk through the modules, because
it's used in the oops path.  It's also used in /proc/kallsyms.
There's nothing atomic about the change of these variables, so we can
get the old (larger!) num_symtab and the new symtab pointer; in fact
this is what I saw when trying to reproduce.

By grouping these variables together, we can use a
carefully-dereferenced pointer to ensure we always get one or the
other (the free of the module init section is already done in an RCU
callback, so that's safe).  We allocate the init one at the end of the
module init section, and keep the core one inside the struct module
itself (it could also have been allocated at the end of the module
core, but that's probably overkill).

[ Rebased for 4.4-stable and older, because the following changes aren't
  in the older trees:
  - e0224418516b4d8a6c2160574bac18447c354ef0: adds arg to is_core_symbol
  - 7523e4dc5057e157212b4741abd6256e03404cf1: module_init/module_core/init_size/core_size
    become init_layout.base/core_layout.base/init_layout.size/core_layout.size.
]

Reported-by: Weilong Chen <chenweilong@huawei.com>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=111541
Cc: stable@kernel.org
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-09 15:34:56 -08:00
Steven Rostedt (Red Hat)
180c86a4f0 tracing: Do not have 'comm' filter override event 'comm' field
commit e57cbaf0eb006eaa207395f3bfd7ce52c1b5539c upstream.

Commit 9f61668073 "tracing: Allow triggers to filter for CPU ids and
process names" added a 'comm' filter that will filter events based on the
current tasks struct 'comm'. But this now hides the ability to filter events
that have a 'comm' field too. For example, sched_migrate_task trace event.
That has a 'comm' field of the task to be migrated.

 echo 'comm == "bash"' > events/sched_migrate_task/filter

will now filter all sched_migrate_task events for tasks named "bash" that
migrates other tasks (in interrupt context), instead of seeing when "bash"
itself gets migrated.

This fix requires a couple of changes.

1) Change the look up order for filter predicates to look at the events
   fields before looking at the generic filters.

2) Instead of basing the filter function off of the "comm" name, have the
   generic "comm" filter have its own filter_type (FILTER_COMM). Test
   against the type instead of the name to assign the filter function.

3) Add a new "COMM" filter that works just like "comm" but will filter based
   on the current task, even if the trace event contains a "comm" field.

Do the same for "cpu" field, adding a FILTER_CPU and a filter "CPU".

Fixes: 9f61668073 "tracing: Allow triggers to filter for CPU ids and process names"
Reported-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-09 15:34:52 -08:00
Simon Guinot
f634ac98bd kernel/resource.c: fix muxed resource handling in __request_region()
commit 59ceeaaf355fa0fb16558ef7c24413c804932ada upstream.

In __request_region, if a conflict with a BUSY and MUXED resource is
detected, then the caller goes to sleep and waits for the resource to be
released.  A pointer on the conflicting resource is kept.  At wake-up
this pointer is used as a parent to retry to request the region.

A first problem is that this pointer might well be invalid (if for
example the conflicting resource have already been freed).  Another
problem is that the next call to __request_region() fails to detect a
remaining conflict.  The previously conflicting resource is passed as a
parameter and __request_region() will look for a conflict among the
children of this resource and not at the resource itself.  It is likely
to succeed anyway, even if there is still a conflict.

Instead, the parent of the conflicting resource should be passed to
__request_region().

As a fix, this patch doesn't update the parent resource pointer in the
case we have to wait for a muxed region right after.

Reported-and-tested-by: Vincent Pelletier <plr.vincent@gmail.com>
Signed-off-by: Simon Guinot <simon.guinot@sequanux.org>
Tested-by: Vincent Donnefort <vdonnefort@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-03 15:07:29 -08:00
Steven Rostedt (Red Hat)
36b53e8b2a tracing: Fix showing function event in available_events
commit d045437a169f899dfb0f6f7ede24cc042543ced9 upstream.

The ftrace:function event is only displayed for parsing the function tracer
data. It is not used to enable function tracing, and does not include an
"enable" file in its event directory.

Originally, this event was kept separate from other events because it did
not have a ->reg parameter. But perf added a "reg" parameter for its use
which caused issues, because it made the event available to functions where
it was not compatible for.

Commit 9b63776fa3 "tracing: Do not enable function event with enable"
added a TRACE_EVENT_FL_IGNORE_ENABLE flag that prevented the function event
from being enabled by normal trace events. But this commit missed keeping
the function event from being displayed by the "available_events" directory,
which is used to show what events can be enabled by set_event.

One documented way to enable all events is to:

 cat available_events > set_event

But because the function event is displayed in the available_events, this
now causes an INVALID error:

 cat: write error: Invalid argument

Reported-by: Chunyu Hu <chuhu@redhat.com>
Fixes: 9b63776fa3 "tracing: Do not enable function event with enable"
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-03 15:07:29 -08:00
Tejun Heo
4cbd196324 cgroup: make sure a parent css isn't offlined before its children
commit aa226ff4a1ce79f229c6b7a4c0a14e17fececd01 upstream.

There are three subsystem callbacks in css shutdown path -
css_offline(), css_released() and css_free().  Except for
css_released(), cgroup core didn't guarantee the order of invocation.
css_offline() or css_free() could be called on a parent css before its
children.  This behavior is unexpected and led to bugs in cpu and
memory controller.

This patch updates offline path so that a parent css is never offlined
before its children.  Each css keeps online_cnt which reaches zero iff
itself and all its children are offline and offline_css() is invoked
only after online_cnt reaches zero.

This fixes the memory controller bug and allows the fix for cpu
controller.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reported-by: Brian Christiansen <brian.o.christiansen@gmail.com>
Link: http://lkml.kernel.org/g/5698A023.9070703@de.ibm.com
Link: http://lkml.kernel.org/g/CAKB58ikDkzc8REt31WBkD99+hxNzjK4+FBmhkgS+NVrC9vjMSg@mail.gmail.com
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-03 15:07:28 -08:00
Tejun Heo
fff4dc84e7 cpuset: make mm migration asynchronous
commit e93ad19d05648397ef3bcb838d26aec06c245dc0 upstream.

If "cpuset.memory_migrate" is set, when a process is moved from one
cpuset to another with a different memory node mask, pages in used by
the process are migrated to the new set of nodes.  This was performed
synchronously in the ->attach() callback, which is synchronized
against process management.  Recently, the synchronization was changed
from per-process rwsem to global percpu rwsem for simplicity and
optimization.

Combined with the synchronous mm migration, this led to deadlocks
because mm migration could schedule a work item which may in turn try
to create a new worker blocking on the process management lock held
from cgroup process migration path.

This heavy an operation shouldn't be performed synchronously from that
deep inside cgroup migration in the first place.  This patch punts the
actual migration to an ordered workqueue and updates cgroup process
migration and cpuset config update paths to flush the workqueue after
all locks are released.  This way, the operations still seem
synchronous to userland without entangling mm migration with process
management synchronization.  CPU hotplug can also invoke mm migration
but there's no reason for it to wait for mm migrations and thus
doesn't synchronize against their completions.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-03 15:07:28 -08:00
Tejun Heo
6684710434 Revert "workqueue: make sure delayed work run in local cpu"
commit 041bd12e272c53a35c54c13875839bcb98c999ce upstream.

This reverts commit 874bbfe600.

Workqueue used to implicity guarantee that work items queued without
explicit CPU specified are put on the local CPU.  Recent changes in
timer broke the guarantee and led to vmstat breakage which was fixed
by 176bed1de5 ("vmstat: explicitly schedule per-cpu work on the CPU
we need it to run on").

vmstat is the most likely to expose the issue and it's quite possible
that there are other similar problems which are a lot more difficult
to trigger.  As a preventive measure, 874bbfe600 ("workqueue: make
sure delayed work run in local cpu") was applied to restore the local
CPU guarnatee.  Unfortunately, the change exposed a bug in timer code
which got fixed by 22b886dd10 ("timers: Use proper base migration in
add_timer_on()").  Due to code restructuring, the commit couldn't be
backported beyond certain point and stable kernels which only had
874bbfe600 started crashing.

The local CPU guarantee was accidental more than anything else and we
want to get rid of it anyway.  As, with the vmstat case fixed,
874bbfe600 is causing more problems than it's fixing, it has been
decided to take the chance and officially break the guarantee by
reverting the commit.  A debug feature will be added to force foreign
CPU assignment to expose cases relying on the guarantee and fixes for
the individual cases will be backported to stable as necessary.

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 874bbfe600 ("workqueue: make sure delayed work run in local cpu")
Link: http://lkml.kernel.org/g/20160120211926.GJ10810@quack.suse.cz
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Henrique de Moraes Holschuh <hmh@hmh.eng.br>
Cc: Daniel Bilik <daniel.bilik@neosystem.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Shaohua Li <shli@fb.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Bilik <daniel.bilik@neosystem.cz>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-03 15:07:27 -08:00
Tejun Heo
21b34b4574 workqueue: handle NUMA_NO_NODE for unbound pool_workqueue lookup
commit d6e022f1d207a161cd88e08ef0371554680ffc46 upstream.

When looking up the pool_workqueue to use for an unbound workqueue,
workqueue assumes that the target CPU is always bound to a valid NUMA
node.  However, currently, when a CPU goes offline, the mapping is
destroyed and cpu_to_node() returns NUMA_NO_NODE.

This has always been broken but hasn't triggered often enough before
874bbfe600 ("workqueue: make sure delayed work run in local cpu").
After the commit, workqueue forcifully assigns the local CPU for
delayed work items without explicit target CPU to fix a different
issue.  This widens the window where CPU can go offline while a
delayed work item is pending causing delayed work items dispatched
with target CPU set to an already offlined CPU.  The resulting
NUMA_NO_NODE mapping makes workqueue try to queue the work item on a
NULL pool_workqueue and thus crash.

While 874bbfe600 has been reverted for a different reason making the
bug less visible again, it can still happen.  Fix it by mapping
NUMA_NO_NODE to the default pool_workqueue from unbound_pwq_by_node().
This is a temporary workaround.  The long term solution is keeping CPU
-> NODE mapping stable across CPU off/online cycles which is being
worked on.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Cc: Len Brown <len.brown@intel.com>
Link: http://lkml.kernel.org/g/1454424264.11183.46.camel@gmail.com
Link: http://lkml.kernel.org/g/1453702100-2597-1-git-send-email-tangchen@cn.fujitsu.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-03 15:07:27 -08:00
Wanpeng Li
d024d46ec5 tick/nohz: Set the correct expiry when switching to nohz/lowres mode
commit 1ca8ec532fc2d986f1f4a319857bb18e0c9739b4 upstream.

commit 0ff53d0964 sets the next tick interrupt to the last jiffies update,
i.e. in the past, because the forward operation is invoked before the set
operation. There is no resulting damage (yet), but we get an extra pointless
tick interrupt.

Revert the order so we get the next tick interrupt in the future.

Fixes: commit 0ff53d0964 "tick: sched: Force tick interrupt and get rid of softirq magic"
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/1453893967-3458-1-git-send-email-wanpeng.li@hotmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-03 15:07:26 -08:00
Jann Horn
df86161e44 seccomp: always propagate NO_NEW_PRIVS on tsync
commit 103502a35cfce0710909da874f092cb44823ca03 upstream.

Before this patch, a process with some permissive seccomp filter
that was applied by root without NO_NEW_PRIVS was able to add
more filters to itself without setting NO_NEW_PRIVS by setting
the new filter from a throwaway thread with NO_NEW_PRIVS.

Signed-off-by: Jann Horn <jann@thejh.net>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-03 15:07:25 -08:00
David Gibson
972e9e3c7f time: Avoid signed overflow in timekeeping_get_ns()
commit 35a4933a895927990772ae96fdcfd2f806929ee2 upstream.

1e75fa8 "time: Condense timekeeper.xtime into xtime_sec" replaced a call to
clocksource_cyc2ns() from timekeeping_get_ns() with an open-coded version
of the same logic to avoid keeping a semi-redundant struct timespec
in struct timekeeper.

However, the commit also introduced a subtle semantic change - where
clocksource_cyc2ns() uses purely unsigned math, the new version introduces
a signed temporary, meaning that if (delta * tk->mult) has a 63-bit
overflow the following shift will still give a negative result.  The
choice of 'maxsec' in __clocksource_updatefreq_scale() means this will
generally happen if there's a ~10 minute pause in examining the
clocksource.

This can be triggered on a powerpc KVM guest by stopping it from qemu for
a bit over 10 minutes.  After resuming time has jumped backwards several
minutes causing numerous problems (jiffies does not advance, msleep()s can
be extended by minutes..).  It doesn't happen on x86 KVM guests, because
the guest TSC is effectively frozen while the guest is stopped, which is
not the case for the powerpc timebase.

Obviously an unsigned (64 bit) overflow will only take twice as long as a
signed, 63-bit overflow.  I don't know the time code well enough to know
if that will still cause incorrect calculations, or if a 64-bit overflow
is avoided elsewhere.

Still, an incorrect forwards clock adjustment will cause less trouble than
time going backwards.  So, this patch removes the potential for
intermediate signed overflow.

Suggested-by: Laurent Vivier <lvivier@redhat.com>
Tested-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-03 15:07:16 -08:00
Richard Cochran
b8175b171f posix-clock: Fix return code on the poll method's error path
commit 1b9f23727abb92c5e58f139e7d180befcaa06fe0 upstream.

The posix_clock_poll function is supposed to return a bit mask of
POLLxxx values.  However, in case the hardware has disappeared (due to
hot plugging for example) this code returns -ENODEV in a futile
attempt to throw an error at the file descriptor level.  The kernel's
file_operations interface does not accept such error codes from the
poll method.  Instead, this function aught to return POLLERR.

The value -ENODEV does, in fact, contain the POLLERR bit (and almost
all the other POLLxxx bits as well), but only by chance.  This patch
fixes code to return a proper bit mask.

Credit goes to Markus Elfring for pointing out the suspicious
signed/unsigned mismatch.

Reported-by: Markus Elfring <elfring@users.sourceforge.net>
igned-off-by: Richard Cochran <richardcochran@gmail.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Julia Lawall <julia.lawall@lip6.fr>
Link: http://lkml.kernel.org/r/1450819198-17420-1-git-send-email-richardcochran@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-03 15:07:15 -08:00
Thomas Gleixner
827685a637 genirq: Validate action before dereferencing it in handle_irq_event_percpu()
commit 570540d50710ed192e98e2f7f74578c9486b6b05 upstream.

commit 71f64340fc changed the handling of irq_desc->action from

CPU 0                   CPU 1
free_irq()              lock(desc)
  lock(desc)            handle_edge_irq()
                        if (desc->action) {
                          handle_irq_event()
                            action = desc->action
                            unlock(desc)
  desc->action = NULL       handle_irq_event_percpu(desc, action)
                              action->xxx
to

CPU 0                   CPU 1
free_irq()              lock(desc)
  lock(desc)            handle_edge_irq()
                        if (desc->action) {
                          handle_irq_event()
                            unlock(desc)
  desc->action = NULL       handle_irq_event_percpu(desc, action)
                              action = desc->action
                              action->xxx

So if free_irq manages to set the action to NULL between the unlock and before
the readout, we happily dereference a null pointer.

We could simply revert 71f64340fc, but we want to preserve the better code
generation. A simple solution is to change the action loop from a do {} while
to a while {} loop.

This is safe because we either see a valid desc->action or NULL. If the action
is about to be removed it is still valid as free_irq() is blocked on
synchronize_irq().

CPU 0                   CPU 1
free_irq()              lock(desc)
  lock(desc)            handle_edge_irq()
                          handle_irq_event(desc)
                            set(INPROGRESS)
                            unlock(desc)
                            handle_irq_event_percpu(desc)
                            action = desc->action
  desc->action = NULL           while (action) {
                                  action->xxx
                                  ...
                                  action = action->next;
  sychronize_irq()
    while(INPROGRESS);      lock(desc)
                            clr(INPROGRESS)
free(action)

That's basically the same mechanism as we have for shared
interrupts. action->next can become NULL while handle_irq_event_percpu()
runs. Either it sees the action or NULL. It does not matter, because action
itself cannot go away before the interrupt in progress flag has been cleared.

Fixes: commit 71f64340fc "genirq: Remove the second parameter from handle_irq_event_percpu()"
Reported-by: zyjzyj2000@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Huang Shijie <shijie.huang@arm.com>
Cc: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1601131224190.3575@nanos
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-03 15:07:11 -08:00
Toshi Kani
bb7d70e1c3 devm_memremap: Fix error value when memremap failed
commit 93f834df9c2d4e362dfdc4b05daa0a4e18814836 upstream.

devm_memremap() returns an ERR_PTR() value in case of error.
However, it returns NULL when memremap() failed.  This causes
the caller, such as the pmem driver, to proceed and oops later.

Change devm_memremap() to return ERR_PTR(-ENXIO) when memremap()
failed.

Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-03 15:07:08 -08:00
Daniel Borkmann
a34f2f9f20 bpf: fix branch offset adjustment on backjumps after patching ctx expansion
[ Upstream commit a1b14d27ed0965838350f1377ff97c93ee383492 ]

When ctx access is used, the kernel often needs to expand/rewrite
instructions, so after that patching, branch offsets have to be
adjusted for both forward and backward jumps in the new eBPF program,
but for backward jumps it fails to account the delta. Meaning, for
example, if the expansion happens exactly on the insn that sits at
the jump target, it doesn't fix up the back jump offset.

Analysis on what the check in adjust_branches() is currently doing:

  /* adjust offset of jmps if necessary */
  if (i < pos && i + insn->off + 1 > pos)
    insn->off += delta;
  else if (i > pos && i + insn->off + 1 < pos)
    insn->off -= delta;

First condition (forward jumps):

  Before:                         After:

  insns[0]                        insns[0]
  insns[1] <--- i/insn            insns[1] <--- i/insn
  insns[2] <--- pos               insns[P] <--- pos
  insns[3]                        insns[P]  `------| delta
  insns[4] <--- target_X          insns[P]   `-----|
  insns[5]                        insns[3]
                                  insns[4] <--- target_X
                                  insns[5]

First case is if we cross pos-boundary and the jump instruction was
before pos. This is handeled correctly. I.e. if i == pos, then this
would mean our jump that we currently check was the patchlet itself
that we just injected. Since such patchlets are self-contained and
have no awareness of any insns before or after the patched one, the
delta is correctly not adjusted. Also, for the second condition in
case of i + insn->off + 1 == pos, means we jump to that newly patched
instruction, so no offset adjustment are needed. That part is correct.

Second condition (backward jumps):

  Before:                         After:

  insns[0]                        insns[0]
  insns[1] <--- target_X          insns[1] <--- target_X
  insns[2] <--- pos <-- target_Y  insns[P] <--- pos <-- target_Y
  insns[3]                        insns[P]  `------| delta
  insns[4] <--- i/insn            insns[P]   `-----|
  insns[5]                        insns[3]
                                  insns[4] <--- i/insn
                                  insns[5]

Second interesting case is where we cross pos-boundary and the jump
instruction was after pos. Backward jump with i == pos would be
impossible and pose a bug somewhere in the patchlet, so the first
condition checking i > pos is okay only by itself. However, i +
insn->off + 1 < pos does not always work as intended to trigger the
adjustment. It works when jump targets would be far off where the
delta wouldn't matter. But, for example, where the fixed insn->off
before pointed to pos (target_Y), it now points to pos + delta, so
that additional room needs to be taken into account for the check.
This means that i) both tests here need to be adjusted into pos + delta,
and ii) for the second condition, the test needs to be <= as pos
itself can be a target in the backjump, too.

Fixes: 9bac3d6d54 ("bpf: allow extended BPF programs access skb fields")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-03 15:07:06 -08:00
Alex Shi
582ee3a96f Merge branch 'linux-linaro-lsk-v4.4' into linux-linaro-lsk-v4.4-android 2016-02-29 10:18:54 +08:00
Luis R. Rodriguez
e2f712dc92 modules: fix modparam async_probe request
commit 4355efbd80482a961cae849281a8ef866e53d55c upstream.

Commit f2411da746 ("driver-core: add driver module
asynchronous probe support") added async probe support,
in two forms:

  * in-kernel driver specification annotation
  * generic async_probe module parameter (modprobe foo async_probe)

To support the generic kernel parameter parse_args() was
extended via commit ecc8617053 ("module: add extra
argument for parse_params() callback") however commit
failed to f2411da746 failed to add the required argument.

This causes a crash then whenever async_probe generic
module parameter is used. This was overlooked when the
form in which in-kernel async probe support was reworked
a bit... Fix this as originally intended.

Cc: Hannes Reinecke <hare@suse.de>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> [minimized]
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-02-25 12:01:25 -08:00
Rusty Russell
a24d9a2fee module: wrapper for symbol name.
commit 2e7bac536106236104e9e339531ff0fcdb7b8147 upstream.

This trivial wrapper adds clarity and makes the following patch
smaller.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-02-25 12:01:25 -08:00
Thomas Gleixner
82e730baa9 itimers: Handle relative timers with CONFIG_TIME_LOW_RES proper
commit 51cbb5242a41700a3f250ecfb48dcfb7e4375ea4 upstream.

As Helge reported for timerfd we have the same issue in itimers. We return
remaining time larger than the programmed relative time to user space in case
of CONFIG_TIME_LOW_RES=y. Use the proper function to adjust the extra time
added in hrtimer_start_range_ns().

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Helge Deller <deller@gmx.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: linux-m68k@lists.linux-m68k.org
Cc: dhowells@redhat.com
Link: http://lkml.kernel.org/r/20160114164159.528222587@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-02-25 12:01:25 -08:00
Thomas Gleixner
1c94da3e74 posix-timers: Handle relative timers with CONFIG_TIME_LOW_RES proper
commit 572c39172684c3711e4a03c9a7380067e2b0661c upstream.

As Helge reported for timerfd we have the same issue in posix timers. We
return remaining time larger than the programmed relative time to user space
in case of CONFIG_TIME_LOW_RES=y. Use the proper function to adjust the extra
time added in hrtimer_start_range_ns().

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Helge Deller <deller@gmx.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: linux-m68k@lists.linux-m68k.org
Cc: dhowells@redhat.com
Link: http://lkml.kernel.org/r/20160114164159.450510905@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-02-25 12:01:25 -08:00
Mateusz Guzik
e5e99792b6 prctl: take mmap sem for writing to protect against others
commit ddf1d398e517e660207e2c807f76a90df543a217 upstream.

An unprivileged user can trigger an oops on a kernel with
CONFIG_CHECKPOINT_RESTORE.

proc_pid_cmdline_read takes mmap_sem for reading and obtains args + env
start/end values. These get sanity checked as follows:
        BUG_ON(arg_start > arg_end);
        BUG_ON(env_start > env_end);

These can be changed by prctl_set_mm. Turns out also takes the semaphore for
reading, effectively rendering it useless. This results in:

  kernel BUG at fs/proc/base.c:240!
  invalid opcode: 0000 [#1] SMP
  Modules linked in: virtio_net
  CPU: 0 PID: 925 Comm: a.out Not tainted 4.4.0-rc8-next-20160105dupa+ #71
  Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
  task: ffff880077a68000 ti: ffff8800784d0000 task.ti: ffff8800784d0000
  RIP: proc_pid_cmdline_read+0x520/0x530
  RSP: 0018:ffff8800784d3db8  EFLAGS: 00010206
  RAX: ffff880077c5b6b0 RBX: ffff8800784d3f18 RCX: 0000000000000000
  RDX: 0000000000000002 RSI: 00007f78e8857000 RDI: 0000000000000246
  RBP: ffff8800784d3e40 R08: 0000000000000008 R09: 0000000000000001
  R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000050
  R13: 00007f78e8857800 R14: ffff88006fcef000 R15: ffff880077c5b600
  FS:  00007f78e884a740(0000) GS:ffff88007b200000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
  CR2: 00007f78e8361770 CR3: 00000000790a5000 CR4: 00000000000006f0
  Call Trace:
    __vfs_read+0x37/0x100
    vfs_read+0x82/0x130
    SyS_read+0x58/0xd0
    entry_SYSCALL_64_fastpath+0x12/0x76
  Code: 4c 8b 7d a8 eb e9 48 8b 9d 78 ff ff ff 4c 8b 7d 90 48 8b 03 48 39 45 a8 0f 87 f0 fe ff ff e9 d1 fe ff ff 4c 8b 7d 90 eb c6 0f 0b <0f> 0b 0f 0b 66 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00
  RIP   proc_pid_cmdline_read+0x520/0x530
  ---[ end trace 97882617ae9c6818 ]---

Turns out there are instances where the code just reads aformentioned
values without locking whatsoever - namely environ_read and get_cmdline.

Interestingly these functions look quite resilient against bogus values,
but I don't believe this should be relied upon.

The first patch gets rid of the oops bug by grabbing mmap_sem for
writing.

The second patch is optional and puts locking around aformentioned
consumers for safety.  Consumers of other fields don't seem to benefit
from similar treatment and are left untouched.

This patch (of 2):

The code was taking the semaphore for reading, which does not protect
against readers nor concurrent modifications.

The problem could cause a sanity checks to fail in procfs's cmdline
reader, resulting in an OOPS.

Note that some functions perform an unlocked read of various mm fields,
but they seem to be fine despite possible modificaton.

Signed-off-by: Mateusz Guzik <mguzik@redhat.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Jarod Wilson <jarod@redhat.com>
Cc: Jan Stancek <jstancek@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Anshuman Khandual <anshuman.linux@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-02-25 12:01:25 -08:00
Thomas Gleixner
acaf84251f futex: Drop refcount if requeue_pi() acquired the rtmutex
commit fb75a4282d0d9a3c7c44d940582c2d226cf3acfb upstream.

If the proxy lock in the requeue loop acquires the rtmutex for a
waiter then it acquired also refcount on the pi_state related to the
futex, but the waiter side does not drop the reference count.

Add the missing free_pi_state() call.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Darren Hart <darren@dvhart.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Bhuvanesh_Surachari@mentor.com
Cc: Andy Lowe <Andy_Lowe@mentor.com>
Link: http://lkml.kernel.org/r/20151219200607.178132067@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-02-25 12:01:23 -08:00
Toshi Kani
30066dcdf9 devm_memremap_release(): fix memremap'd addr handling
commit 9273a8bbf58a15051e53a777389a502420ddc60e upstream.

The pmem driver calls devm_memremap() to map a persistent memory range.
When the pmem driver is unloaded, this memremap'd range is not released
so the kernel will leak a vma.

Fix devm_memremap_release() to handle a given memremap'd address
properly.

Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-02-25 12:01:23 -08:00
Jann Horn
969624b7c1 ptrace: use fsuid, fsgid, effective creds for fs access checks
commit caaee6234d05a58c5b4d05e7bf766131b810a657 upstream.

By checking the effective credentials instead of the real UID / permitted
capabilities, ensure that the calling process actually intended to use its
credentials.

To ensure that all ptrace checks use the correct caller credentials (e.g.
in case out-of-tree code or newly added code omits the PTRACE_MODE_*CREDS
flag), use two new flags and require one of them to be set.

The problem was that when a privileged task had temporarily dropped its
privileges, e.g.  by calling setreuid(0, user_uid), with the intent to
perform following syscalls with the credentials of a user, it still passed
ptrace access checks that the user would not be able to pass.

While an attacker should not be able to convince the privileged task to
perform a ptrace() syscall, this is a problem because the ptrace access
check is reused for things in procfs.

In particular, the following somewhat interesting procfs entries only rely
on ptrace access checks:

 /proc/$pid/stat - uses the check for determining whether pointers
     should be visible, useful for bypassing ASLR
 /proc/$pid/maps - also useful for bypassing ASLR
 /proc/$pid/cwd - useful for gaining access to restricted
     directories that contain files with lax permissions, e.g. in
     this scenario:
     lrwxrwxrwx root root /proc/13020/cwd -> /root/foobar
     drwx------ root root /root
     drwxr-xr-x root root /root/foobar
     -rw-r--r-- root root /root/foobar/secret

Therefore, on a system where a root-owned mode 6755 binary changes its
effective credentials as described and then dumps a user-specified file,
this could be used by an attacker to reveal the memory layout of root's
processes or reveal the contents of files he is not allowed to access
(through /proc/$pid/cwd).

[akpm@linux-foundation.org: fix warning]
Signed-off-by: Jann Horn <jann@thejh.net>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Casey Schaufler <casey@schaufler-ca.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Morris <james.l.morris@oracle.com>
Cc: "Serge E. Hallyn" <serge.hallyn@ubuntu.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Willy Tarreau <w@1wt.eu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-02-25 12:01:16 -08:00
Raghavendra K T
a800cbddbc sched: Fix crash in sched_init_numa()
commit 9c03ee147193645be4c186d3688232fa438c57c7 upstream.

The following PowerPC commit:

  c118baf802 ("arch/powerpc/mm/numa.c: do not allocate bootmem memory for non existing nodes")

avoids allocating bootmem memory for non existent nodes.

But when DEBUG_PER_CPU_MAPS=y is enabled, my powerNV system failed to boot
because in sched_init_numa(), cpumask_or() operation was done on
unallocated nodes.

Fix that by making cpumask_or() operation only on existing nodes.

[ Tested with and w/o DEBUG_PER_CPU_MAPS=y on x86 and PowerPC. ]

Reported-by: Jan Stancek <jstancek@redhat.com>
Tested-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Cc: <gkurz@linux.vnet.ibm.com>
Cc: <grant.likely@linaro.org>
Cc: <nikunj@linux.vnet.ibm.com>
Cc: <vdavydov@parallels.com>
Cc: <linuxppc-dev@lists.ozlabs.org>
Cc: <linux-mm@kvack.org>
Cc: <peterz@infradead.org>
Cc: <benh@kernel.crashing.org>
Cc: <paulus@samba.org>
Cc: <mpe@ellerman.id.au>
Cc: <anton@samba.org>
Link: http://lkml.kernel.org/r/1452884483-11676-1-git-send-email-raghavendra.kt@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-02-17 12:31:02 -08:00
Thomas Gleixner
dd0d511548 hrtimer: Handle remaining time proper for TIME_LOW_RES
commit 203cbf77de59fc8f13502dcfd11350c6d4a5c95f upstream.

If CONFIG_TIME_LOW_RES is enabled we add a jiffie to the relative timeout to
prevent short sleeps, but we do not account for that in interfaces which
retrieve the remaining time.

Helge observed that timerfd can return a remaining time larger than the
relative timeout. That's not expected and breaks userland test programs.

Store the information that the timer was armed relative and provide functions
to adjust the remaining time. To avoid bloating the hrtimer struct make state
a u8, which as a bonus results in better code on x86 at least.

Reported-and-tested-by: Helge Deller <deller@gmx.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: linux-m68k@lists.linux-m68k.org
Cc: dhowells@redhat.com
Link: http://lkml.kernel.org/r/20160114164159.273328486@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-02-17 12:30:57 -08:00
Tejun Heo
a623f87a72 printk: do cond_resched() between lines while outputting to consoles
commit 8d91f8b15361dfb438ab6eb3b319e2ded43458ff upstream.

@console_may_schedule tracks whether console_sem was acquired through
lock or trylock.  If the former, we're inside a sleepable context and
console_conditional_schedule() performs cond_resched().  This allows
console drivers which use console_lock for synchronization to yield
while performing time-consuming operations such as scrolling.

However, the actual console outputting is performed while holding
irq-safe logbuf_lock, so console_unlock() clears @console_may_schedule
before starting outputting lines.  Also, only a few drivers call
console_conditional_schedule() to begin with.  This means that when a
lot of lines need to be output by console_unlock(), for example on a
console registration, the task doing console_unlock() may not yield for
a long time on a non-preemptible kernel.

If this happens with a slow console devices, for example a serial
console, the outputting task may occupy the cpu for a very long time.
Long enough to trigger softlockup and/or RCU stall warnings, which in
turn pile more messages, sometimes enough to trigger the next cycle of
warnings incapacitating the system.

Fix it by making console_unlock() insert cond_resched() between lines if
@console_may_schedule.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Calvin Owens <calvinowens@fb.com>
Acked-by: Jan Kara <jack@suse.com>
Cc: Dave Jones <davej@codemonkey.org.uk>
Cc: Kyle McMartin <kyle@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-02-17 12:30:57 -08:00
Steven Rostedt
bf46aa7ec7 tracing/stacktrace: Show entire trace if passed in function not found
commit 6ccd83714a009ee301b50c15f6c3a5dc1f30164c upstream.

When a max stack trace is discovered, the stack dump is saved. In order to
not record the overhead of the stack tracer, the ip of the traced function
is looked for within the dump. The trace is started from the location of
that function. But if for some reason the ip is not found, the entire stack
trace is then truncated. That's not very useful. Instead, print everything
if the ip of the traced function is not found within the trace.

This issue showed up on s390.

Link: http://lkml.kernel.org/r/20160129102241.1b3c9c04@gandalf.local.home

Fixes: 72ac426a5b ("tracing: Clean up stack tracing and fix fentry updates")
Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Tested-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-02-17 12:30:57 -08:00
Steven Rostedt (Red Hat)
cc6d98006b tracing: Fix stacktrace skip depth in trace_buffer_unlock_commit_regs()
commit 7717c6be699975f6733d278b13b7c4295d73caf6 upstream.

While cleaning the stacktrace code I unintentially changed the skip depth of
trace_buffer_unlock_commit_regs() from 0 to 6. kprobes uses this function,
and with skipping 6 call backs, it can easily produce no stack.

Here's how I tested it:

 # echo 'p:ext4_sync_fs ext4_sync_fs ' > /sys/kernel/debug/tracing/kprobe_events
 # echo 1 > /sys/kernel/debug/tracing/events/kprobes/enable
 # cat /sys/kernel/debug/trace
            sync-2394  [005]   502.457060: ext4_sync_fs: (ffffffff81317650)
            sync-2394  [005]   502.457063: kernel_stack:         <stack trace>
            sync-2394  [005]   502.457086: ext4_sync_fs: (ffffffff81317650)
            sync-2394  [005]   502.457087: kernel_stack:         <stack trace>
            sync-2394  [005]   502.457091: ext4_sync_fs: (ffffffff81317650)

After putting back the skip stack to zero, we have:

            sync-2270  [000]   748.052693: ext4_sync_fs: (ffffffff81317650)
            sync-2270  [000]   748.052695: kernel_stack:         <stack trace>
 => iterate_supers (ffffffff8126412e)
 => sys_sync (ffffffff8129c4b6)
 => entry_SYSCALL_64_fastpath (ffffffff8181f0b2)
            sync-2270  [000]   748.053017: ext4_sync_fs: (ffffffff81317650)
            sync-2270  [000]   748.053019: kernel_stack:         <stack trace>
 => iterate_supers (ffffffff8126412e)
 => sys_sync (ffffffff8129c4b6)
 => entry_SYSCALL_64_fastpath (ffffffff8181f0b2)
            sync-2270  [000]   748.053381: ext4_sync_fs: (ffffffff81317650)
            sync-2270  [000]   748.053383: kernel_stack:         <stack trace>
 => iterate_supers (ffffffff8126412e)
 => sys_sync (ffffffff8129c4b6)
 => entry_SYSCALL_64_fastpath (ffffffff8181f0b2)

Fixes: 73dddbb57b "tracing: Only create stacktrace option when STACKTRACE is configured"
Reported-by: Brendan Gregg <brendan.d.gregg@gmail.com>
Tested-by: Brendan Gregg <brendan.d.gregg@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-02-17 12:30:56 -08:00
Amit Pundir
02bbd06e48 Merge branch 'android-4.4' of https://android.googlesource.com/kernel/common
* android-4.4: (475 commits)
  android: base-cfg: Add CONFIG_IP_MULTICAST
  android: recommended.cfg: enable taskstats
  ANDROID: android: base-cfg: disable CONFIG_SYSVIPC
  android: configs: base: enable configfs gadget functions
  android: add CONFIG_DEBUG_RODATA to recommended config
  android: configs: remove CONFIG_BATTERY_ANDROID=y
  android: configs: base: enable IPV6
  android: configs: Enable SELinux and its dependencies.
  android: base-cfg: disable ALARM_DEV
  android: base-cfg: turn off /dev/mem and /dev/kmem
  android: base-cfg: enable ARMV8_DEPRECATED and subfeatures
  android: base-cfg: enforce the needed XFRM_MODE_TUNNEL (for VPN)
  android: base-cfg: disable LOGGER
  android: base-cfg: enable DM_VERITY (used for secureboot)
  android: configs: add systrace support to recommended configs
  android: configs: update 3.10 options
  android: configs: Add CONFIG_NETFILTER_XT_TARGET_IDLETIMER
  android: configs: add IPV6 ROUTE INFO
  android: configs: add TIMER_STATS back, helps with sysrq t.
  android: configs: Add HIDRAW to recommended set
  ...
2016-02-18 00:20:12 +05:30
Colin Cross
55d35d82b7 hardlockup: detect hard lockups without NMIs using secondary cpus
Emulate NMIs on systems where they are not available by using timer
interrupts on other cpus.  Each cpu will use its softlockup hrtimer
to check that the next cpu is processing hrtimer interrupts by
verifying that a counter is increasing.

This patch is useful on systems where the hardlockup detector is not
available due to a lack of NMIs, for example most ARM SoCs.
Without this patch any cpu stuck with interrupts disabled can
cause a hardware watchdog reset with no debugging information,
but with this patch the kernel can detect the lockup and panic,
which can result in useful debugging info.

Change-Id: Ia5faf50243e19c1755201212e04c8892d929785a
Signed-off-by: Colin Cross <ccross@android.com>
2016-02-16 13:54:19 -08:00
dcashman
d49d88766b FROMLIST: mm: mmap: Add new /proc tunable for mmap_base ASLR.
(cherry picked from commit https://lkml.org/lkml/2015/12/21/337)

ASLR  only uses as few as 8 bits to generate the random offset for the
mmap base address on 32 bit architectures. This value was chosen to
prevent a poorly chosen value from dividing the address space in such
a way as to prevent large allocations. This may not be an issue on all
platforms. Allow the specification of a minimum number of bits so that
platforms desiring greater ASLR protection may determine where to place
the trade-off.

Bug: 24047224
Signed-off-by: Daniel Cashman <dcashman@android.com>
Signed-off-by: Daniel Cashman <dcashman@google.com>
Change-Id: Ibf9ed3d4390e9686f5cc34f605d509a20d40e6c2
2016-02-16 13:54:14 -08:00
Amit Pundir
29a4f01daa mm: private anonymous memory build fixes for 4.4
Update vma_merge() call in private anonymous memory prctl,
introduced in AOSP commit ee8c5f78f09a
"mm: add a field to store names for private anonymous memory",
so as to align with changes from upstream commit 19a809afe2
"userfaultfd: teach vma_merge to merge across vma->vm_userfaultfd_ctx".

Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
2016-02-16 13:54:13 -08:00
Colin Cross
586278d78b mm: add a field to store names for private anonymous memory
Userspace processes often have multiple allocators that each do
anonymous mmaps to get memory.  When examining memory usage of
individual processes or systems as a whole, it is useful to be
able to break down the various heaps that were allocated by
each layer and examine their size, RSS, and physical memory
usage.

This patch adds a user pointer to the shared union in
vm_area_struct that points to a null terminated string inside
the user process containing a name for the vma.  vmas that
point to the same address will be merged, but vmas that
point to equivalent strings at different addresses will
not be merged.

Userspace can set the name for a region of memory by calling
prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned long)name);
Setting the name to NULL clears it.

The names of named anonymous vmas are shown in /proc/pid/maps
as [anon:<name>] and in /proc/pid/smaps in a new "Name" field
that is only present for named vmas.  If the userspace pointer
is no longer valid all or part of the name will be replaced
with "<fault>".

The idea to store a userspace pointer to reduce the complexity
within mm (at the expense of the complexity of reading
/proc/pid/mem) came from Dave Hansen.  This results in no
runtime overhead in the mm subsystem other than comparing
the anon_name pointers when considering vma merging.  The pointer
is stored in a union with fieds that are only used on file-backed
mappings, so it does not increase memory usage.

Includes fix from Jed Davis <jld@mozilla.com> for typo in
prctl_set_vma_anon_name, which could attempt to set the name
across two vmas at the same time due to a typo, which might
corrupt the vma list.  Fix it to use tmp instead of end to limit
the name setting to a single vma at a time.

Change-Id: I9aa7b6b5ef536cd780599ba4e2fba8ceebe8b59f
Signed-off-by: Dmitry Shmidt <dimitrysh@google.com>
2016-02-16 13:54:13 -08:00
Rik van Riel
f8ade3666c add extra free kbytes tunable
Add a userspace visible knob to tell the VM to keep an extra amount
of memory free, by increasing the gap between each zone's min and
low watermarks.

This is useful for realtime applications that call system
calls and have a bound on the number of allocations that happen
in any short time period.  In this application, extra_free_kbytes
would be left at an amount equal to or larger than than the
maximum number of allocations that happen in any burst.

It may also be useful to reduce the memory use of virtual
machines (temporarily?), in a way that does not cause memory
fragmentation like ballooning does.

[ccross]
Revived for use on old kernels where no other solution exists.
The tunable will be removed on kernels that do better at avoiding
direct reclaim.

Change-Id: I765a42be8e964bfd3e2886d1ca85a29d60c3bb3e
Signed-off-by: Rik van Riel<riel@redhat.com>
Signed-off-by: Colin Cross <ccross@android.com>
2016-02-16 13:54:12 -08:00
Arve Hjønnevåg
2774238025 ARM: Fix "Make low-level printk work" to use a separate config option
Change-Id: I5ca8db61b595adc642a07ea187bd41fd7636840e
Signed-off-by: Arve Hjønnevåg <arve@android.com>
2016-02-16 13:54:03 -08:00
Nishanth Menon
4e09c51018 panic: Add board ID to panic output
At times, it is necessary for boards to provide some additional information
as part of panic logs. Provide information on the board hardware as part
of panic logs.

It is safer to print this information at the very end in case something
bad happens as part of the information retrieval itself.

To use this, set global mach_panic_string to an appropriate string in the
board file.

Change-Id: Id12cdda87b0cd2940dd01d52db97e6162f671b4d
Signed-off-by: Nishanth Menon <nm@ti.com>
2016-02-16 13:54:02 -08:00
Tony Lindgren
3200304ca3 ARM: Make low-level printk work
Makes low-level printk work.

Signed-off-by: Tony Lindgren <tony@atomide.com>
2016-02-16 13:54:01 -08:00
San Mehat
9d19f72b43 proc: smaps: Allow smaps access for CAP_SYS_RESOURCE
Signed-off-by: San Mehat <san@google.com>
2016-02-16 13:53:50 -08:00
Micha Kalfon
d4d049c55d prctl: make PR_SET_TIMERSLACK_PID pid namespace aware
Make PR_SET_TIMERSLACK_PID consider pid namespace and resolve the
target pid in the caller's namespace. Otherwise, calls from pid
namespace other than init would fail or affect the wrong task.

Change-Id: I1da15196abc4096536713ce03714e99d2e63820a
Signed-off-by: Micha Kalfon <micha@cellrox.com>
Acked-by: Oren Laadan <orenl@cellrox.com>
2016-02-16 13:53:49 -08:00
Micha Kalfon
18f42f60be prctl: fix misplaced PR_SET_TIMERSLACK_PID case
The case clause for the PR_SET_TIMERSLACK_PID option was placed inside
the an internal switch statement for PR_MCE_KILL (see commits 37a591d4
and 8ae872f1) . This commit moves it to the right place.

Change-Id: I63251669d7e2f2aa843d1b0900e7df61518c3dea
Signed-off-by: Micha Kalfon <micha@cellrox.com>
Acked-by: Oren Laadan <orenl@cellrox.com>
2016-02-16 13:53:48 -08:00
Ruchi Kandoi
2476d3c241 prctl: adds the capable(CAP_SYS_NICE) check to PR_SET_TIMERSLACK_PID.
Adds a capable() check to make sure that arbitary apps do not change the
timer slack for other apps.

Bug: 15000427
Change-Id: I558a2551a0e3579c7f7e7aae54b28aa9d982b209
Signed-off-by: Ruchi Kandoi <kandoiruchi@google.com>
2016-02-16 13:53:48 -08:00
Ruchi Kandoi
f2902f9065 prctl: adds PR_SET_TIMERSLACK_PID for setting timer slack of an arbitrary thread.
Second argument is similar to PR_SET_TIMERSLACK, if non-zero then the
slack is set to that value otherwise sets it to the default for the thread.

Takes PID of the thread as the third argument.

This allows power/performance management software to set timer slack for
other threads according to its policy for the thread (such as when the
thread is designated foreground vs. background activity)

Change-Id: I744d451ff4e60dae69f38f53948ff36c51c14a3f
Signed-off-by: Ruchi Kandoi <kandoiruchi@google.com>
2016-02-16 13:53:47 -08:00
Amit Pundir
703920c14a cgroup: refactor allow_attach handler for 4.4
Refactor *allow_attach() handler to align it with the changes
from mainline commit 1f7dd3e5a6 "cgroup: fix handling of
multi-destination migration from subtree_control enabling".

Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
2016-02-16 13:53:46 -08:00
Dmitry Shmidt
69db8fca42 cgroup: fix cgroup_taskset_for_each call in allow_attach() for 4.1
Change-Id: I05013f6e76c30b0ece3671f9f2b4bbdc626cd35c
Signed-off-by: Dmitry Shmidt <dimitrysh@google.com>
2016-02-16 13:53:46 -08:00
Christian Poetzsch
f4adb71017 Fix generic cgroup subsystem permission checks
In 53b5e2f generic cgroup subsystem permission checks have been added.
When this is been done within procs_write an empty taskset is added to
the tasks css set. When a task later on migrates to a new group we see a
dmesg warning cause the mg_node isn't empty (cgroup.c:2086). Cause this
happens all the time this spams dmesg.

I am not really familiar with this code, but it looks to me like adding
the taskset is just a temporary action in this context. Therefore this
taskset should be removed after the actual check. This is what this fix
does.

This problem was seen and the fix tested on x86 using l-mr1 and master.

Change-Id: I9894d39e8b5692ef65149002b07e65a84a33ffea
Signed-off-by: Christian Poetzsch <christian.potzsch@imgtec.com>
2016-02-16 13:53:45 -08:00
Rom Lemarchand
0d6687a82a cgroup: Fix issues in allow_attach callback
- Return -EINVAL when cgroups support isn't enabled
- Add allow_attach callback in CPU cgroups

Change-Id: Id3360b4a39919524fc4b6fcbd44fa2050009f000
Signed-off-by: Rom Lemarchand <romlem@android.com>
2016-02-16 13:53:44 -08:00
Colin Cross
1811046286 cgroup: Add generic cgroup subsystem permission checks
Rather than using explicit euid == 0 checks when trying to move
tasks into a cgroup via CFS, move permission checks into each
specific cgroup subsystem. If a subsystem does not specify a
'allow_attach' handler, then we fall back to doing our checks
the old way.

Use the 'allow_attach' handler for the 'cpu' cgroup to allow
non-root processes to add arbitrary processes to a 'cpu' cgroup
if it has the CAP_SYS_NICE capability set.

This version of the patch adds a 'allow_attach' handler instead
of reusing the 'can_attach' handler.  If the 'can_attach' handler
is reused, a new cgroup that implements 'can_attach' but not
the permission checks could end up with no permission checks
at all.

Change-Id: Icfa950aa9321d1ceba362061d32dc7dfa2c64f0c
Original-Author: San Mehat <san@google.com>
Signed-off-by: Colin Cross <ccross@android.com>
2016-02-16 13:53:43 -08:00
Rom Lemarchand
6809864a2c cgroup: refactor allow_attach function into common code
move cpu_cgroup_allow_attach to a common subsys_cgroup_allow_attach.
This allows any process with CAP_SYS_NICE to move tasks across cgroups if
they use this function as their allow_attach handler.

Bug: 18260435
Change-Id: I6bb4933d07e889d0dc39e33b4e71320c34a2c90f
Signed-off-by: Rom Lemarchand <romlem@android.com>
2016-02-16 13:53:42 -08:00
Ruchi Kandoi
1f027b30b5 wakeup_reason: use vsnprintf instead of snsprintf for vargs.
Bug: 22368519
Signed-off-by: Ruchi Kandoi <kandoiruchi@google.com>
2016-02-16 13:53:42 -08:00