In order to make the topology construction fully dynamic, remove the
still hard-coded list of possible domains and move them into a
data structure.
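A minimal sketch of the kind of table this introduces (the type and
function names here are illustrative assumptions, not necessarily the
ones used by the patch):

	struct sched_domain_topology_level {
		sched_domain_init_f init;			/* per-level sd initializer */
		const struct cpumask *(*mask)(int cpu);		/* span of this level */
	};

	static struct sched_domain_topology_level default_topology[] = {
	#ifdef CONFIG_SCHED_SMT
		{ sd_init_SIBLING, cpu_smt_mask, },
	#endif
	#ifdef CONFIG_SCHED_MC
		{ sd_init_MC, cpu_coregroup_mask, },
	#endif
		{ sd_init_CPU, cpu_cpu_mask, },
		{ NULL, },
	};

The domain builder can then simply walk this table instead of
open-coding each possible level.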
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20110407122942.770335383@chello.nl
Signed-off-by: Ingo Molnar <mingo@elte.hu>
In order to further unify sched domain creation, create proper
cpu_$DOM_mask() functions for those domains that didn't already have
one.
Use the sched_domains_tmpmask for the weird NUMA domain span.
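For example, the per-node 'CPU' domain can get a trivial mask function
along these lines (a sketch; cpumask_of_node() and cpu_to_node() are
existing kernel helpers):

	static const struct cpumask *cpu_cpu_mask(int cpu)
	{
		return cpumask_of_node(cpu_to_node(cpu));
	}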
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20110407122942.717702108@chello.nl
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Since we're all serialized by sched_domains_mutex we can use
sched_domains_tmpmask and avoid having to do allocations. This means
we can use sched_domains_debug() for cpu_attach_domain() again.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20110407122942.664347467@chello.nl
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Since sched domain creation is fully serialized by the
sched_domains_mutex we can create a single persistent tmpmask to use
during domain creation.
This removes the need for s_data::send_covered.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20110407122942.607287405@chello.nl
Signed-off-by: Ingo Molnar <mingo@elte.hu>
There's only one nodemask user left, so replace it with a direct
computation, saving some memory and reducing code-flow
complexity.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20110407122942.505608966@chello.nl
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Don't treat ALLNODES/NODE differently for difference's sake. Simply
always create the ALLNODES domain and let the sd_degenerate() checks
kill it when it's redundant. This simplifies the code flow.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20110407122942.455464579@chello.nl
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Don't use sd->level for identifying properties of the domain.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20110407122942.350174079@chello.nl
Signed-off-by: Ingo Molnar <mingo@elte.hu>
If we check the root_domain reference count we can see whether it has
been used or not; use this observation to simplify some of the return
paths.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20110407122942.298339503@chello.nl
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Instead of relying on static allocations for the sched_domain and
sched_group trees, dynamically allocate and RCU free them.
Allocating this dynamically also allows for some build_sched_groups()
simplification since we can now (like with other simplifications) rely
on the sched_domain tree instead of hard-coded knowledge.
One tricky thing to note is that detach_destroy_domains() needs to hold
rcu_read_lock() over the entire tear-down; per-cpu is not sufficient
since that can lead to partial sched_group existence (this could possibly
be solved by doing the tear-down backwards, but this is much more
robust).
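The resulting tear-down then looks roughly like this (a sketch based on
the description above):

	static void detach_destroy_domains(const struct cpumask *cpu_map)
	{
		int i;

		rcu_read_lock();	/* held over the entire tear-down, not per-cpu */
		for_each_cpu(i, cpu_map)
			cpu_attach_domain(NULL, &def_root_domain, i);
		rcu_read_unlock();
	}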
A consequence of the above is that we can no longer print the
sched_domain debug stuff from cpu_attach_domain() since that might now
run with preemption disabled (due to classic RCU etc.) and
sched_domain_debug() does some GFP_KERNEL allocations.
Another thing to note is that we now fully rely on normal RCU and not
RCU-sched; this is because with the new and exciting RCU flavours we
grew over the years BH doesn't necessarily hold off RCU-sched grace
periods (-rt is known to break this). This would in fact already cause
us grief since we do sched_domain/sched_group iterations from softirq
context.
This patch is somewhat larger than I would like it to be, but I didn't
find any means of shrinking/splitting this.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20110407122942.245307941@chello.nl
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Again, instead of relying on knowing the possible domains and their
order, simply rely on the sched_domain tree and whatever domains are
present in there to initialize the sched_group cpu_power.
Note: we need to iterate the CPU mask backwards because of the
cpumask_first() condition for iterating up the tree. By iterating the
mask backwards we ensure all groups of a domain are set up before
starting on the parent groups that rely on their children being
completely done.
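A sketch of the backwards iteration described above (the d.sd per-cpu
pointer is assumed from the surrounding domain-builder context):

	/* Initialize group power bottom-up: all children before their parents. */
	for (i = nr_cpumask_bits - 1; i >= 0; i--) {
		if (!cpumask_test_cpu(i, cpu_map))
			continue;

		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent)
			init_sched_groups_power(i, sd);
	}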
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20110407122942.187335414@chello.nl
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Instead of relying on knowing the build order and various CONFIG_
flags, simply remember the bottom-most sched_domain when we created the
domain hierarchy.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20110407122942.134511046@chello.nl
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Instead of calling build_sched_groups() for each possible sched_domain
we might have created, note that we can simply iterate the
sched_domain tree and call it for each sched_domain present.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20110407122942.077862519@chello.nl
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The NODE sched_domain is 'special' in that it allocates sched_groups
per CPU, instead of sharing the sched_groups between all CPUs.
While this might have some benefit on large NUMA machines and avoid
remote memory accesses when iterating the sched_groups, it does break
current code that assumes sched_groups are shared between all
sched_domains (since the dynamic cpu_power patches).
So refactor the NODE groups to behave like all other groups.
(The ALLNODES domain again shared its groups across the CPUs for some
reason).
If someone does measure a performance decrease due to this change we
need to revisit this and come up with another way to have both dynamic
cpu_power and NUMA work nicely together.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20110407122941.978111700@chello.nl
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Notice that the mask being computed is the same as the domain span we
just computed. By using the domain_span we can avoid some mask
allocations and computations.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20110407122941.925028189@chello.nl
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The code in update_group_power() does what init_sched_groups_power()
does and more, so remove the special init_ code and call the generic
code instead.
Also move the sd->span_weight initialization because
update_group_power() needs it.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20110407122941.875856012@chello.nl
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Non-weak static functions are clearly not arch-specific, so remove the
arch_ prefix.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20110407122941.820460566@chello.nl
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The scheduler load balancer has specific code to deal with cases of an
unbalanced system due to lots of unmovable tasks (for example because of
hard CPU affinity). In those situations, it excludes the busiest CPU that
has pinned tasks from load-balance consideration so that it can perform
a second load-balance pass on the rest of the system.
This all works as designed if there is only one cgroup in the system.
However, when we have multiple cgroups, this logic has false positives and
triggers multiple load-balance passes despite there actually being no
pinned tasks at all.
The reason for the false positives is that the all-pinned logic sits deep
in the lowest-level function, can_migrate_task(), and is too low level:
load_balance_fair() iterates over each task group and calls balance_tasks()
to migrate the target load. Along the way, balance_tasks() will also set an
all_pinned variable. Given that task groups are iterated, this all_pinned
variable is essentially the status of the last group in the scanning process.
A task group can have a number of reasons why no load was migrated, none of
them due to CPU affinity. However, this status bit is propagated back up
to the higher-level load_balance(), which incorrectly thinks that no tasks
were moved. It kicks off the all-pinned logic and starts multiple passes
attempting to move load onto the puller CPU.
To fix this, move the all_pinned aggregation up to the iterator level.
This ensures that the status is aggregated over all task groups, not just
the last one in the list.
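Illustratively, the aggregation might look like the sketch below (a hedged
outline, not the literal patch; the exact balance_tasks() arguments are
elided):

	/* In load_balance_fair(): aggregate over all task groups instead of
	 * keeping only the status of the last group scanned. */
	*all_pinned = 1;

	for_each_leaf_cfs_rq(busiest, cfs_rq) {
		int group_pinned = 1;

		load_moved += balance_tasks(..., &group_pinned, ...);
		if (!group_pinned)
			*all_pinned = 0;	/* this group had movable tasks */
	}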
Signed-off-by: Ken Chen <kenchen@google.com>
Cc: stable@kernel.org
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/BANLkTi=ernzNawaR5tJZEsV_QVnfxqXmsQ@mail.gmail.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
In find_busiest_group(), the sched-domain avg_load isn't calculated at
all if there is a group imbalance within the domain. This causes an
erroneous imbalance calculation.
The reason is that calculate_imbalance() sees sds->avg_load = 0 and
dumps the entire sds->max_load into the imbalance variable, which is used
later on to migrate the entire load from the busiest CPU to the puller CPU.
This has two really bad effects:
1. stampede of task migration, and they won't be able to break out
of the bad state because of positive feedback loop: large load
delta -> heavier load migration -> larger imbalance and the cycle
goes on.
2. severe imbalance in CPU queue depth. This causes really long
scheduling latency blips which badly affect applications that
have tight latency requirements.
The fix is to have the kernel calculate the domain avg_load in both cases.
This ensures that the imbalance calculation is always sensible and the
target is usually halfway between the busiest and the puller CPU.
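In find_busiest_group() this roughly amounts to computing the average
unconditionally before the group-imbalance shortcut (a sketch using the
existing avg_load formula):

	sds.avg_load = (SCHED_LOAD_SCALE * sds.total_load) / sds.total_pwr;

	/*
	 * Only after avg_load is known is it safe to short-circuit on a
	 * group imbalance; calculate_imbalance() relies on avg_load != 0.
	 */
	if (sds.group_imb)
		goto force_balance;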
Signed-off-by: Ken Chen <kenchen@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: <stable@kernel.org>
Link: http://lkml.kernel.org/r/20110408002322.3A0D812217F@elm.corp.google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
There is a bug in perf_event_enable_on_exec() when cgroup events are
active on a CPU: the cgroup events may be scheduled twice causing event
state corruptions which eventually may lead to kernel panics.
The reason is that the function needs to first schedule out the cgroup
events, just like for the per-thread events. The cgroup events are
scheduled back in automatically from the perf_event_context_sched_in()
function.
The patch also adds a WARN_ON_ONCE() in perf_cgroup_switch() to catch any
bogus state.
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110406005454.GA1062@quad
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86-32, fpu: Fix FPU exception handling on non-SSE systems
x86, hibernate: Initialize mmu_cr4_features during boot
x86-32, NUMA: Fix ACPI NUMA init broken by recent x86-64 change
x86: visws: Fixup irq overhaul fallout
* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
sched: Clean up rebalance_domains() load-balance interval calculation
* 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86/mrst/vrtc: Fix boot crash in mrst_rtc_init()
rtc, x86/mrst/vrtc: Fix boot crash in rtc_read_alarm()
* 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
genirq: Fix cpumask leak in __setup_irq()
* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
perf probe: Fix listing incorrect line number with inline function
perf probe: Fix to find recursively inlined function
perf probe: Fix multiple --vars options behavior
perf probe: Fix to remove redundant close
perf probe: Fix to ensure function declared file
Instead of the possible multiple-evaluation of num_online_cpus()
in rebalance_domains() that Linus reported, avoid it altogether
in the normal case since it's implemented with a Hamming weight
function over a cpu bitmask which can be darn expensive for those
with big iron.
This also makes it cleaner, smaller and documents the code.
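A sketch of the cleaned-up calculation: cache the maximum interval, only
recompute it on CPU hotplug, and clamp the per-domain interval against it
(names follow the description above and should be treated as illustrative):

	static unsigned long max_load_balance_interval = HZ/10;

	/* Called from the CPU hotplug notifier instead of every rebalance pass. */
	static void update_max_interval(void)
	{
		max_load_balance_interval = HZ * num_online_cpus() / 10;
	}

	/* In rebalance_domains(): */
		interval = clamp(interval, 1UL, max_load_balance_interval);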
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1301991265.2225.12.camel@twins>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add kernel-doc to syscalls in signal.c.
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
General coding style and comment fixes; no code changes:
- Use multi-line-comment coding style.
- Put some function signatures completely on one line.
- Hyphenate some words.
- Spell Posix as POSIX.
- Correct typos & spellos in some comments.
- Drop trailing whitespace.
- End sentences with periods.
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Introduce:
static __always_inline bool static_branch(struct jump_label_key *key);
instead of the old JUMP_LABEL(key, label) macro.
In this way, jump labels become really easy to use:
Define:
struct jump_label_key jump_key;
Can be used as:
if (static_branch(&jump_key))
do unlikely code
enable/disable via:
jump_label_inc(&jump_key);
jump_label_dec(&jump_key);
that's it!
For the jump labels disabled case, the static_branch() becomes an
atomic_read(), and jump_label_inc()/dec() are simply atomic_inc(),
atomic_dec() operations. We show testing results for this change below.
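For reference, the disabled/fallback case is essentially a plain atomic
test (sketch):

	struct jump_label_key {
		atomic_t enabled;
	};

	static __always_inline bool static_branch(struct jump_label_key *key)
	{
		if (unlikely(atomic_read(&key->enabled)))
			return true;
		return false;
	}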
Thanks to H. Peter Anvin for suggesting the 'static_branch()' construct.
Since we now require a 'struct jump_label_key *key', we can store a pointer into
the jump table addresses. In this way, we can enable/disable jump labels, in
basically constant time. This change allows us to completely remove the previous
hashtable scheme. Thanks to Peter Zijlstra for this re-write.
Testing:
I ran a series of 'tbench 20' runs 5 times (with reboots) for 3
configurations, where tracepoints were disabled.
jump label configured in
avg: 815.6
jump label *not* configured in (using atomic reads)
avg: 800.1
jump label *not* configured in (regular reads)
avg: 803.4
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20110316212947.GA8792@redhat.com>
Signed-off-by: Jason Baron <jbaron@redhat.com>
Suggested-by: H. Peter Anvin <hpa@linux.intel.com>
Tested-by: David Daney <ddaney@caviumnetworks.com>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Running the following commands:
# enable the binary option
echo 1 > ./options/bin
# disable context info option
echo 0 > ./options/context-info
# tracing only events
echo 1 > ./events/enable
cat trace_pipe
plus forcing the system to generate many tracing events, causes a
lockup (in non-preemptive kernels) inside the tracing_read_pipe
function.
The issue is also easily reproduced by running ltp stress test.
(ftrace_stress_test.sh)
The reasons are:
- bin/hex/raw output functions for events are set to the
trace_nop_print function, which prints nothing and
returns the TRACE_TYPE_HANDLED value
- the LOST EVENT trace does not handle trace_seq overflow
These reasons force the while loop in the tracing_read_pipe
function never to break.
The attached patch fixes the handling of the lost event trace, and
changes trace_nop_print to print minimal info, which is needed
for correct tracing_read_pipe processing.
v2 changes:
- omit the cond_resched changes by trace_nop_print changes
- WARN changed to WARN_ONCE and added info to be able
to find out the culprit
v3 changes:
- make more accurate patch comment
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
LKML-Reference: <20110325110518.GC1922@jolsa.brq.redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The file debugfs/tracing/printk_formats maps the addresses
to the formats that are used by trace_bprintk() so that userspace
tools can read the buffer and be able to decode trace_bprintk events
to get the format saved when reading the ring buffer directly.
This is because trace_bprintk() does not store the format into the
buffer, but just the address of the format, which is hidden in
kernel memory.
But currently it only exports trace_bprintk() formats from the kernel core
and not from modules. Modules need their formats exported
as well.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The trace_printk() formats for modules do not show up in the
debugfs/tracing/printk_formats file; only the formats for
trace_printk()s that are in the kernel core do.
To facilitate the change to add trace_printk() formats from modules
into that file as well, we need to convert the structure that
holds the formats from char fmt[], into const char *fmt,
and allocate them separately.
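That is, roughly (a sketch of the converted structure):

	struct trace_bprintk_fmt {
		struct list_head list;
		const char *fmt;	/* was: char fmt[]; now allocated separately */
	};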
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The ADJ_SETOFFSET bit added in commit 094aa188 ("ntp: Add ADJ_SETOFFSET
mode bit") also introduced a way for any user to change the system time.
Sneaky or buggy calls to adjtimex() could set
ADJ_OFFSET_SS_READ | ADJ_SETOFFSET
which would result in a successful call to timekeeping_inject_offset().
This patch fixes the issue by adding the capability check.
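The added check is along these lines (a sketch of the do_adjtimex() path):

	if (txc->modes & ADJ_SETOFFSET) {
		/* Injecting a time offset is equivalent to setting the clock. */
		if (!capable(CAP_SYS_TIME))
			return -EPERM;
	}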
Signed-off-by: Richard Cochran <richard.cochran@omicron.at>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The CAP_INIT macros of INH, BSET, and EFF made sense at one point in time,
but nowadays they aren't helping. Just open-code the logic in the
init_cred.
Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>
Unused code. Clean it up.
Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: David Howells <dhowells@redhat.com>
Acked-by: Andrew G. Morgan <morgan@kernel.org>
Signed-off-by: James Morris <jmorris@namei.org>
In olden days of yore CAP_SETPCAP had special meaning for the init task.
We actually have code to make sure that CAP_SETPCAP wasn't in pE of things
using the init_cred. But CAP_SETPCAP isn't so special any more and we
don't have a reason to special case dropping it for init or kthreads....
Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Andrew G. Morgan <morgan@kernel.org>
Signed-off-by: James Morris <jmorris@namei.org>
There is no way to limit the capabilities of usermodehelpers. This problem
reared its head recently when someone complained that any user with
cap_net_admin was able to load arbitrary kernel modules, even though the user
didn't have cap_sys_module. The reason is that the actual load is done by
a usermode helper and those always have the full cap set. This patch adds new
sysctls which allow us to bound the permissions of usermode helpers.
/proc/sys/kernel/usermodehelper/bset
/proc/sys/kernel/usermodehelper/inheritable
You must have CAP_SYS_MODULE and CAP_SETPCAP to change these (changes are
&= ONLY). When the kernel launches a usermodehelper it will do so with these
as the bset and pI.
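When the kernel launches a helper, the bounding could then look roughly
like this (a sketch; the variable names are assumptions based on the
sysctls above):

	/* Bound the helper's capabilities by the usermodehelper sysctls. */
	new->cap_bset = cap_intersect(usermodehelper_bset, new->cap_bset);
	new->cap_inheritable = cap_intersect(usermodehelper_inheritable,
					     new->cap_inheritable);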
-v2: make globals static
create spinlock to protect globals
-v3: require both CAP_SETPCAP and CAP_SYS_MODULE
-v4: fix the typo s/CAP_SET_PCAP/CAP_SETPCAP/ because I didn't commit
Signed-off-by: Eric Paris <eparis@redhat.com>
No-objection-from: Serge E. Hallyn <serge.hallyn@canonical.com>
Acked-by: David Howells <dhowells@redhat.com>
Acked-by: Serge E. Hallyn <serge.hallyn@canonical.com>
Acked-by: Andrew G. Morgan <morgan@kernel.org>
Signed-off-by: James Morris <jmorris@namei.org>
After "ptrace: Clean transitions between TASK_STOPPED and TRACED"
d79fdd6d96, ptrace_check_attach()
should never see a TASK_STOPPED tracee and s/STOPPED/TRACED/ is
no longer legal. Add the warning.
Note: ptrace_check_attach() can be greatly simplified, in particular
it doesn't need tasklist. But I'd prefer another patch for that.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
This patch moves SIGNAL_STOP_DEQUEUED from signal_struct->flags to
task_struct->group_stop, and thus makes it per-thread.
Like SIGNAL_STOP_DEQUEUED, GROUP_STOP_DEQUEUED can be a false positive
after return from get_signal_to_deliver(); this is fine. The only
purpose of this bit is: we can drop ->siglock after __dequeue_signal()
returns the sig_kernel_stop() signal and before we call
do_signal_stop(), in this case we must not miss SIGCONT if it comes in
between.
But, unlike SIGNAL_STOP_DEQUEUED, GROUP_STOP_DEQUEUED cannot be a
false positive in do_signal_stop() if multiple threads dequeue the
sig_kernel_stop() signal at the same time.
Consider two threads, T1 and T2; SIGTTIN has a handler.
- T1 dequeues SIGTSTP and sets SIGNAL_STOP_DEQUEUED, then
it drops ->siglock
- SIGCONT comes and clears SIGNAL_STOP_DEQUEUED, SIGTSTP
should be cancelled.
- T2 dequeues SIGTTIN and sets SIGNAL_STOP_DEQUEUED again.
Since we have a handler we should not stop, T2 returns
to usermode to run the handler.
- T1 continues, calls do_signal_stop() and wrongly starts
the group stop because SIGNAL_STOP_DEQUEUED was restored
in between.
With or without this change:
- we need to do something with ptrace_signal() which can
return SIGSTOP, but this needs another discussion
- SIGSTOP can be lost if it races with the mt exec, will
be fixed later.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
A PF_EXITING or TASK_STOPPED task has already called task_participate_group_stop()
and cleared its ->group_stop. There is no need to do task_clear_group_stop_pending()
when we start the new group stop.
Add a small comment to explain the !task_is_stopped() check. Note that this
check is not exactly right and it can lead to an unnecessary stop later if the
thread is TASK_PTRACED. What we need is task_participated_in_group_stop();
this will be solved later.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
prepare_signal(SIGCONT) should never set TIF_SIGPENDING or wake up
the TASK_INTERRUPTIBLE threads. We are going to call complete_signal()
which should pick the right thread correctly. All we need is to wake
up the TASK_STOPPED threads.
If the task was stopped, it can't return to usermode without taking
->siglock. Otherwise we don't care, and the spurious TIF_SIGPENDING
can't be useful.
The comment says:
* If there is a handler for SIGCONT, we must make
* sure that no thread returns to user mode before
* we post the signal
It is not clear what this means. Probably, "when there's only a single
thread" and this continues to be true. Otherwise, even if this SIGCONT
is not private, with or without this change only one thread can dequeue
SIGCONT; other threads can happily return to user mode before
that thread handles this signal.
Note also that wake_up_state(t, __TASK_STOPPED) can't race with the task
which changes its state; the TASK_STOPPED state is protected by ->siglock as
well.
In short: when it comes to signal delivery, SIGCONT is the normal signal
and does not need any special support.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The allocated cpumask should be freed in __setup_irq().
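I.e., the exit paths want something along these lines (sketch):

	out_mask:
		free_cpumask_var(mask);	/* previously leaked on these returns */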
Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
LKML-Reference: <1301744375-6812-1-git-send-email-dfeng@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
On ppc64 the crashkernel region almost always overlaps an area of firmware.
This works fine except when using the sysfs interface to reduce the kdump
region. If we free the firmware area we are guaranteed to crash.
Rename free_reserved_phys_range to crash_free_reserved_phys_range and make
it a weak function so we can override it.
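Making the generic version weak is what allows the override, roughly
(sketch; the body is the existing generic page-freeing loop):

	/* Generic implementation; architectures such as ppc64 can override it. */
	void __weak crash_free_reserved_phys_range(unsigned long begin,
						   unsigned long end)
	{
		/* default page-by-page freeing of the shrunk crashkernel region */
	}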
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
sys_perf_event_open() had an imbalance in the number of task refs it
took, causing a memory leak.
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: stable@kernel.org # .37+
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>