Commit graph

10940 commits

Author SHA1 Message Date
Mike Galbraith
5091faa449 sched: Add 'autogroup' scheduling feature: automated per session task groups
A recurring complaint from CFS users is that parallel kbuild has
a negative impact on desktop interactivity.  This patch
implements an idea from Linus, to automatically create task
groups.  Currently, only per session autogroups are implemented,
but the patch leaves the way open for enhancement.

Implementation: each task's signal struct contains an inherited
pointer to a refcounted autogroup struct containing a task group
pointer, the default for all tasks pointing to the
init_task_group.  When a task calls setsid(), a new task group
is created, the process is moved into the new task group, and a
reference to the preveious task group is dropped.  Child
processes inherit this task group thereafter, and increase it's
refcount.  When the last thread of a process exits, the
process's reference is dropped, such that when the last process
referencing an autogroup exits, the autogroup is destroyed.

At runqueue selection time, IFF a task has no cgroup assignment,
its current autogroup is used.

Autogroup bandwidth is controllable via setting it's nice level
through the proc filesystem:

  cat /proc/<pid>/autogroup

Displays the task's group and the group's nice level.

  echo <nice level> > /proc/<pid>/autogroup

Sets the task group's shares to the weight of nice <level> task.
Setting nice level is rate limited for !admin users due to the
abuse risk of task group locking.

The feature is enabled from boot by default if
CONFIG_SCHED_AUTOGROUP=y is selected, but can be disabled via
the boot option noautogroup, and can also be turned on/off on
the fly via:

  echo [01] > /proc/sys/kernel/sched_autogroup_enabled

... which will automatically move tasks to/from the root task group.

Signed-off-by: Mike Galbraith <efault@gmx.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Markus Trippelsdorf <markus@trippelsdorf.de>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Paul Turner <pjt@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
[ Removed the task_group_path() debug code, and fixed !EVENTFD build failure. ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
LKML-Reference: <1290281700.28711.9.camel@maggy.simson.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-30 16:03:35 +01:00
Paul Turner
822bc180a7 sched: Fix unregister_fair_sched_group()
In the flipping and flopping between calling
unregister_fair_sched_group() on a per-cpu versus per-group basis
we ended up in a bad state.

Remove from the list for the passed cpu as opposed to some
arbitrary index.

( This fixes explosions w/ autogroup as well as a group
  creation/destruction stress test. )

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
LKML-Reference: <20101130005740.080828123@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-30 10:07:10 +01:00
Paul E. McKenney
46fdb0937f rcu: Make synchronize_srcu_expedited() fast if running readers
The synchronize_srcu_expedited() function is currently quick if there
are no active readers, but will delay a full jiffy if there are any.
If these readers leave their SRCU read-side critical sections quickly,
this is way too long to wait.  So this commit first waits ten microseconds,
and only then falls back to jiffy-at-a-time waiting.

Reported-by: Avi Kivity <avi@redhat.com>
Reported-by: Marcelo Tosatti <mtosatti@redhat.com>
Tested-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2010-11-29 22:02:40 -08:00
Paul E. McKenney
db3a892099 rcu: fix race condition in synchronize_sched_expedited()
The new (early 2010) implementation of synchronize_sched_expedited() uses
try_stop_cpu() to force a context switch on every CPU.  It also permits
concurrent calls to synchronize_sched_expedited() to share a single call
to try_stop_cpu() through use of an atomically incremented
synchronize_sched_expedited_count variable.  Unfortunately, this is
subject to failure as follows:

o	Task A invokes synchronize_sched_expedited(), try_stop_cpus()
	succeeds, but Task A is preempted before getting to the atomic
	increment of synchronize_sched_expedited_count.

o	Task B also invokes synchronize_sched_expedited(), with exactly
	the same outcome as Task A.

o	Task C also invokes synchronize_sched_expedited(), again with
	exactly the same outcome as Tasks A and B.

o	Task D also invokes synchronize_sched_expedited(), but only
	gets as far as acquiring the mutex within try_stop_cpus()
	before being preempted, interrupted, or otherwise delayed.

o	Task E also invokes synchronize_sched_expedited(), but only
	gets to the snapshotting of synchronize_sched_expedited_count.

o	Tasks A, B, and C all increment synchronize_sched_expedited_count.

o	Task E fails to get the mutex, so checks the new value
	of synchronize_sched_expedited_count.  It finds that the
	value has increased, so (wrongly) assumes that its work
	has been done, returning despite there having been no
	expedited grace period since it began.

The solution is to have the lowest-numbered CPU atomically increment
the synchronize_sched_expedited_count variable within the
synchronize_sched_expedited_cpu_stop() function, which is under
the protection of the mutex acquired by try_stop_cpus().  However, this
also requires that piggybacking tasks wait for three rather than two
instances of try_stop_cpu(), because we cannot control the order in
which the per-CPU callback function occur.

Cc: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2010-11-29 22:02:00 -08:00
Paul E. McKenney
2d999e03b7 rcu: update documentation/comments for Lai's adoption patch
Lai's RCU-callback immediate-adoption patch changes the RCU tracing
output, so update tracing.txt.  Also update a few comments to clarify
the synchronization design.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2010-11-29 22:01:59 -08:00
Lai Jiangshan
29494be71a rcu,cleanup: simplify the code when cpu is dying
When we handle the CPU_DYING notifier, the whole system is stopped except
for the current CPU.  We therefore need no synchronization with the other
CPUs.  This allows us to move any orphaned RCU callbacks directly to the
list of any online CPU without needing to run them through the global
orphan lists.  These global orphan lists can therefore be dispensed with.
This commit makes thes changes, though currently victimizes CPU 0 @@@.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2010-11-29 22:01:58 -08:00
Lai Jiangshan
7b27d5475f rcu,cleanup: move synchronize_sched_expedited() out of sched.c
The first version of synchronize_sched_expedited() used the migration
code in the scheduler, and was therefore implemented in kernel/sched.c.
However, the more recent version of this code no longer uses the
migration code, so this commit moves it to the main RCU source files.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2010-11-29 22:01:58 -08:00
Paul E. McKenney
deb7a41815 rcu: get rid of obsolete "classic" names in TREE_RCU tracing
The TREE_RCU tracing had obsolete rcuclassic_trace_init() and
rcuclassic_trace_cleanup() function names.  This commit brings them
up to date: rcutree_trace_init() and rcutree_trace_cleanup(),
respectively.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2010-11-29 22:01:57 -08:00
Paul E. McKenney
e940cc804e rcu: Distinguish between boosting and boosted
RCU priority boosting's tracing did not distinguish between ongoing
boosting and completion of boosting.  This commit therefore adds this
capability.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2010-11-29 22:01:56 -08:00
Paul E. McKenney
9e571a82f0 rcu: add tracing for TINY_RCU and TINY_PREEMPT_RCU
Add tracing for the tiny RCU implementations, including statistics on
boosting in the case of TINY_PREEMPT_RCU and RCU_BOOST.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2010-11-29 22:01:55 -08:00
Paul E. McKenney
24278d1483 rcu: priority boosting for TINY_PREEMPT_RCU
Add priority boosting, but only for TINY_PREEMPT_RCU.  This is enabled
by the default-off RCU_BOOST kernel parameter.  The priority to which to
boost preempted RCU readers is controlled by the RCU_BOOST_PRIO kernel
parameter (defaulting to real-time priority 1) and the time to wait
before boosting the readers blocking a given grace period is controlled
by the RCU_BOOST_DELAY kernel parameter (defaulting to 500 milliseconds).

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2010-11-29 22:01:54 -08:00
Jesper Juhl
fa9f90be74 Kill off a bunch of warning: ‘inline’ is not at beginning of declaration
These warnings are spewed during a build of a 'allnoconfig' kernel
(especially the ones from u64_stats_sync.h show up a lot) when building
with -Wextra (which I often do)..
They are
  a) annoying
  b) easy to get rid of.
This patch kills them off.

include/linux/u64_stats_sync.h:70:1: warning: ‘inline’ is not at beginning of declaration
include/linux/u64_stats_sync.h:77:1: warning: ‘inline’ is not at beginning of declaration
include/linux/u64_stats_sync.h:84:1: warning: ‘inline’ is not at beginning of declaration
include/linux/u64_stats_sync.h:96:1: warning: ‘inline’ is not at beginning of declaration
include/linux/u64_stats_sync.h:115:1: warning: ‘inline’ is not at beginning of declaration
include/linux/u64_stats_sync.h:127:1: warning: ‘inline’ is not at beginning of declaration
kernel/time.c:241:1: warning: ‘inline’ is not at beginning of declaration
kernel/time.c:257:1: warning: ‘inline’ is not at beginning of declaration
kernel/perf_event.c:4513:1: warning: ‘inline’ is not at beginning of declaration
mm/page_alloc.c:4012:1: warning: ‘inline’ is not at beginning of declaration

Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2010-11-28 23:08:04 +01:00
Serge E. Hallyn
ce6ada35bd security: Define CAP_SYSLOG
Privileged syslog operations currently require CAP_SYS_ADMIN.  Split
this off into a new CAP_SYSLOG privilege which we can sanely take away
from a container through the capability bounding set.

With this patch, an lxc container can be prevented from messing with
the host's syslog (i.e. dmesg -c).

Changelog: mar 12 2010: add selinux capability2:cap_syslog perm
Changelog: nov 22 2010:
	. port to new kernel
	. add a WARN_ONCE if userspace isn't using CAP_SYSLOG

Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Andrew G. Morgan <morgan@kernel.org>
Acked-By: Kees Cook <kees.cook@canonical.com>
Cc: James Morris <jmorris@namei.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: "Christopher J. PeBenito" <cpebenito@tresys.com>
Cc: Eric Paris <eparis@parisplace.org>
Signed-off-by: James Morris <jmorris@namei.org>
2010-11-29 08:35:12 +11:00
Linus Torvalds
a9e40a2493 Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  perf: Fix the software context switch counter
  perf, x86: Fixup Kconfig deps
  x86, perf, nmi: Disable perf if counters are not accessible
  perf: Fix inherit vs. context rotation bug
2010-11-28 12:25:02 -08:00
Jens Axboe
f30195c502 Merge branch 'cleanup-bd_claim' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc into for-2.6.38/core 2010-11-27 19:49:18 +01:00
Linus Torvalds
0318755276 Merge branch 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  posix-cpu-timers: Rcu_read_lock/unlock protect find_task_by_vpid call
2010-11-27 07:29:20 +09:00
Linus Torvalds
d2f30c73ab Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  perf symbols: Remove incorrect open-coded container_of()
  perf record: Handle restrictive permissions in /proc/{kallsyms,modules}
  x86/kprobes: Prevent kprobes to probe on save_args()
  irq_work: Drop cmpxchg() result
  perf: Fix owner-list vs exit
  x86, hw_nmi: Move backtrace_mask declaration under ARCH_HAS_NMI_WATCHDOG
  tracing: Fix recursive user stack trace
  perf,hw_breakpoint: Initialize hardware api earlier
  x86: Ignore trap bits on single step exceptions
  tracing: Force arch_local_irq_* notrace for paravirt
  tracing: Fix module use of trace_bprintk()
2010-11-27 07:28:17 +09:00
Linus Torvalds
1b065fdff1 Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  sched: Fix idle balancing
  sched: Fix volanomark performance regression
2010-11-27 07:27:54 +09:00
Peter Zijlstra
004417a6d4 perf, arch: Cleanup perf-pmu init vs lockup-detector
The perf hardware pmu got initialized at various points in the boot,
some before early_initcall() some after (notably arch_initcall).

The problem is that the NMI lockup detector is ran from early_initcall()
and expects the hardware pmu to be present.

Sanitize this by moving all architecture hardware pmu implementations to
initialize at early_initcall() and move the lockup detector to an explicit
initcall right after that.

Cc: paulus <paulus@samba.org>
Cc: davem <davem@davemloft.net>
Cc: Michael Cree <mcree@orcon.net.nz>
Cc: Deng-Cheng Zhu <dengcheng.zhu@gmail.com>
Acked-by: Paul Mundt <lethal@linux-sh.org>
Acked-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1290707759.2145.119.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-26 15:14:56 +01:00
Peter Zijlstra
963988262c perf: Ignore non-sampling overflows
Some arch implementations call perf_event_overflow() by 'accident',
ignore this.

Reported-by: Francis Moreau <francis.moro@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-26 15:14:55 +01:00
Franck Bui-Huu
5d508e820a perf: Don't bother to init the hrtimer for no SW sampling counters
Signed-off-by: Franck Bui-Huu <fbuihuu@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1290525705-6265-3-git-send-email-fbuihuu@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-26 15:14:55 +01:00
Franck Bui-Huu
2e939d1da9 perf: Limit event refresh to sampling event
Signed-off-by: Franck Bui-Huu <fbuihuu@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1290525705-6265-2-git-send-email-fbuihuu@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-26 15:14:55 +01:00
Franck Bui-Huu
6c7e550f13 perf: Introduce is_sampling_event()
and use it when appropriate.

Signed-off-by: Franck Bui-Huu <fbuihuu@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1290525705-6265-1-git-send-email-fbuihuu@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-26 15:14:54 +01:00
Ingo Molnar
6c869e772c Merge branch 'perf/urgent' into perf/core
Conflicts:
	arch/x86/kernel/apic/hw_nmi.c

Merge reason: Resolve conflict, queue up dependent patch.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-26 15:07:02 +01:00
Nikanth Karthikesan
b7a2b39d9b sched: Remove unused argument dest_cpu to migrate_task()
Remove unused argument, 'dest_cpu' of migrate_task(), and pass runqueue,
as it is always known at the call site.

Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <201011261237.09187.knikanth@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-26 15:05:36 +01:00
Gerald Schaefer
335d7afbfb mutexes, sched: Introduce arch_mutex_cpu_relax()
The spinning mutex implementation uses cpu_relax() in busy loops as a
compiler barrier. Depending on the architecture, cpu_relax() may do more
than needed in this specific mutex spin loops. On System z we also give
up the time slice of the virtual cpu in cpu_relax(), which prevents
effective spinning on the mutex.

This patch replaces cpu_relax() in the spinning mutex code with
arch_mutex_cpu_relax(), which can be defined by each architecture that
selects HAVE_ARCH_MUTEX_CPU_RELAX. The default is still cpu_relax(), so
this patch should not affect other architectures than System z for now.

Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1290437256.7455.4.camel@thinkpad>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-26 15:05:34 +01:00
Ingo Molnar
22a867d817 Merge commit 'v2.6.37-rc3' into sched/core
Merge reason: Pick up latest fixes.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-26 15:05:21 +01:00
Ingo Molnar
e4e91ac410 Merge commit 'v2.6.37-rc3' into perf/core
Merge reason: Pick up latest fixes.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-26 15:04:47 +01:00
Heiko Carstens
61ab25447a nohz: Fix printk_needs_cpu() return value on offline cpus
This patch fixes a hang observed with 2.6.32 kernels where timers got enqueued
on offline cpus.

printk_needs_cpu() may return 1 if called on offline cpus. When a cpu gets
offlined it schedules the idle process which, before killing its own cpu, will
call tick_nohz_stop_sched_tick(). That function in turn will call
printk_needs_cpu() in order to check if the local tick can be disabled. On
offline cpus this function should naturally return 0 since regardless if the
tick gets disabled or not the cpu will be dead short after. That is besides the
fact that __cpu_disable() should already have made sure that no interrupts on
the offlined cpu will be delivered anyway.

In this case it prevents tick_nohz_stop_sched_tick() to call
select_nohz_load_balancer(). No idea if that really is a problem. However what
made me debug this is that on 2.6.32 the function get_nohz_load_balancer() is
used within __mod_timer() to select a cpu on which a timer gets enqueued. If
printk_needs_cpu() returns 1 then the nohz_load_balancer cpu doesn't get
updated when a cpu gets offlined. It may contain the cpu number of an offline
cpu. In turn timers get enqueued on an offline cpu and not very surprisingly
they never expire and cause system hangs.

This has been observed 2.6.32 kernels. On current kernels __mod_timer() uses
get_nohz_timer_target() which doesn't have that problem. However there might be
other problems because of the too early exit tick_nohz_stop_sched_tick() in
case a cpu goes offline.

Easiest way to fix this is just to test if the current cpu is offline and call
printk_tick() directly which clears the condition.

Alternatively I tried a cpu hotplug notifier which would clear the condition,
however between calling the notifier function and printk_needs_cpu() something
could have called printk() again and the problem is back again. This seems to
be the safest fix.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: stable@kernel.org
LKML-Reference: <20101126120235.406766476@de.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-26 15:03:12 +01:00
Heiko Carstens
49f4138346 printk: Fix wake_up_klogd() vs cpu hotplug
wake_up_klogd() may get called from preemptible context but uses
__raw_get_cpu_var() to write to a per cpu variable. If it gets preempted
between getting the address and writing to it, the cpu in question could be
offline if the process gets scheduled back and hence writes to the per cpu data
of an offline cpu.

This buggy behaviour was introduced with fa33507a "printk: robustify
printk, fix #2" which was supposed to fix a "using smp_processor_id() in
preemptible" warning.

Let's use this_cpu_write() instead which disables preemption and makes sure
that the outlined scenario cannot happen.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20101126124247.GC7023@osiris.boeblingen.de.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-26 15:03:11 +01:00
Peter Zijlstra
ee6dcfa40a perf: Fix the software context switch counter
Stephane noticed that because the perf_sw_event() call is inside the
perf_event_task_sched_out() call it won't get called unless we
have a per-task counter.

Reported-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-26 15:00:59 +01:00
Thomas Gleixner
dddd3379a6 perf: Fix inherit vs. context rotation bug
It was found that sometimes children of tasks with inherited events had
one extra event. Eventually it turned out to be due to the list rotation
no being exclusive with the list iteration in the inheritance code.

Cure this by temporarily disabling the rotation while we inherit the events.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Cc: <stable@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-26 15:00:56 +01:00
Hitoshi Mitake
e5cba24e3f workqueue: check the allocation of system_unbound_wq
I found a trivial bug on initialization of workqueue.
Current init_workqueues doesn't check the result of
allocation of system_unbound_wq, this should be checked
like other queues.

Signed-off-by: Hitoshi Mitake <mitake@dcl.info.waseda.ac.jp>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2010-11-26 12:06:44 +01:00
Peter Zijlstra
5bb6b1ea67 sched: Add some clock info to sched_debug
Add more clock information to /proc/sched_debug, Thomas wanted to see
the sched_clock_stable state.

Requested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-23 10:29:08 +01:00
Peter Zijlstra
51a96c7781 cpu: Remove incorrect BUG_ON
Oleg mentioned that there is no actual guarantee the dying cpu's
migration thread is actually finished running when we get there, so
replace the BUG_ON() with a spinloop waiting for it.

Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-23 10:29:08 +01:00
Dhaval Giani
2e01f4740a cpu: Remove unused variable
GCC warns us about:

 kernel/cpu.c: In function ‘take_cpu_down’:
 kernel/cpu.c:200:15: warning: unused variable ‘cpu’

This variable is unused since param->hcpu is directly
used later on in cpu_notify.

Signed-off-by: Dhaval Giani <dhaval_giani@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1290091494.1145.5.camel@gondor.retis>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-23 10:29:07 +01:00
Peter Zijlstra
70caf8a6c1 sched: Fix UP build breakage
The recent cgroup-scheduling rework caused a UP build problem.

Cc: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-23 10:29:07 +01:00
Erik Gilling
28d0686cf7 sched: Make task dump print all 15 chars of proc comm
Signed-off-by: Erik Gilling <konkers@android.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1290218934-8544-3-git-send-email-john.stultz@linaro.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-23 10:29:07 +01:00
Linus Torvalds
33e0d57f5d Revert "kernel: make /proc/kallsyms mode 400 to reduce ease of attacking"
This reverts commit 59365d136d.

It turns out that this can break certain existing user land setups.
Quoth Sarah Sharp:

 "On Wednesday, I updated my branch to commit 460781b from linus' tree,
  and my box would not boot.  klogd segfaulted, which stalled the whole
  system.

  At first I thought it actually hung the box, but it continued booting
  after 5 minutes, and I was able to log in.  It dropped back to the
  text console instead of the graphical bootup display for that period
  of time.  dmesg surprisingly still works.  I've bisected the problem
  down to this commit (commit 59365d136d)

  The box is running klogd 1.5.5ubuntu3 (from Jaunty).  Yes, I know
  that's old.  I read the bit in the commit about changing the
  permissions of kallsyms after boot, but if I can't boot that doesn't
  help."

So let's just keep the old default, and encourage distributions to do
the "chmod -r /proc/kallsyms" in their bootup scripts.  This is not
worth a kernel option to change default behavior, since it's so easily
done in user space.

Reported-and-bisected-by: Sarah Sharp <sarah.a.sharp@linux.intel.com>
Cc: Marcus Meissner <meissner@suse.de>
Cc: Tejun Heo <tj@kernel.org>
Cc: Eugene Teo <eugeneteo@kernel.org>
Cc: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-11-19 11:54:40 -08:00
Steven Rostedt
0429578016 tracing/events: Show real number in array fields
Currently we have in something like the sched_switch event:

  field:char prev_comm[TASK_COMM_LEN];	offset:12;	size:16;	signed:1;

When a userspace tool such as perf tries to parse this, the
TASK_COMM_LEN is meaningless. This is done because the TRACE_EVENT() macro
simply uses a #len to show the string of the length. When the length is
an enum, we get a string that means nothing for tools.

By adding a static buffer and a mutex to protect it, we can store the
string into that buffer with snprintf and show the actual number.
Now we get:

  field:char prev_comm[16];       offset:12;      size:16;        signed:1;

Something much more useful.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-11-19 10:18:47 -05:00
Ingo Molnar
ae51ce9061 Merge branch 'perf/core' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing into perf/core 2010-11-18 20:07:12 +01:00
Linus Torvalds
2d42dc3feb Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb:
  kgdb,ppc: Fix regression in evr register handling
  kgdb,x86: fix regression in detach handling
  kdb: fix crash when KDB_BASE_CMD_MAX is exceeded
  kdb: fix memory leak in kdb_main.c
2010-11-18 08:24:58 -08:00
Frederic Weisbecker
61c32659b1 tracing: New flag to allow non privileged users to use a trace event
This adds a new trace event internal flag that allows them to be
used in perf by non privileged users in case of task bound tracing.

This is desired for syscalls tracepoint because they don't leak
global system informations, like some other tracepoints.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Jason Baron <jbaron@redhat.com>
2010-11-18 14:37:40 +01:00
matthieu castet
84e1c6bb38 x86: Add RO/NX protection for loadable kernel modules
This patch is a logical extension of the protection provided by
CONFIG_DEBUG_RODATA to LKMs. The protection is provided by
splitting module_core and module_init into three logical parts
each and setting appropriate page access permissions for each
individual section:

 1. Code: RO+X
 2. RO data: RO+NX
 3. RW data: RW+NX

In order to achieve proper protection, layout_sections() have
been modified to align each of the three parts mentioned above
onto page boundary. Next, the corresponding page access
permissions are set right before successful exit from
load_module(). Further, free_module() and sys_init_module have
been modified to set module_core and module_init as RW+NX right
before calling module_free().

By default, the original section layout and access flags are
preserved. When compiled with CONFIG_DEBUG_SET_MODULE_RONX=y,
the patch will page-align each group of sections to ensure that
each page contains only one type of content and will enforce
RO/NX for each group of pages.

  -v1: Initial proof-of-concept patch.
  -v2: The patch have been re-written to reduce the number of #ifdefs
       and to make it architecture-agnostic. Code formatting has also
       been corrected.
  -v3: Opportunistic RO/NX protection is now unconditional. Section
       page-alignment is enabled when CONFIG_DEBUG_RODATA=y.
  -v4: Removed most macros and improved coding style.
  -v5: Changed page-alignment and RO/NX section size calculation
  -v6: Fixed comments. Restricted RO/NX enforcement to x86 only
  -v7: Introduced CONFIG_DEBUG_SET_MODULE_RONX, added
       calls to set_all_modules_text_rw() and set_all_modules_text_ro()
       in ftrace
  -v8: updated for compatibility with linux 2.6.33-rc5
  -v9: coding style fixes
 -v10: more coding style fixes
 -v11: minor adjustments for -tip
 -v12: minor adjustments for v2.6.35-rc2-tip
 -v13: minor adjustments for v2.6.37-rc1-tip

Signed-off-by: Siarhei Liakh <sliakh.lkml@gmail.com>
Signed-off-by: Xuxian Jiang <jiang@cs.ncsu.edu>
Acked-by: Arjan van de Ven <arjan@linux.intel.com>
Reviewed-by: James Morris <jmorris@namei.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Dave Jones <davej@redhat.com>
Cc: Kees Cook <kees.cook@canonical.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
LKML-Reference: <4CE2F914.9070106@free.fr>
[ minor cleanliness edits, -v14: build failure fix ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-18 13:32:56 +01:00
Paul Turner
9437178f62 sched: Update tg->shares after cpu.shares write
Formerly sched_group_set_shares would force a rebalance by overflowing domain
share sums.  Now that per-cpu averages are maintained we can set the true value
by issuing an update_cfs_shares() following a tg->shares update.

Also initialize tg se->load to 0 for consistency since we'll now set correct
weights on enqueue.

Signed-off-by: Paul Turner <pjt@google.com?>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20101115234938.465521344@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-18 13:27:50 +01:00
Paul Turner
d6b5591829 sched: Allow update_cfs_load() to update global load
Refactor the global load updates from update_shares_cpu() so that
update_cfs_load() can update global load when it is more than ~10%
out of sync.

The new global_load parameter allows us to force an update, regardless of
the error factor so that we can synchronize w/ update_shares().

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20101115234938.377473595@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-18 13:27:50 +01:00
Paul Turner
3b3d190ec3 sched: Implement demand based update_cfs_load()
When the system is busy, dilation of rq->next_balance makes lb->update_shares()
insufficiently frequent for threads which don't sleep (no dequeue/enqueue
updates).  Adjust for this by making demand based updates based on the
accumulation of execution time sufficient to wrap our averaging window.

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20101115234938.291159744@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-18 13:27:49 +01:00
Paul Turner
c66eaf619c sched: Update shares on idle_balance
Since shares updates are no longer expensive and effectively local, update them
at idle_balance().  This allows us to more quickly redistribute shares to
another cpu when our load becomes idle.

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20101115234938.204191702@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-18 13:27:49 +01:00
Paul Turner
a7a4f8a752 sched: Add sysctl_sched_shares_window
Introduce a new sysctl for the shares window and disambiguate it from
sched_time_avg.

A 10ms window appears to be a good compromise between accuracy and performance.

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20101115234938.112173964@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-18 13:27:49 +01:00
Paul Turner
67e86250f8 sched: Introduce hierarchal order on shares update list
Avoid duplicate shares update calls by ensuring children always appear before
parents in rq->leaf_cfs_rq_list.

This allows us to do a single in-order traversal for update_shares().

Since we always enqueue in bottom-up order this reduces to 2 cases:

1) Our parent is already in the list, e.g.

   root
     \
      b
      /\
      c d* (root->b->c already enqueued)

Since d's parent is enqueued we push it to the head of the list, implicitly ahead of b.

2) Our parent does not appear in the list (or we have no parent)

In this case we enqueue to the tail of the list, if our parent is subsequently enqueued
(bottom-up) it will appear to our right by the same rule.

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20101115234938.022488865@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-18 13:27:48 +01:00