Commit graph

17875 commits

Author SHA1 Message Date
Paul E. McKenney
f1f399d128 rcu: Optimize RCU_FAST_NO_HZ for RCU_NOCB_CPU_ALL
If CONFIG_RCU_NOCB_CPU_ALL=y, then no CPU will ever have RCU callbacks
because these callbacks will instead be handled by the rcuo kthreads.
However, the current version of RCU_FAST_NO_HZ nevertheless checks for RCU
callbacks.  This commit therefore creates static inline implementations
of rcu_prepare_for_idle() and rcu_cleanup_after_idle() that are no-ops
when CONFIG_RCU_NOCB_CPU_ALL=y.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-02-17 16:03:33 -08:00
Paul E. McKenney
ffa83fb565 rcu: Optimize rcu_needs_cpu() for RCU_NOCB_CPU_ALL
If CONFIG_RCU_NOCB_CPU_ALL=y, then rcu_needs_cpu() will always
return false.  However, the current version nevertheless checks
for RCU callbacks.  This commit therefore creates a static inline
implementation of rcu_needs_cpu() that unconditionally returns false
when CONFIG_RCU_NOCB_CPU_ALL=y.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-02-17 16:03:09 -08:00
Paul E. McKenney
2f33b512a5 rcu: Optimize rcu_is_nocb_cpu() for RCU_NOCB_CPU_ALL
If CONFIG_RCU_NOCB_CPU_ALL=y, then rcu_is_nocb_cpu() will always
return true.  However, the current version nevertheless checks
rcu_nocb_mask.  This commit therefore creates a static inline
implementation of rcu_is_nocb_cpu() that unconditionally returns
true when CONFIG_RCU_NOCB_CPU_ALL=y.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-02-17 15:32:48 -08:00
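The three RCU_NOCB_CPU_ALL commits above share one pattern: when a build-time option fixes a function's result, a static inline stub replaces the runtime check. The following is a minimal userspace sketch of that pattern, not the kernel's code. The macro SKETCH_NOCB_CPU_ALL (standing in for CONFIG_RCU_NOCB_CPU_ALL), the mask sketch_nocb_mask (standing in for rcu_nocb_mask), and the function name are all hypothetical.

```c
#include <stdbool.h>

/* Hypothetical stand-in for CONFIG_RCU_NOCB_CPU_ALL=y.  Remove this
 * define to build the runtime-check variant instead. */
#define SKETCH_NOCB_CPU_ALL 1

#ifdef SKETCH_NOCB_CPU_ALL
/* Every CPU is offloaded, so the answer is known at compile time and
 * the compiler can fold away any branch that tests it. */
static inline bool sketch_is_nocb_cpu(int cpu)
{
	(void)cpu;
	return true;
}
#else
/* Hypothetical stand-in for rcu_nocb_mask: bit N set means CPU N
 * has its callbacks offloaded. */
static unsigned long sketch_nocb_mask = 0x5UL;

/* General case: consult the mask at run time. */
static inline bool sketch_is_nocb_cpu(int cpu)
{
	return ((sketch_nocb_mask >> cpu) & 1UL) != 0;
}
#endif
```

With the define in place, sketch_is_nocb_cpu() returns true for any CPU number without touching memory.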
Shaibal Dutta
ae1670339c rcu: Move SRCU grace period work to power efficient workqueue
For better use of CPU idle time, allow the scheduler to select the CPU
on which the SRCU grace period work would be scheduled. This improves
idle residency time and conserves power.

This functionality is enabled when CONFIG_WQ_POWER_EFFICIENT is selected.

Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Dipankar Sarma <dipankar@in.ibm.com>
Signed-off-by: Shaibal Dutta <shaibal.dutta@broadcom.com>
[zoran.markovic@linaro.org: Rebased to latest kernel version. Added commit
message. Fixed code alignment.]
Signed-off-by: Zoran Markovic <zoran.markovic@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-02-17 15:02:14 -08:00
Paul Bolle
52e2bb958a rcu: Disambiguate CONFIG_RCU_NOCB_CPUs
This commit fixes a grammar issue in the rcu_nohz_full_cpu() comment
header, so that it is clear that the plural is CPUs not Kconfig options.

Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-02-17 15:02:08 -08:00
Paul E. McKenney
cb1e78cfa2 rcu: Remove ACCESS_ONCE() from jiffies
Because jiffies is one of a very few variables marked "volatile", there
is no need to use ACCESS_ONCE() when accessing it.  This commit therefore
removes the redundant ACCESS_ONCE() wrappers.

Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-02-17 15:01:42 -08:00
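The reasoning in commit cb1e78cfa2 above can be illustrated in userspace. ACCESS_ONCE() works by casting its argument to a volatile-qualified lvalue, so wrapping an object that is already declared volatile adds nothing. The macro below follows the classic kernel definition (it relies on the GNU C __typeof__ extension); the two variables and the reader functions are hypothetical stand-ins.

```c
/* Force exactly one access by going through a volatile-qualified
 * lvalue.  __typeof__ is a GNU C extension. */
#define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

/* Hypothetical stand-ins: one counter declared volatile (as jiffies
 * is), one plain counter. */
static volatile unsigned long sketch_jiffies = 100;
static unsigned long sketch_plain_counter = 7;

/* The object is already volatile, so every read is a real load and
 * the wrapper would be redundant. */
static inline unsigned long sketch_read_jiffies(void)
{
	return sketch_jiffies;
}

/* A plain object needs the cast to keep the compiler from caching or
 * refetching the value. */
static inline unsigned long sketch_read_plain_once(void)
{
	return ACCESS_ONCE(sketch_plain_counter);
}
```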
Paul E. McKenney
87de1cfdc5 rcu: Stop tracking FSF's postal address
All of the RCU source files have the usual GPL header, which contains a
long-obsolete postal address for FSF.  To avoid the need to track the
FSF office's movements, this commit substitutes the URL where GPL may
be found.

Reported-by: Greg KH <gregkh@linuxfoundation.org>
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-02-17 15:01:37 -08:00
Paul E. McKenney
3660c2813f rcu: Add ACCESS_ONCE() to ->n_force_qs_lh accesses
The ->n_force_qs_lh field is accessed without the benefit of any
synchronization, so this commit adds the needed ACCESS_ONCE() wrappers.
Yes, increments to ->n_force_qs_lh can be lost, but contention should
be low and the field is strictly statistical in nature, so this is not
a problem.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-02-17 15:01:10 -08:00
Linus Torvalds
e4178d809f printk: fix syslog() overflowing user buffer
This is not a buffer overflow in the traditional sense: we don't
overflow any *kernel* buffers, but we do mis-count the amount of data we
copy back to user space for the SYSLOG_ACTION_READ_ALL case.

In particular, if the user buffer is too small to hold everything, and
*if* there is a continuation line at just the right place, we can end up
giving the user more data than he asked for.

The reason is that we first count up the number of bytes all the log
records contain, then we walk the records again until we've skipped the
records at the beginning that won't fit, and then we walk the rest of
the records and copy them to the user space buffer.

And in between that "skip the initial records that won't fit" and the
"copy the records that *will* fit to user space", we reset the 'prev'
variable that contained the record information for the last record not
copied.  That meant that when we started copying to user space, we now
had a different character count than what we had originally calculated
in the first record walk-through.

The fix is to simply not clear the 'prev' flags value (in both cases
where we had the same logic: syslog_print_all and kmsg_dump_get_buffer;
the latter is used for pstore-like dumping).

Reported-and-tested-by: Debabrata Banerjee <dbanerje@akamai.com>
Acked-by: Kay Sievers <kay@vrfy.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-02-17 12:24:45 -08:00
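The miscount described in commit e4178d809f above can be reproduced with a small model. This is a simplified sketch, not the printk code: it assumes a record's printed size is its text length plus a fixed prefix, and that the prefix is omitted when the previous record was a continuation. The struct, constants, and function names are hypothetical.

```c
#include <stddef.h>

#define SKETCH_PREFIX_LEN 5	/* hypothetical fixed prefix size */
#define SKETCH_CONT       1	/* record is continued by the next one */

struct sketch_rec {
	size_t text_len;
	int flags;
};

/* Printed size of a record; the prefix is omitted when the previous
 * record was a continuation. */
static size_t sketch_rec_len(const struct sketch_rec *r, int prev)
{
	return r->text_len + ((prev & SKETCH_CONT) ? 0 : SKETCH_PREFIX_LEN);
}

/* Return the number of bytes the copy pass would emit for a user
 * buffer of 'size' bytes.  A nonzero reset_prev models the bug. */
static size_t sketch_read_all(const struct sketch_rec *recs, size_t n,
			      size_t size, int reset_prev)
{
	size_t total = 0, copied = 0, i;
	int prev = 0;

	/* Pass 1: count the size of everything. */
	for (i = 0; i < n; i++) {
		total += sketch_rec_len(&recs[i], prev);
		prev = recs[i].flags;
	}

	/* Pass 2: skip leading records until the remainder fits. */
	prev = 0;
	for (i = 0; i < n && total > size; i++) {
		total -= sketch_rec_len(&recs[i], prev);
		prev = recs[i].flags;
	}

	if (reset_prev)
		prev = 0;	/* the bug: forget the last skipped record */

	/* Pass 3: copy the rest (only the byte count is modelled). */
	for (; i < n; i++) {
		copied += sketch_rec_len(&recs[i], prev);
		prev = recs[i].flags;
	}
	return copied;
}
```

For three 10-byte records where only the first is a continuation and the buffer holds 25 bytes, the model emits 25 bytes when 'prev' is kept and 30 bytes when it is reset, which is the overflow the commit fixes.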
Linus Torvalds
5a667a0c02 Merge branches 'irq-urgent-for-linus' and 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq update from Thomas Gleixner:
 "Fix from the urgent branch: a trivial oneliner adding the missing
  Kconfig dependency curing build failures which have been discovered by
  several build robots.

  The update in the irq-core branch provides a new function in the
  irq/devres code, which is a prerequisite for driver developers to get
  rid of boilerplate code all over the place.

  Not a bugfix, but it has zero impact on the current kernel due to the
  lack of users.  It's simpler to provide the infrastructure to
  interested parties via your tree than fulfilling the wishlist of
  driver maintainers on which particular commit or tag this should be
  based on"

* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  genirq: Add missing irq_to_desc export for CONFIG_SPARSE_IRQ=n

* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  genirq: Add devm_request_any_context_irq()
2014-02-15 16:06:12 -08:00
Linus Torvalds
3a19c07c56 Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Thomas Gleixner:
 "The following trilogy of patches brings you:

   - fix for a long standing math overflow issue with HZ < 60

   - a one-liner fix for a corner case in the dreaded tick broadcast
     mechanism affecting a certain range of AMD machines which are
     infested with the infamous automagic C1E power control misfeature

   - a fix for one of the ARM platforms which allows the kernel to
     proceed and boot instead of stupidly panicking for no good reason.
     The patch is slightly larger than necessary, but it's less ugly
     than the alternative 5 liner"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  tick: Clear broadcast pending bit when switching to oneshot
  clocksource: Kona: Print warning rather than panic
  time: Fix overflow when HZ is smaller than 60
2014-02-15 16:04:42 -08:00
Paul Gortmaker
f96a34e27d nohz: ensure users are aware boot CPU is not NO_HZ_FULL
This bit of information is in the Kconfig help text:

  "Note the boot CPU will still be kept outside the range to
  handle the timekeeping duty."

However, neither the variable NO_HZ_FULL_ALL nor the prompt
conveys this important detail, so let's add it to the prompt
to make it more explicitly obvious to the average user.

Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1391711781-7466-1-git-send-email-paul.gortmaker@windriver.com
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2014-02-14 17:59:17 +01:00
Viresh Kumar
8ba1465428 timer: Spare IPI when deferrable timer is queued on idle remote targets
When a timer is enqueued or modified on a remote target, the latter is
expected to see and handle this timer on its next tick. However if the
target is idle and CONFIG_NO_HZ_IDLE=y, the CPU may be sleeping tickless
and the timer may be ignored.

wake_up_nohz_cpu() takes care of that by setting TIF_NEED_RESCHED and
sending an IPI to idle targets so that the tick is reevaluated on the
idle loop through the tick_nohz_idle_*() APIs.

Now this is all performed regardless of the power properties of the
timer. If the timer is deferrable, idle targets don't need to be woken
up. Only the next busy tick needs to care about it, and no IPI kick
is needed for that to happen.

So let's spare the IPI on idle targets when the timer is deferrable.

Meanwhile we keep the current behaviour on full dynticks targets. We
could spare IPIs on idle full dynticks targets as well, but some tricky
races against idle_cpu() must be dealt with to make sure that the timer
is handled properly after idle exit. We can deal with that later since
NO_HZ_FULL already has more important powersaving issues.

Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/CAKohpomMZ0TAN2e6N76_g4ZRzxd5vZ1XfuZfxrP7GMxfTNiLVw@mail.gmail.com
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2014-02-14 17:59:14 +01:00
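The decision described in commit 8ba1465428 above reduces to a small predicate. The sketch below is an illustration only: the function name and its two boolean inputs are hypothetical, and it models just the choice of whether to ask for a wakeup of the remote CPU, not the enqueue path itself.

```c
#include <stdbool.h>

/* Decide whether queueing a timer on a remote CPU should be followed
 * by a wakeup request for that CPU. */
static inline bool sketch_needs_remote_kick(bool timer_deferrable,
					    bool target_full_dynticks)
{
	/* A non-deferrable timer always needs the target to re-evaluate
	 * its tick.  A deferrable timer can wait for the next busy tick,
	 * except on full dynticks targets, where the existing behaviour
	 * is kept for now. */
	return !timer_deferrable || target_full_dynticks;
}
```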
Li Zefan
6534fd6c15 cgroup: fix memory leak in cgroup_mount()
We should free the memory allocated in parse_cgroupfs_options() before
calling this function again.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-02-14 10:52:40 -05:00
Li Zefan
bad3466034 cgroup: fix locking in cgroupstats_build()
css_set_lock has been converted to css_set_rwsem, and rwsem can't nest
inside rcu_read_lock.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-02-14 10:52:39 -05:00
Andi Kleen
58edae3aac lto: Disable LTO for sys_ni
The assembler alias code in cond_syscall does not work
when compiled for LTO. Just disable LTO for that file.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Link: http://lkml.kernel.org/r/1391846481-31491-6-git-send-email-ak@linux.intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-02-13 20:24:53 -08:00
Joe Mario
80375980f1 lto: Handle LTO common symbols in module loader
Here is the workaround I made for having the kernel not reject modules
built with -flto.  The clean solution would be to get the compiler to not
emit the symbol.  Or if it has to emit the symbol, then emit it as
initialized data but put it into a comdat/linkonce section.

Minor tweaks by AK over Joe's patch.

Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Link: http://lkml.kernel.org/r/1391846481-31491-5-git-send-email-ak@linux.intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-02-13 20:24:50 -08:00
Andi Kleen
285c00adf6 asmlinkage: Make trace_hardirqs_on/off_caller visible
These functions are called from assembler, and thus need to be
__visible.

Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Link: http://lkml.kernel.org/r/1391845930-28580-12-git-send-email-ak@linux.intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-02-13 18:14:54 -08:00
Andi Kleen
a7330c997d asmlinkage: Make __stack_chk_fail and memcmp visible
In LTO, symbols implicitly referenced by the compiler need
to be visible. Earlier these symbols were visible implicitly
from being exported, but we disabled implicit visibility for
EXPORTs when modules are disabled to improve code size. So
now these symbols have to be marked visible explicitly.

Do this for __stack_chk_fail (with stack protector)
and memcmp.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Link: http://lkml.kernel.org/r/1391845930-28580-10-git-send-email-ak@linux.intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-02-13 18:13:43 -08:00
Andi Kleen
3ebae4f3a2 asmlinkage: Mark rwsem functions that can be called from assembler asmlinkage
Mark the rwsem functions that can be called from assembler asmlinkage.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Link: http://lkml.kernel.org/r/1391845930-28580-9-git-send-email-ak@linux.intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-02-13 18:13:37 -08:00
Andi Kleen
00b7103078 asmlinkage: Make main_extable_sort_needed visible
main_extable_sort_needed is used by the build system and needs
to be a normal ELF symbol. Make it visible so that LTO
does not remove or mangle it.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Link: http://lkml.kernel.org/r/1391845930-28580-8-git-send-email-ak@linux.intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-02-13 18:13:22 -08:00
Andi Kleen
22d9fd3411 asmlinkage, mutex: Mark __visible
Various kernel/mutex.c functions can be called from
inline assembler, so they should be all global and
__visible.

Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Link: http://lkml.kernel.org/r/1391845930-28580-7-git-send-email-ak@linux.intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-02-13 18:13:19 -08:00
Andi Kleen
b35f830533 asmlinkage: Make trace_hardirq visible
Can be called from assembler code.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Link: http://lkml.kernel.org/r/1391845930-28580-6-git-send-email-ak@linux.intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-02-13 18:13:07 -08:00
Andi Kleen
63f9a7fde7 asmlinkage: Make lockdep_sys_exit asmlinkage
lockdep_sys_exit can be called from assembler code, so make it
asmlinkage.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Link: http://lkml.kernel.org/r/1391845930-28580-5-git-send-email-ak@linux.intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-02-13 18:12:54 -08:00
Andi Kleen
40747ffa5a asmlinkage: Make jiffies visible
Jiffies is referenced by the linker script, so it has to be visible.

Handled both the generic and the x86 version.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Link: http://lkml.kernel.org/r/1391845930-28580-3-git-send-email-ak@linux.intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-02-13 18:12:09 -08:00
Fengguang Wu
430af8ad9d cgroup: fix coccinelle warnings
kernel/cgroup.c:2256:1-3: WARNING: PTR_RET can be used

 Use PTR_ERR_OR_ZERO rather than if(IS_ERR(...)) + PTR_ERR

Generated by: coccinelle/api/ptr_ret.cocci

Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-02-13 16:42:45 -05:00
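The idiom that commit 430af8ad9d above applies can be sketched in userspace. The kernel encodes small negative errno values in the top of the pointer range; the helpers below re-create that convention under the assumption that long and pointers have the same width. The sketch_ names are hypothetical re-implementations, not the kernel's headers.

```c
#define SKETCH_MAX_ERRNO 4095

/* Encode a negative errno value as a pointer. */
static inline void *sketch_err_ptr(long error)
{
	return (void *)error;
}

/* Recover the errno value from an error pointer. */
static inline long sketch_ptr_err(const void *ptr)
{
	return (long)ptr;
}

/* Error pointers occupy the last SKETCH_MAX_ERRNO addresses. */
static inline int sketch_is_err(const void *ptr)
{
	return (unsigned long)ptr >= (unsigned long)-SKETCH_MAX_ERRNO;
}

/* Collapses "if (IS_ERR(p)) return PTR_ERR(p); return 0;" into a
 * single expression. */
static inline int sketch_ptr_err_or_zero(const void *ptr)
{
	return sketch_is_err(ptr) ? (int)sketch_ptr_err(ptr) : 0;
}
```

A caller that previously tested the pointer and then extracted the error can return sketch_ptr_err_or_zero(p) directly.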
Thomas Gleixner
dd5fd9b91a tick: Clear broadcast pending bit when switching to oneshot
AMD systems which use the C1E workaround in the amd_e400_idle routine
trigger the WARN_ON_ONCE in the broadcast code when onlining a CPU.

The reason is that the idle routine of those AMD systems switches the
cpu into forced broadcast mode early on before the newly brought up
CPU can switch over to high resolution / NOHZ mode. The timer related
CPU1 bringup looks like this:

  clockevent_register_device(local_apic);
  tick_setup(local_apic);
  ...
  idle()
	tick_broadcast_on_off(FORCE);
	tick_broadcast_oneshot_control(ENTER)
	  cpumask_set(cpu, broadcast_oneshot_mask);
	halt();

Now the broadcast interrupt on CPU0 sets CPU1 in the
broadcast_pending_mask and wakes CPU1. So CPU1 continues:

	local_apic_timer_interrupt()
	   tick_handle_periodic();
	   softirq()
	     tick_init_highres();
	       cpumask_clr(cpu, broadcast_oneshot_mask);
	
	tick_broadcast_oneshot_control(ENTER)
	   WARN_ON(cpumask_test(cpu, broadcast_pending_mask));

So while we remove CPU1 from the broadcast_oneshot_mask when we switch
over to highres mode, we do not clear the pending bit, which then
triggers the warning when we go back to idle.

The reason why this is only visible on C1E affected AMD systems is
that the other machines enter the deep sleep states via
acpi_idle/intel_idle and exit the broadcast mode before executing the
remote triggered local_apic_timer_interrupt. So the pending bit is
already cleared when the switch over to highres mode is clearing the
oneshot mask.

The solution is simple: Clear the pending bit together with the mask
bit when we switch over to highres mode.

Stanislaw came up independently with the same patch by enforcing the
C1E workaround and debugging the fallout. I picked mine, because mine
has a changelog :)

Reported-by: poma <pomidorabelisima@gmail.com>
Debugged-by: Stanislaw Gruszka <sgruszka@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Olaf Hering <olaf@aepfle.de>
Cc: Dave Jones <davej@redhat.com>
Cc: Justin M. Forbes <jforbes@redhat.com>
Cc: Josh Boyer <jwboyer@redhat.com>
Link: http://lkml.kernel.org/r/alpine.DEB.2.02.1402111434180.21991@ionos.tec.linutronix.de
Cc: stable@vger.kernel.org # 3.10+
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-02-13 21:55:54 +01:00
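The sequence in commit dd5fd9b91a above can be replayed with two bitmasks. This is a toy model under stated assumptions: the masks are plain unsigned longs rather than cpumasks, and the function names are hypothetical. The clear_pending argument selects between the old behaviour (false) and the fixed behaviour (true).

```c
#include <stdbool.h>

/* Hypothetical stand-ins for broadcast_oneshot_mask and
 * broadcast_pending_mask. */
static unsigned long sketch_oneshot_mask;
static unsigned long sketch_pending_mask;

/* CPU goes idle in broadcast mode.  Returns true if the pending bit
 * is unexpectedly set, which is where the warning would trigger. */
static inline bool sketch_broadcast_enter(int cpu)
{
	bool warn = (sketch_pending_mask & (1UL << cpu)) != 0;

	sketch_oneshot_mask |= 1UL << cpu;
	return warn;
}

/* Broadcast interrupt on another CPU: mark 'cpu' pending and wake it. */
static inline void sketch_broadcast_fire(int cpu)
{
	if (sketch_oneshot_mask & (1UL << cpu))
		sketch_pending_mask |= 1UL << cpu;
}

/* Switch 'cpu' to highres/oneshot mode.  The fix clears the pending
 * bit together with the oneshot mask bit. */
static inline void sketch_switch_to_oneshot(int cpu, bool clear_pending)
{
	sketch_oneshot_mask &= ~(1UL << cpu);
	if (clear_pending)
		sketch_pending_mask &= ~(1UL << cpu);
}
```

Replaying enter, fire, switch, enter for CPU 1 reports the stale pending bit only when clear_pending is false.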
Tejun Heo
8541fecc04 cgroup: unexport functions
With module support gone, a lot of functions no longer need to be
exported.  Unexport them.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-13 06:58:43 -05:00
Tejun Heo
9db8de3722 cgroup: cosmetic updates to cgroup_attach_task()
cgroup_attach_task() is planned to go through restructuring.  Let's
tidy it up a bit in preparation.

* Update cgroup_attach_task() to receive the target task argument in
  @leader instead of @tsk.

* Rename @tsk to @task.

* Rename @retval to @ret.

This is purely cosmetic.

v2: get_nr_threads() was using uninitialized @task instead of @leader.
    Fixed.  Reported by Dan Carpenter.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
2014-02-13 06:58:43 -05:00
Tejun Heo
bc668c7519 cgroup: remove cgroup_taskset_cur_css() and cgroup_taskset_size()
The two functions don't have any users left.  Remove them along with
cgroup_taskset->cur_cgrp.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-13 06:58:43 -05:00
Tejun Heo
57fce0a68e cpuset: don't use cgroup_taskset_cur_css()
cgroup_taskset_cur_css() will be removed during the planned
restructuring of migration path.  The only use of
cgroup_taskset_cur_css() is finding out the old cgroup_subsys_state of
the leader in cpuset_attach().  This usage can easily be removed by
remembering the old value from cpuset_can_attach().

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-13 06:58:41 -05:00
Tejun Heo
924f0d9a20 cgroup: drop @skip_css from cgroup_taskset_for_each()
If !NULL, @skip_css makes cgroup_taskset_for_each() skip the matching
css.  The intention of the interface is to make it easy to skip css's
(cgroup_subsys_states) which already match the migration target;
however, this is entirely unnecessary as migration taskset doesn't
include tasks which are already in the target cgroup.  Drop @skip_css
from cgroup_taskset_for_each().

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Cc: Daniel Borkmann <dborkman@redhat.com>
2014-02-13 06:58:41 -05:00
Tejun Heo
cb0f1fe9ba cgroup: move css_set_rwsem locking outside of cgroup_task_migrate()
Instead of repeatedly locking and unlocking css_set_rwsem inside
cgroup_task_migrate(), update cgroup_attach_task() to grab it outside
of the loop and update cgroup_task_migrate() to use
put_css_set_locked().

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-13 06:58:41 -05:00
Tejun Heo
89c5509b0d cgroup: separate out put_css_set_locked() and remove put_css_set_taskexit()
put_css_set() is performed in two steps - it first tries to put
without grabbing css_set_rwsem if such put wouldn't make the count
zero.  If that fails, it puts after write-locking css_set_rwsem.  This
patch separates out the second phase into put_css_set_locked() which
should be called with css_set_rwsem locked.

Also, put_css_set_taskexit() is dropped and put_css_set() is made to
take @taskexit.  There are only a handful of users of these functions.
No point in providing different variants.

put_css_set_locked() will be used by later changes.  This patch doesn't
introduce any functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-13 06:58:40 -05:00
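The two-step put described in commit 89c5509b0d above can be sketched with C11 atomics. This is a model, not the cgroup code: the struct and function names are hypothetical, the write lock is not modelled (a counter records how often the locked phase runs), and "freeing" only sets a flag.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct sketch_cset {
	atomic_int refcount;
	int slow_path_taken;	/* how often the locked phase ran */
	bool freed;
};

/* Decrement *v unless it currently equals 'unless'.  Returns true if
 * the decrement happened. */
static inline bool sketch_dec_unless(atomic_int *v, int unless)
{
	int old = atomic_load(v);

	while (old != unless) {
		/* On failure 'old' is refreshed and the test repeats. */
		if (atomic_compare_exchange_weak(v, &old, old - 1))
			return true;
	}
	return false;
}

/* Second phase: the caller is assumed to hold the write lock. */
static inline void sketch_put_locked(struct sketch_cset *cset)
{
	if (atomic_fetch_sub(&cset->refcount, 1) == 1)
		cset->freed = true;	/* last reference dropped */
}

static inline void sketch_put(struct sketch_cset *cset)
{
	/* First phase: lock-free put when this cannot be the last
	 * reference. */
	if (sketch_dec_unless(&cset->refcount, 1))
		return;

	/* Second phase: take the write lock (not modelled), then put. */
	cset->slow_path_taken++;
	sketch_put_locked(cset);
}
```

Starting from a count of two, the first sketch_put() stays on the lock-free path and only the second one enters the locked phase.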
Tejun Heo
889ed9ceaa cgroup: remove css_scan_tasks()
css_scan_tasks() doesn't have any user left.  Remove it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-13 06:58:40 -05:00
Tejun Heo
d66393e54e cpuset: use css_task_iter_start/next/end() instead of css_scan_tasks()
Now that css_task_iter_start/next/end() supports blocking while
iterating, there's no reason to use css_scan_tasks() which is more
cumbersome to use and scheduled to be removed.

Convert all css_scan_tasks() usages in cpuset to
css_task_iter_start/next/end().  This simplifies the code by removing
heap allocation and callbacks.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-13 06:58:40 -05:00
Tejun Heo
96d365e0b8 cgroup: make css_set_lock a rwsem and rename it to css_set_rwsem
Currently there are two ways to walk tasks of a cgroup -
css_task_iter_start/next/end() and css_scan_tasks().  The latter
builds on the former but allows blocking while iterating.
Unfortunately, the way css_scan_tasks() is implemented is rather
nasty: it uses a priority heap of pointers to extract some number of
tasks in task creation order, loops over them invoking the callback,
and repeats that until it reaches the end.  It either requires a
preallocated heap or may fail under memory pressure.  While that is
unlikely to be problematic, the complexity is O(N^2), and the approach
is in general just nasty.

We're gonna convert all css_scan_tasks() users to
css_task_iter_start/next/end() and remove css_scan_tasks().  As
css_scan_tasks() users may block, let's convert css_set_lock to a
rwsem so that tasks can block while css_task_iter_*() is in progress.

While this does increase the chance of possible deadlock scenarios,
given the current usage, the probability is relatively low, and even
if that happens, the right thing to do is to update the iteration in
a way similar to css iterators so that it can handle blocking.

Most conversions are trivial; however, task_cgroup_path() now expects
to be called with css_set_rwsem locked instead of locking itself.
This is because the function is called with RCU read lock held and
rwsem locking should nest outside RCU read lock.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-13 06:58:40 -05:00
Tejun Heo
e406d1cfff cgroup: reimplement cgroup_transfer_tasks() without using css_scan_tasks()
Reimplement cgroup_transfer_tasks() so that it repeatedly fetches the
first task in the cgroup and then transfers it.  This achieves the same
result without using css_scan_tasks() which is scheduled to be
removed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-13 06:58:39 -05:00
Tejun Heo
07bc356ed2 cgroup: implement cgroup_has_tasks() and unexport cgroup_task_count()
cgroup_task_count() read-locks css_set_lock and walks all tasks to
count them and then returns the result.  The only thing all the users
want is determining whether the cgroup is empty or not.  This patch
implements cgroup_has_tasks() which tests whether cgroup->cset_links
is empty, replaces all cgroup_task_count() usages and unexports it.

Note that the test isn't synchronized.  This is the same as before.
The test has always been racy.

This will help the planned css_set locking update.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
2014-02-13 06:58:39 -05:00
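The change in commit 07bc356ed2 above replaces a full walk with an emptiness test. The sketch below shows the difference on a simplified singly linked list; the types and names are hypothetical, and the list merely stands in for cgroup->cset_links rather than reproducing its layout.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical list entry standing in for one link on
 * cgroup->cset_links. */
struct sketch_link {
	struct sketch_link *next;
};

struct sketch_cgroup {
	struct sketch_link *links;	/* NULL when the cgroup is empty */
};

/* Old style: an O(N) walk just to learn whether N is zero. */
static inline size_t sketch_count(const struct sketch_cgroup *cg)
{
	size_t n = 0;
	const struct sketch_link *l;

	for (l = cg->links; l; l = l->next)
		n++;
	return n;
}

/* New style: an O(1) emptiness test.  It is unsynchronized, so the
 * result can be stale by the time the caller acts on it. */
static inline bool sketch_has_tasks(const struct sketch_cgroup *cg)
{
	return cg->links != NULL;
}
```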
Tejun Heo
afeb0f9fd4 cgroup: relocate cgroup_enable_task_cg_lists()
Move it above so that prototype isn't necessary.  Let's also move the
definition of use_task_css_set_links next to it.

This is purely cosmetic.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-13 06:58:39 -05:00
Tejun Heo
56fde9e01d cgroup: enable task_cg_lists on the first cgroup mount
Tasks are not linked on their css_sets until cgroup task iteration is
actually used.  This is to avoid incurring overhead on the fork and
exit paths for systems which have cgroup compiled in but don't use it.
     
This lazy binding also affects the task migration path.  It has to be
careful so that it doesn't link tasks to css_sets when task_cg_lists
linking is not enabled yet.  Unfortunately, this conditional linking
in the migration path interferes with planned migration updates.

This patch moves the lazy binding a bit earlier, to the first cgroup
mount.  It's a clear indication that cgroup is being used on the
system and task_cg_lists linking is highly likely to be enabled soon
anyway through "tasks" and "cgroup.procs" files.

This allows cgroup_task_migrate() to always link @tsk->cg_list.  Note
that it may still race with cgroup_post_fork() but who wins that race
is inconsequential.

While at it, make use_task_css_set_links a bool, add sanity checks in
cgroup_enable_task_cg_lists() and css_task_iter_start(), and update
the former so that it is guaranteed to, and assumes it will, run only once.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-13 06:58:38 -05:00
Tejun Heo
3558557305 cgroup: drop CGRP_ROOT_SUBSYS_BOUND
Before kernfs conversion, due to the way super_block lookup works,
cgroup roots were created and made visible before being fully
initialized.  This in turn required a special flag to mark that the
root hasn't been fully initialized so that the destruction path can
tell fully bound ones from half initialized.

That flag is CGRP_ROOT_SUBSYS_BOUND and no longer necessary after the
kernfs conversion as the lookup and creation of new root are atomic
w.r.t. cgroup_mutex.  This patch removes the flag and passes the
requested subsystem mask to cgroup_setup_root() so that it can set the
respective mask bits as subsystems are bound.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-13 06:58:38 -05:00
Tejun Heo
d3ba07c3aa cgroup: disallow xattr, release_agent and name if sane_behavior
Disallow more mount options if sane_behavior.  Note that xattr used to
generate a warning.

While at it, simplify option check in cgroup_mount() and update
sane_behavior comment in cgroup.h.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-13 06:58:38 -05:00
Tejun Heo
1a11533fbd Revert "cgroup: use an ordered workqueue for cgroup destruction"
This reverts commit ab3f5faa62.
Explanation from Hugh:

  It's because more thorough testing, by others here, found that it
  wasn't always solving the problem: so I asked Tejun privately to
  hold off from sending it in, until we'd worked out why not.

  Most of our testing being on a v3.11-based kernel, it was perfectly
  possible that the problem was merely our own e.g. missing Tejun's
  8a2b753844 ("workqueue: fix ordered workqueues in NUMA setups").

  But that turned out not to be enough to fix it either. Then Filipe
  pointed out how percpu_ref_kill_and_confirm() uses call_rcu_sched()
  before we ever get to put the offline on to the workqueue: by the
  time we get to the workqueue, the ordering has already been lost.

  So, thanks for the Acks, but I'm afraid that this ordered workqueue
  solution is just not good enough: we should simply forget that patch
  and provide a different answer.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
2014-02-12 19:08:28 -05:00
Tejun Heo
776f02fa4e cgroup: remove cgroupfs_root->refcnt
Currently, cgroupfs_root and its ->top_cgroup are separately reference
counted and the latter's is ignored.  There's no reason to do this
separately.  This patch removes cgroupfs_root->refcnt and destroys
cgroupfs_root when the top_cgroup is released.

* cgroup_put() updated to ignore cgroup_is_dead() test for top
  cgroups.  cgroup_free_fn() updated to handle root destruction when
  releasing a top cgroup.

* As root destruction is now bounced through cgroup destruction, it is
  asynchronous.  Update cgroup_mount() so that it waits for pending
  release which is currently implemented using msleep().  Converting
  this to proper wait_queue isn't hard but likely unnecessary.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-12 09:29:50 -05:00
Tejun Heo
3c9c825b8b cgroup: rename cgroupfs_root->number_of_cgroups to ->nr_cgrps and make it atomic_t
root->number_of_cgroups is currently an integer protected with
cgroup_mutex.  Except for sanity checks and proc reporting, the only
place it's used is to check whether the root has any child during
remount; however, this is a bit flawed as the counter is not
decremented when the cgroup is unlinked but when it's released,
meaning that there could be an extended period where all cgroups are
removed but remount is still not allowed because some internal objects
are lingering.  While not perfect either, it'd be better to use an
emptiness test on root->top_cgroup.children.

This patch updates cgroup_remount() to test top_cgroup's children
instead, which leaves statistics printing in proc, implemented in
proc_cgroupstats_show(), as the only actual use of number_of_cgroups.
Let's shorten its name and make it an atomic_t so that we don't have
to worry about its synchronization.  It's purely auxiliary at this point.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-12 09:29:50 -05:00
Tejun Heo
e61734c55c cgroup: remove cgroup->name
cgroup->name handling became quite complicated over time involving
dedicated struct cgroup_name for RCU protection.  Now that cgroup is
on kernfs, we can drop all of it and simply use kernfs_name/path() and
friends.  Replace cgroup->name and all related code with kernfs
name/path constructs.

* Reimplement cgroup_name() and cgroup_path() as thin wrappers on top
  of kernfs counterparts, which involves semantic changes.
  pr_cont_cgroup_name() and pr_cont_cgroup_path() added.

* cgroup->name handling dropped from cgroup_rename().

* All users of cgroup_name/path() updated to the new semantics.  Users
  which were formatting the string just to printk them are converted
  to use pr_cont_cgroup_name/path() instead, which simplifies things
  quite a bit.  As cgroup_name() no longer requires RCU read lock
  around it, RCU lockings which were protecting only cgroup_name() are
  removed.

v2: Comment above oom_info_lock updated as suggested by Michal.

v3: dummy_top doesn't have a kn associated and
    pr_cont_cgroup_name/path() ended up calling the matching kernfs
    functions with NULL kn leading to oops.  Test for NULL kn and
    print "/" if so.  This issue was reported by Fengguang Wu.

v4: Rebased on top of 0ab02ca8f8 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
2014-02-12 09:29:50 -05:00
Tejun Heo
6f30558f37 cgroup: make cgroup hold onto its kernfs_node
cgroup currently releases its kernfs_node when it gets removed.  While
not buggy, this makes cgroup->kn access rules more complicated than
necessary and leads to things like get/put protection around
kernfs_remove() in cgroup_destroy_locked().  In addition, we want to
use kernfs_name/path() and friends but also want to be able to
determine a cgroup's name between removal and release.

This patch makes cgroup hold onto its kernfs_node until freed so that
cgroup->kn is always accessible.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-12 09:29:50 -05:00
Tejun Heo
21a2d3430b cgroup: simplify dynamic cftype addition and removal
Dynamic cftype addition and removal using cgroup_add/rm_cftypes()
respectively has been quite hairy due to vfs i_mutex.  As i_mutex
nests outside cgroup_mutex, cgroup_mutex has to be released and
regrabbed on each iteration through the hierarchy complicating the
process.  Now that i_mutex is no longer in play, it can be simplified.

* Just holding cgroup_tree_mutex is enough.  No need to meddle with
  cgroup_mutex.

* No reason to do the unlock - relock - check serial_nr dance.
  Everything can be done atomically while holding cgroup_tree_mutex.

* cgroup_cfts_prepare() is replaced with direct locking of
  cgroup_tree_mutex.

* cgroup_cfts_commit() no longer fiddles with locking.  It just
  applies the cftypes change to the existing cgroups in the hierarchy.
  Renamed to cgroup_cfts_apply().

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-12 09:29:49 -05:00
Tejun Heo
0adb070426 cgroup: remove cftype_set
cftype_set was added primarily to allow registering the same cftype
array more than once for different subsystems.  Nobody uses or needs
such a thing and it's already broken because each cftype has ->ss
pointer which is initialized during registration.

Let's add list_head ->node to cftype and use the first cftype entry in
the array to link them instead of allocating separate cftype_set.
While at it, trigger WARN if cft seems previously initialized during
registration.

This simplifies cftype handling a bit.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-12 09:29:48 -05:00