Commit graph

18375 commits

Author SHA1 Message Date
Paul E. McKenney
48d684fdad rcutorture: Run rcu_torture_writer at normal priority
There are usually lots of readers and only one writer, so if there has
to be a choice, we would want rcu_torture_writer to win.  This commit
therefore removes the set_user_nice() from rcu_torture_writer().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-05-14 09:46:26 -07:00
Thomas Gleixner
424c1b6820 rcutorture: Add missing destroy_timer_on_stack()
The rcu_torture_reader() function uses an on-stack timer_list structure
which it initializes with setup_timer_on_stack().  However, it fails to
use destroy_timer_on_stack() before exiting, which results in leaking a
tracking object if DEBUG_OBJECTS is enabled.  This commit therefore
invokes destroy_timer_on_stack() to avoid this leakage.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-05-14 09:46:24 -07:00
Paul E. McKenney
f0bf8fab4f rcutorture: Explicitly test synchronous grace-period primitives
The original rcu_torture_writer() avoided testing the synchronous
grace-period primitives because they were simply wrappers around
call_rcu() invocations.  The testing of these synchronous primitives
was delegated to the fake writers.  However, there really is no excuse
not to test them, especially in the case of SRCU, where the wrappering
is somewhat more elaborate.  This commit therefore makes the default
rcutorture parameters cause rcu_torture_writer() to include synchronous
grace-period primitives in its testing.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-05-14 09:46:22 -07:00
Paul E. McKenney
a48f3fad4f rcutorture: Add tests for get_state_synchronize_rcu()
This commit adds rcutorture testing for get_state_synchronize_rcu()
and cond_synchronize_rcu().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-05-14 09:46:21 -07:00
Paul E. McKenney
d0d0606e2c rcutorture: Check for rcu_torture_fqs creation errors
The return value from torture_create_kthread() is currently ignored
when creating the rcu_torture_fqs kthread.  This commit therefore
captures the return value so that it can be tested for errors.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-05-14 09:46:17 -07:00
Iulia Manda
5ed63b199c torture: Notice if an all-zero cpumask is passed inside a critical section
In torture_shuffle_tasks function, the check if an all-zero mask can
be passed to set_cpus_allowed_ptr() is redundant after clearing the
shuffle_idle_cpu bit. If the mask had more than one bit set, after
clearing a bit it has at least one bit set. If the mask had only
one bit set, a check is made at the beginning, where the function
returns, as there is no need to shuffle only one cpu.

Also, this code is executed inside a critical section, delimited by
get_online_cpus(), and put_online_cpus(), preventing CPUs from leaving between
the check of num_online_cpus and the calls to set_cpus_allowed_ptr() function.

Signed-off-by: Iulia Manda <iulia.manda21@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-05-14 09:46:14 -07:00
Paul E. McKenney
64e4b43ae0 rcutorture: Make rcu_torture_reader() use cond_resched()
The rcu_torture_reader() function currently uses schedule().  This commit
therefore speeds things up a bit by substituting cond_resched().
This change makes rcu_torture_reader() more CPU-bound, so this commit
also adjusts the number of readers (the "nreaders" module parameter,
which feeds into the "nrealreaders" variable) to allow one CPU to be
free of readers on SMP systems.  The point of this is to increase the
probability that readers will be watching while an updater makes a change.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-05-14 09:46:13 -07:00
Paul E. McKenney
ac1bea8578 sched,rcu: Make cond_resched() report RCU quiescent states
Given a CPU running a loop containing cond_resched(), with no
other tasks runnable on that CPU, RCU will eventually report RCU
CPU stall warnings due to lack of quiescent states.  Fortunately,
every call to cond_resched() is a perfectly good quiescent state.
Unfortunately, invoking rcu_note_context_switch() is a bit heavyweight
for cond_resched(), especially given the need to disable preemption,
and, for RCU-preempt, interrupts as well.

This commit therefore maintains a per-CPU counter that causes
cond_resched(), cond_resched_lock(), and cond_resched_softirq() to call
rcu_note_context_switch(), but only about once per 256 invocations.
This ratio was chosen in keeping with the relative time constants of
RCU grace periods.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-05-14 09:46:11 -07:00
Paul E. McKenney
afea227fd4 rcutorture: Export RCU grace-period kthread wait state to rcutorture
This commit allows rcutorture to print additional state for the
RCU grace-period kthreads in cases where RCU seems reluctant to
start a new grace period.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-05-14 09:46:09 -07:00
Paul E. McKenney
945fa9c631 torture: Dump ftrace buffer when the RCU grace period stalls
This commit adds a call to rcutorture_trace_dump() to dump the ftrace
buffer when the RCU grace period stalls in order to help debug the
stall.  Note that this is different than the RCU CPU stall warning,
as it is rcutorture detecting the stall rather than the underlying RCU
implementation.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-05-14 09:46:07 -07:00
Paul E. McKenney
ab7d45053f torture: Increase stutter-end intensity
Currently, all stuttered kthreads block a jiffy at a time, which can
result in them starting at different times.  (Note: This is not an
energy-efficiency problem unless you run torture tests in production,
in which case you have other problems!)  This commit increases the
intensity of the restart event by causing kthreads to spin through the
last jiffy, restarting when they see the variable change.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-05-14 09:46:02 -07:00
Paul E. McKenney
0d6821d5f7 torture: Include "Stopping" string to torture_kthread_stopping()
Currently, torture_kthread_stopping() prints only the name of the
kthread that is stopping, which can be unedifying.  This commit therefore
adds "Stopping" to make things more evident.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-05-14 09:45:59 -07:00
Paul E. McKenney
589a8f5950 rcutorture: Print negatives for SRCU counter wraparound
The srcu_torture_stats() function prints SRCU's per-CPU c[] array with
an unsigned format, which means that the number one less than zero is
a very large number.  This commit therefore prints this array with a
signed format in order to improve readability of the rcutorture output.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-05-14 09:45:58 -07:00
Rashika Kheria
b3b8a4d42b rcutorture: Mark function as static in kernel/rcu/torture.c
Mark functions as static in kernel/rcu/torture.c because they are not
used outside this file.

This eliminates the following warning in kernel/rcu/torture.c:
kernel/rcu/torture.c:902:6: warning: no previous prototype for ‘rcutorture_trace_dump’ [-Wmissing-prototypes]
kernel/rcu/torture.c:1572:6: warning: no previous prototype for ‘rcu_torture_barrier_cbf’ [-Wmissing-prototypes]

Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-05-14 09:45:53 -07:00
Steven Rostedt (Red Hat)
f1b2f2bd58 ftrace: Remove FTRACE_UPDATE_MODIFY_CALL_REGS flag
As the decision to what needs to be done (converting a call to the
ftrace_caller to ftrace_caller_regs or to convert from ftrace_caller_regs
to ftrace_caller) can easily be determined from the rec->flags of
FTRACE_FL_REGS and FTRACE_FL_REGS_EN, there's no need to have the
ftrace_check_record() return either a UPDATE_MODIFY_CALL_REGS or a
UPDATE_MODIFY_CALL. Just he latter is enough. This added flag causes
more complexity than is required. Remove it.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-05-14 11:37:30 -04:00
Steven Rostedt (Red Hat)
7c0868e03b ftrace: Use the ftrace_addr helper functions to find the ftrace_addr
With the moving of the functions that determine what the mcount call site
should be replaced with into the generic code, there is a few places
in the generic code that can use them instead of hard coding it as it
does.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-05-14 11:37:29 -04:00
Steven Rostedt (Red Hat)
7413af1fb7 ftrace: Make get_ftrace_addr() and get_ftrace_addr_old() global
Move and rename get_ftrace_addr() and get_ftrace_addr_old() to
ftrace_get_addr_new() and ftrace_get_addr_curr() respectively.

This moves these two helper functions in the generic code out from
the arch specific code, and renames them to have a better generic
name. This will allow other archs to use them as well as makes it
a bit easier to work on getting separate trampolines for different
functions.

ftrace_get_addr_new() returns the trampoline address that the mcount
call address will be converted to.

ftrace_get_addr_curr() returns the trampoline address of what the
mcount call address currently jumps to.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-05-14 11:37:29 -04:00
Steven Rostedt (Red Hat)
68f40969f0 ftrace: Always inline ftrace_hash_empty() helper function
The ftrace_hash_empty() function is a simple test:

	return !hash || !hash->count;

But gcc seems to want to make it a call. As this is in an extreme
hot path of the function tracer, there's no reason it needs to be
a call. I only wrote it to be a helper function anyway, otherwise
it would have been inlined manually.

Force gcc to inline it, as it could have also been a macro.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-05-14 11:37:28 -04:00
Steven Rostedt (Red Hat)
19eab4a472 ftrace: Write in missing comment from a very old commit
Back in 2011 Commit ed926f9b35 "ftrace: Use counters to enable
functions to trace" changed the way ftrace accounts for enabled
and disabled traced functions. There was a comment started as:

	/*
	 *
	 */

But never finished. Well, that's rather useless. I probably forgot
to save the file before committing it. And it passed review from all
this time.

Anyway, better late than never. I updated the comment to express what
is happening in that somewhat complex code.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-05-14 11:37:27 -04:00
Steven Rostedt (Red Hat)
66209a5bd4 ftrace: Remove boolean of hash_enable and hash_disable
Commit 4104d326b6 "ftrace: Remove global function list and call
function directly" cleaned up the global_ops filtering and made
the code simpler, but it left a variable "hash_enable" that was used
to know if the hash functions should be updated or not. It was
updated if the global_ops did not override them. As the global_ops
are now no different than any other ftrace_ops, the hash always
gets updated and there's no reason to use the hash_enable boolean.

The same goes for hash_disable used in ftrace_shutdown().

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-05-14 11:37:25 -04:00
Tejun Heo
9d755d33f0 cgroup: use cgroup->self.refcnt for cgroup refcnting
Currently cgroup implements refcnting separately using atomic_t
cgroup->refcnt.  The destruction paths of cgroup and css are rather
complex and bear a lot of similiarities including the use of RCU and
bouncing to a work item.

This patch makes cgroup use the refcnt of self css for refcnting
instead of using its own.  This makes cgroup refcnting use css's
percpu refcnt and share the destruction mechanism.

* css_release_work_fn() and css_free_work_fn() are updated to handle
  both csses and cgroups.  This is a bit messy but should do until we
  can make cgroup->self a full css, which currently can't be done
  thanks to multiple hierarchies.

* cgroup_destroy_locked() now performs
  percpu_ref_kill(&cgrp->self.refcnt) instead of cgroup_put(cgrp).

* Negative refcnt sanity check in cgroup_get() is no longer necessary
  as percpu_ref already handles it.

* Similarly, as a cgroup which hasn't been killed will never be
  released regardless of its refcnt value and percpu_ref has sanity
  check on kill, cgroup_is_dead() sanity check in cgroup_put() is no
  longer necessary.

* As whether a refcnt reached zero or not can only be decided after
  the reference count is killed, cgroup_root->cgrp's refcnting can no
  longer be used to decide whether to kill the root or not.  Let's
  make cgroup_kill_sb() explicitly initiate destruction if the root
  doesn't have any children.  This makes sense anyway as unmounted
  cgroup hierarchy without any children should be destroyed.

While this is a bit messy, this will allow pushing more bookkeeping
towards cgroup->self and thus handling cgroups and csses in more
uniform way.  In the very long term, it should be possible to
introduce a base subsystem and convert the self css to a proper one
making things whole lot simpler and unified.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-14 09:15:02 -04:00
Tejun Heo
9395a45004 cgroup: enable refcnting for root csses
Currently, css_get(), css_tryget() and css_tryget_online() are noops
for root csses as an optimization; however, we're planning to use css
refcnts to track of cgroup lifetime too and root cgroups also need to
be reference counted.  Since css has been converted to percpu_refcnt,
the overhead of refcnting is miniscule and this optimization isn't too
meaningful anymore.  Furthermore, controllers which optimize the root
cgroup often never even invoke these functions in their hot paths.

This patch enables refcnting for root csses too.  This makes CSS_ROOT
flag unused and removes it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-14 09:15:02 -04:00
Tejun Heo
25e15d8350 cgroup: bounce css release through css->destroy_work
css release is planned to do more and would require process context.
Bounce it through css->destroy_work.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-14 09:15:02 -04:00
Tejun Heo
249f3468a2 cgroup: remove cgroup_destory_css_killed()
cgroup_destroy_css_killed() is cgroup destruction stage which happens
after all csses are offlined.  After the recent updates, it no longer
does anything other than putting the base reference.  This patch
removes the function and makes cgroup_destroy_locked() put the base
ref at the end isntead.

This also makes cgroup->nr_css unnecessary.  Removed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-14 09:15:01 -04:00
Tejun Heo
4e4e284723 cgroup: move cgroup->sibling unlinking to cgroup_put()
Move cgroup->sibling unlinking from cgroup_destroy_css_killed() to
cgroup_put().  This is later but still before the RCU grace period, so
it doesn't break css_next_child() although there now is a larger
window in which a dead cgroup is visible during css iteration.  As css
iteration always could have included offline csses, this doesn't
affect correctness; however, it does make css_next_child() fall back
to reiterting mode more often.  This also makes cgroup_put() directly
take cgroup_mutex, which limits where it can be called from.  These
are not immediately problematic and will be dealt with later.

This change enables simplification of cgroup destruction path.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-14 09:15:01 -04:00
Tejun Heo
9e4173e1f2 cgroup: move check_for_release(parent) call to the end of cgroup_destroy_locked()
Currently, check_for_release() on the parent of a destroyed cgroup is
invoked from cgroup_destroy_css_killed().  This is because this is
where the destroyed cgroup can be removed from the parent's children
list.  check_for_release() tests the emptiness of the list directly,
so invoking it before removing the cgroup from the list makes it think
that the parent still has children even when it no longer does.

This patch updates check_for_release() to use
cgroup_has_live_children() instead of directly testing ->children
emptiness and moves check_for_release(parent) earlier to the end of
cgroup_destroy_locked().  As cgroup_has_live_children() ignores
cgroups marked DEAD, check_for_release() functions correctly as long
as it's called after asserting DEAD.

This makes release notification slightly more timely and more
importantly enables further simplification of cgroup destruction path.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-14 09:15:01 -04:00
Tejun Heo
cbc125efad cgroup: separate out cgroup_has_live_children() from cgroup_destroy_locked()
We're expecting another user.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-14 09:15:01 -04:00
Tejun Heo
9d800df12d cgroup: rename cgroup->dummy_css to ->self and move it to the top
cgroup->dummy_css is used as the placeholder css when performing css
oriended operations on the cgroup.  We're gonna shift more cgroup
management to this css.  Let's rename it to ->self and move it to the
top.

This is pure rename and field relocation.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-14 09:15:00 -04:00
Tejun Heo
a015edd26e cgroup: use restart_syscall() for mount retries
cgroup_mount() uses dumb delay-and-retry logic to wait for cgroup_root
which is being destroyed.  The retry currently loops inside
cgroup_mount() proper.  This patch makes it return with
restart_syscall() instead so that retry travels out to userland
boundary.

This slightly simplifies the logic and more importantly makes the
retry logic behave better when the wait for some reason becomes
lengthy or infinite by allowing the operation to be suspended or
terminated from userland.

v2: The original patch forgot to free memory allocated for @opts.
    Fixed.  Caught by Li Zefan.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-14 09:15:00 -04:00
Oleg Nesterov
b02ef20a9f uprobes/x86: Fix the wrong ->si_addr when xol triggers a trap
If the probed insn triggers a trap, ->si_addr = regs->ip is technically
correct, but this is not what the signal handler wants; we need to pass
the address of the probed insn, not the address of xol slot.

Add the new arch-agnostic helper, uprobe_get_trap_addr(), and change
fill_trap_info() and math_error() to use it. !CONFIG_UPROBES case in
uprobes.h uses a macro to avoid include hell and ensure that it can be
compiled even if an architecture doesn't define instruction_pointer().

Test-case:

	#include <signal.h>
	#include <stdio.h>
	#include <unistd.h>

	extern void probe_div(void);

	void sigh(int sig, siginfo_t *info, void *c)
	{
		int passed = (info->si_addr == probe_div);
		printf(passed ? "PASS\n" : "FAIL\n");
		_exit(!passed);
	}

	int main(void)
	{
		struct sigaction sa = {
			.sa_sigaction	= sigh,
			.sa_flags	= SA_SIGINFO,
		};

		sigaction(SIGFPE, &sa, NULL);

		asm (
			"xor %ecx,%ecx\n"
			".globl probe_div; probe_div:\n"
			"idiv %ecx\n"
		);

		return 0;
	}

it fails if probe_div() is probed.

Note: show_unhandled_signals users should probably use this helper too,
but we need to cleanup them first.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
2014-05-14 13:57:28 +02:00
Oleg Nesterov
29dedee0e6 uprobes: Add mem_cgroup_charge_anon() into uprobe_write_opcode()
Hugh says:

    The one I noticed was that it forgets all about memcg (because
    it was copied from KSM, and there the replacement page has already
    been charged to a memcg). See how mm/memory.c do_anonymous_page()
    does a mem_cgroup_charge_anon().

Hopefully not a big problem, uprobes is a system-wide thing and only
root can insert the probes. But I agree, should be fixed anyway.

Add mem_cgroup_{un,}charge_anon() into uprobe_write_opcode(). To simplify
the error handling (and avoid the new "uncharge" label) the patch also
moves anon_vma_prepare() up before we alloc/charge the new page.

While at it fix the comment about ->mmap_sem, it is held for write.

Suggested-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
2014-05-14 13:57:24 +02:00
Rusty Russell
4982223e51 module: set nx before marking module MODULE_STATE_COMING.
We currently set RO & NX on modules very late: after we move them from
MODULE_STATE_UNFORMED to MODULE_STATE_COMING, and after we call
parse_args() (which can exec code in the module).

Much better is to do it in complete_formation() and then call
the notifier.

This means that the notifiers will be called on a module which
is already RO & NX, so that may cause problems (ftrace already
changed so they're unaffected).

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2014-05-14 10:55:47 +09:30
Paul E. McKenney
da601c63fd torture: Intensify locking test
The current lock_torture_writer() spends too much time sleeping and not
enough time hammering locks, as in an eight-CPU test will often only be
utilizing a CPU or two.  This commit therefore makes lock_torture_writer()
sleep less and hammer more.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-05-13 17:03:01 -07:00
Paul E. McKenney
ad0dc7f94d rcutorture: Add forward-progress checking for writer
The rcutorture output currently does not distinguish between stalls in
the RCU implementation and stalls in the rcu_torture_writer() kthreads.
This commit therefore adds some diagnostics to help distinguish between
these two conditions, at least for the non-SRCU implementations.  (SRCU
does not provide evidence of update-side forward progress by design.)

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-05-13 11:18:18 -07:00
Tejun Heo
8353da1f91 cgroup: remove cgroup_tree_mutex
cgroup_tree_mutex was introduced to work around the circular
dependency between cgroup_mutex and kernfs active protection - some
kernfs file and directory operations needed cgroup_mutex putting
cgroup_mutex under active protection but cgroup also needs to be able
to access cgroup hierarchies and cftypes to determine which
kernfs_nodes need to be removed.  cgroup_tree_mutex nested above both
cgroup_mutex and kernfs active protection and used to protect the
hierarchy and cftypes.  While this worked, it added a lot of double
lockings and was generally cumbersome.

kernfs provides a mechanism to opt out of active protection and cgroup
was already using it for removal and subtree_control.  There's no
reason to mix both methods of avoiding circular locking dependency and
the preceding cgroup_kn_lock_live() changes applied it to all relevant
cgroup kernfs operations making it unnecessary to nest cgroup_mutex
under kernfs active protection.  The previous patch reversed the
original lock ordering and put cgroup_mutex above kernfs active
protection.

After these changes, all cgroup_tree_mutex usages are now accompanied
by cgroup_mutex making the former completely redundant.  This patch
removes cgroup_tree_mutex and all its usages.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-13 12:19:23 -04:00
Tejun Heo
01f6474ce0 cgroup: nest kernfs active protection under cgroup_mutex
After the recent cgroup_kn_lock_live() changes, cgroup_mutex is no
longer nested below kernfs active protection.  The two don't have any
relationship now.

This patch nests kernfs active protection under cgroup_mutex.  All
cftype operations now require both cgroup_tree_mutex and cgroup_mutex,
temporary cgroup_mutex releases over kernfs operations are removed,
and cgroup_add/rm_cftypes() grab both mutexes.

This makes cgroup_tree_mutex redundant, which will be removed by the
next patch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-13 12:19:23 -04:00
Tejun Heo
e76ecaeef6 cgroup: use cgroup_kn_lock_live() in other cgroup kernfs methods
Make __cgroup_procs_write() and cgroup_release_agent_write() use
cgroup_kn_lock_live() and cgroup_kn_unlock() instead of
cgroup_lock_live_group().  This puts the operations under both
cgroup_tree_mutex and cgroup_mutex protection without circular
dependency from kernfs active protection.  Also, this means that
cgroup_mutex is no longer nested below kernfs active protection.
There is no longer any place where the two locks interact.

This leaves cgroup_lock_live_group() without any user.  Removed.

This will help simplifying cgroup locking.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-13 12:19:23 -04:00
Tejun Heo
a9746d8da7 cgroup: factor out cgroup_kn_lock_live() and cgroup_kn_unlock()
cgroup_mkdir(), cgroup_rmdir() and cgroup_subtree_control_write()
share the logic to break active protection so that they can grab
cgroup_tree_mutex which nests above active protection and/or remove
self.  Factor out this logic into cgroup_kn_lock_live() and
cgroup_kn_unlock().

This patch doesn't introduce any functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-13 12:19:22 -04:00
Tejun Heo
cfc79d5bec cgroup: move cgroup->kn->priv clearing to cgroup_rmdir()
The ->priv field of a cgroup directory kernfs_node points back to the
cgroup.  This field is RCU cleared in cgroup_destroy_locked() for
non-kernfs accesses from css_tryget_from_dir() and
cgroupstats_build().

As these are only applicable to cgroups which finished creation
successfully and fully initialized cgroups are always removed by
cgroup_rmdir(), this can be safely moved to the end of cgroup_rmdir().

This will help simplifying cgroup locking and shouldn't introduce any
behavior difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-13 12:19:22 -04:00
Tejun Heo
ddab2b6e0e cgroup: grab cgroup_mutex earlier in cgroup_subtree_control_write()
Move cgroup_lock_live_group() invocation upwards to right below
cgroup_tree_mutex in cgroup_subtree_control_write().  This is to help
the planned locking simplification.

This doesn't make any userland-visible behavioral changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-13 12:19:22 -04:00
Tejun Heo
b3bfd983ca cgroup: collapse cgroup_create() into croup_mkdir()
cgroup_mkdir() is the sole user of cgroup_create().  Let's collapse
the latter into the former.  This will help simplifying locking.
While at it, remove now stale comment about inode locking.

This patch doesn't introduce any functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-13 12:19:22 -04:00
Tejun Heo
ba0f4d7615 cgroup: reorganize cgroup_create()
Reorganize cgroup_create() so that all paths share unlock out path.

* All err_* labels are renamed to out_* as they're now shared by both
  success and failure paths.

* @err renamed to @ret for the similar reason as above and so that
  it's more consistent with other functions.

* cgroup memory allocation moved after locking so that freeing failed
  cgroup happens before unlocking.  While this moves more code inside
  critical section, memory allocations inside cgroup locking are
  already pretty common and this is unlikely to make any noticeable
  difference.

* While at it, replace a stray @parent->root dereference with @root.

This reorganization will help simplifying locking.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-13 12:19:22 -04:00
Tejun Heo
b7fc5ad235 cgroup: remove cgroup->control_kn
Now that cgroup_subtree_control_write() has access to the associated
kernfs_open_file and thus the kernfs_node, there's no need to cache it
in cgroup->control_kn on creation.  Remove cgroup->control_kn and use
@of->kn directly.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-13 12:16:22 -04:00
Tejun Heo
acbef755f4 cgroup: convert "tasks" and "cgroup.procs" handle to use cftype->write()
cgroup_tasks_write() and cgroup_procs_write() are currently using
cftype->write_u64().  This patch converts them to use cftype->write()
instead.  This allows access to the associated kernfs_open_file which
will be necessary to implement the planned kernfs active protection
manipulation for these files.

This shifts buffer parsing to attach_task_by_pid() and makes it return
@nbytes on success.  Let's rename it to __cgroup_procs_write() to
clearly indicate that this is a write handler implementation.

This patch doesn't introduce any visible behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-13 12:16:22 -04:00
Tejun Heo
6770c64e5c cgroup: replace cftype->trigger() with cftype->write()
cftype->trigger() is pointless.  It's trivial to ignore the input
buffer from a regular ->write() operation.  Convert all ->trigger()
users to ->write() and remove ->trigger().

This patch doesn't introduce any visible behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
2014-05-13 12:16:21 -04:00
Tejun Heo
451af504df cgroup: replace cftype->write_string() with cftype->write()
Convert all cftype->write_string() users to the new cftype->write()
which maps directly to kernfs write operation and has full access to
kernfs and cgroup contexts.  The conversions are mostly mechanical.

* @css and @cft are accessed using of_css() and of_cft() accessors
  respectively instead of being specified as arguments.

* Should return @nbytes on success instead of 0.

* @buf is not trimmed automatically.  Trim if necessary.  Note that
  blkcg and netprio don't need this as the parsers already handle
  whitespaces.

cftype->write_string() has no user left after the conversions and
removed.

While at it, remove unnecessary local variable @p in
cgroup_subtree_control_write() and stale comment about
CGROUP_LOCAL_BUFFER_SIZE in cgroup_freezer.c.

This patch doesn't introduce any visible behavior changes.

v2: netprio was missing from conversion.  Converted.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Aristeu Rozanski <arozansk@redhat.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: "David S. Miller" <davem@davemloft.net>
2014-05-13 12:16:21 -04:00
Tejun Heo
b41686401e cgroup: implement cftype->write()
During the recent conversion to kernfs, cftype's seq_file operations
are updated so that they are directly mapped to kernfs operations and
thus can fully access the associated kernfs and cgroup contexts;
however, write path hasn't seen similar updates and none of the
existing write operations has access to, for example, the associated
kernfs_open_file.

Let's introduce a new operation cftype->write() which maps directly to
the kernfs write operation and has access to all the arguments and
contexts.  This will replace ->write_string() and ->trigger() and ease
manipulation of kernfs active protection from cgroup file operations.

Two accessors - of_cft() and of_css() - are introduced to enable
accessing the associated cgroup context from cftype->write() which
only takes kernfs_open_file for the context information.  The
accessors for seq_file operations - seq_cft() and seq_css() - are
rewritten to wrap the of_ accessors.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-13 12:16:21 -04:00
Tejun Heo
ec903c0c85 cgroup: rename css_tryget*() to css_tryget_online*()
Unlike the more usual refcnting, what css_tryget() provides is the
distinction between online and offline csses instead of protection
against upping a refcnt which already reached zero.  cgroup is
planning to provide actual tryget which fails if the refcnt already
reached zero.  Let's rename the existing trygets so that they clearly
indicate that they're onliness.

I thought about keeping the existing names as-are and introducing new
names for the planned actual tryget; however, given that each
controller participates in the synchronization of the online state, it
seems worthwhile to make it explicit that these functions are about
on/offline state.

Rename css_tryget() to css_tryget_online() and css_tryget_from_dir()
to css_tryget_online_from_dir().  This is pure rename.

v2: cgroup_freezer grew new usages of css_tryget().  Update
    accordingly.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
2014-05-13 12:11:01 -04:00
Tejun Heo
46cfeb043b cgroup: use release_agent_path_lock in cgroup_release_agent_show()
release_path is now protected by release_agent_path_lock to allow
accessing it without grabbing cgroup_mutex; however,
cgroup_release_agent_show() was still grabbing cgroup_mutex.  Let's
convert it to release_agent_path_lock so that we don't have to worry
about this one for the planned locking updates.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-13 12:11:00 -04:00
Tejun Heo
7d331fa985 cgroup: use restart_syscall() for retries after offline waits in cgroup_subtree_control_write()
After waiting for a child to finish offline,
cgroup_subtree_control_write() jumps up to retry from after the input
parsing and active protection breaking.  This retry makes the
scheduled locking update - removal of cgroup_tree_mutex - more
difficult.  Let's simplify it by returning with restart_syscall() for
retries.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-13 12:11:00 -04:00