Make it easier to break up printk into bite-sized chunks.
Remove printk path/filename from comment.
Signed-off-by: Joe Perches <joe@perches.com>
Cc: Samuel Thibault <samuel.thibault@ens-lyon.org>
Cc: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 3105b86a9f ("mm: sched: numa: Control enabling and disabling of
NUMA balancing if !SCHED_DEBUG") defined numabalancing_enabled to
control the enabling and disabling of automatic NUMA balancing, but it
is never used.
I believe the intention was to use this in place of sched_feat_numa(NUMA).
Currently, if SCHED_DEBUG is not defined, sched_feat_numa(NUMA) will
never be changed from the initial "false".
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull networking fixes from David Miller:
1) Fix association failures not triggering a connect-failure event in
cfg80211, from Johannes Berg.
2) Eliminate a potential NULL deref with older iptables tools when
configuring xt_socket rules, from Eric Dumazet.
3) Missing RTNL locking in wireless regulatory code, from Johannes
Berg.
4) Fix OOPS caused by firmware loading races in ath9k_htc, from Alexey
Khoroshilov.
5) Fix usb URB leak in usb_8dev CAN driver, also from Alexey
Khoroshilov.
6) VXLAN namespace teardown fails to unregister devices, from Stephen
Hemminger.
7) Fix multicast settings getting dropped by firmware in qlcnic driver,
from Sucheta Chakraborty.
8) Add sysctl range enforcement for tcp_syn_retries, from Michal Tesar.
9) Fix a nasty bug in bridging where an active timer would get
reinitialized with a setup_timer() call. From Eric Dumazet.
10) Fix use after free in new mlx5 driver, from Dan Carpenter.
11) Fix freed pointer reference in ipv6 multicast routing on namespace
cleanup, from Hannes Frederic Sowa.
12) Some usbnet drivers report TSO and SG in their feature set, but the
usbnet layer doesn't really support them. From Eric Dumazet.
13) Fix crash on EEH errors in tg3 driver, from Gavin Shan.
14) Drop cb_lock when requesting modules in genetlink, from Stanislaw
Gruszka.
15) Kernel stack leaks in cbq scheduler and af_key pfkey messages, from
Dan Carpenter.
16) FEC driver erroneously signals NETDEV_TX_BUSY on transmit leading to
endless loops, from Uwe Kleine-König.
17) Fix hangs from loading mvneta driver, from Arnaud Patard.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (84 commits)
mlx5: fix error return code in mlx5_alloc_uuars()
mvneta: Try to fix mvneta when compiled as module
mvneta: Fix hang when loading the mvneta driver
atl1c: Fix misuse of netdev_alloc_skb in refilling rx ring
genetlink: fix usage of NLM_F_EXCL or NLM_F_REPLACE
af_key: more info leaks in pfkey messages
net/fec: Don't let ndo_start_xmit return NETDEV_TX_BUSY without link
net_sched: Fix stack info leak in cbq_dump_wrr().
igb: fix vlan filtering in promisc mode when not in VT mode
ixgbe: Fix Tx Hang issue with lldpad on 82598EB
genetlink: release cb_lock before requesting additional module
net: fec: workaround stop tx during errata ERR006358
qlcnic: Fix diagnostic interrupt test for 83xx adapters.
qlcnic: Fix setting Guest VLAN
qlcnic: Fix operation type and command type.
qlcnic: Fix initialization of work function.
Revert "atl1c: Fix misuse of netdev_alloc_skb in refilling rx ring"
atl1c: Fix misuse of netdev_alloc_skb in refilling rx ring
net/tg3: Fix warning from pci_disable_device()
net/tg3: Fix kernel crash
...
The "break" used in the do_for_each_event_file() is used as an optimization
as the loop is really a double loop. The loop searches all event files
for each trace_array. There's only one matching event file per trace_array
and after we find the event file for the trace_array, the break is used
to jump to the next trace_array and start the search there.
As this is not a standard way of using "break" in C code, it requires
a comment right before the break to let people know what is going on.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Change trace_remove_event_call(call) to return the error if this
call is active. This is what the callers assume but can't verify
outside of the tracing locks. Both trace_kprobe.c/trace_uprobe.c
need the additional changes, unregister_trace_probe() should abort
if trace_remove_event_call() fails.
The caller is going to free this call/file so we must ensure that
nobody can use them after trace_remove_event_call() succeeds.
debugfs should be fine after the previous changes and event_remove()
does TRACE_REG_UNREGISTER, but still there are 2 reasons why we need
the additional checks:
- There could be a perf_event(s) attached to this tp_event, so the
patch checks ->perf_refcount.
- TRACE_REG_UNREGISTER can be suppressed by FTRACE_EVENT_FL_SOFT_MODE,
so we simply check FTRACE_EVENT_FL_ENABLED protected by event_mutex.
Link: http://lkml.kernel.org/r/20130729175033.GB26284@redhat.com
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
This enables us to lookup a cgroup by its id.
v4:
- add a comment for idr_remove() in cgroup_offline_fn().
v3:
- on success, idr_alloc() returns the id but not 0, so fix the BUG_ON()
in cgroup_init().
- pass the right value to idr_alloc() so that the id for dummy cgroup is 0.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Tejun Heo <tj@kernel.org>
Constantly use @cset for css_set variables and use @cgrp as cgroup
variables.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
We can use struct cfent instead.
v2:
- remove cgroup_seqfile_release().
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
This should have been removed in commit d7eeac1913
("cgroup: hold cgroup_mutex before calling css_offline").
While at it, update the comments.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
There's been a nasty bug that would show up and not give much info.
The bug displayed the following warning:
WARNING: at kernel/trace/ftrace.c:1529 __ftrace_hash_rec_update+0x1e3/0x230()
Pid: 20903, comm: bash Tainted: G O 3.6.11+ #38405.trunk
Call Trace:
[<ffffffff8103e5ff>] warn_slowpath_common+0x7f/0xc0
[<ffffffff8103e65a>] warn_slowpath_null+0x1a/0x20
[<ffffffff810c2ee3>] __ftrace_hash_rec_update+0x1e3/0x230
[<ffffffff810c4f28>] ftrace_hash_move+0x28/0x1d0
[<ffffffff811401cc>] ? kfree+0x2c/0x110
[<ffffffff810c68ee>] ftrace_regex_release+0x8e/0x150
[<ffffffff81149f1e>] __fput+0xae/0x220
[<ffffffff8114a09e>] ____fput+0xe/0x10
[<ffffffff8105fa22>] task_work_run+0x72/0x90
[<ffffffff810028ec>] do_notify_resume+0x6c/0xc0
[<ffffffff8126596e>] ? trace_hardirqs_on_thunk+0x3a/0x3c
[<ffffffff815c0f88>] int_signal+0x12/0x17
---[ end trace 793179526ee09b2c ]---
It was finally narrowed down to unloading a module that was being traced.
It was actually more than that. When functions are being traced, there's
a table of all functions that have a ref count of the number of active
tracers attached to that function. When a function trace callback is
registered to a function, the function's record ref count is incremented.
When it is unregistered, the function's record ref count is decremented.
If an inconsistency is detected (ref count goes below zero) the above
warning is shown and the function tracing is permanently disabled until
reboot.
The ftrace callback ops holds a hash of functions that it filters on
(and/or filters off). If the hash is empty, the default means to filter
all functions (for the filter_hash) or to disable no functions (for the
notrace_hash).
When a module is unloaded, it frees the function records that represent
the module functions. These records exist on their own pages, that is
function records for one module will not exist on the same page as
function records for other modules or even the core kernel.
Now when a module unloads, the records that represents its functions are
freed. When the module is loaded again, the records are recreated with
a default ref count of zero (unless there's a callback that traces all
functions, then they will also be traced, and the ref count will be
incremented).
The problem is that if an ftrace callback hash includes functions of the
module being unloaded, those hash entries will not be removed. If the
module is reloaded in the same location, the hash entries still point
to the functions of the module but the module's ref counts do not reflect
that.
With the help of Steve and Joern, we found a reproducer:
Using uinput module and uinput_release function.
cd /sys/kernel/debug/tracing
modprobe uinput
echo uinput_release > set_ftrace_filter
echo function > current_tracer
rmmod uinput
modprobe uinput
# check /proc/modules to see if loaded in same addr, otherwise try again
echo nop > current_tracer
[BOOM]
The above loads the uinput module, which creates a table of functions that
can be traced within the module.
We add uinput_release to the filter_hash to trace just that function.
Enable function tracincg, which increments the ref count of the record
associated to uinput_release.
Remove uinput, which frees the records including the one that represents
uinput_release.
Load the uinput module again (and make sure it's at the same address).
This recreates the function records all with a ref count of zero,
including uinput_release.
Disable function tracing, which will decrement the ref count for uinput_release
which is now zero because of the module removal and reload, and we have
a mismatch (below zero ref count).
The solution is to check all currently tracing ftrace callbacks to see if any
are tracing any of the module's functions when a module is loaded (it already does
that with callbacks that trace all functions). If a callback happens to have
a module function being traced, it increments that records ref count and starts
tracing that function.
There may be a strange side effect with this, where tracing module functions
on unload and then reloading a new module may have that new module's functions
being traced. This may be something that confuses the user, but it's not
a big deal. Another approach is to disable all callback hashes on module unload,
but this leaves some ftrace callbacks that may not be registered, but can
still have hashes tracing the module's function where ftrace doesn't know about
it. That situation can cause the same bug. This solution solves that case too.
Another benefit of this solution, is it is possible to trace a module's
function on unload and load.
Link: http://lkml.kernel.org/r/20130705142629.GA325@redhat.com
Reported-by: Jörn Engel <joern@logfs.org>
Reported-by: Dave Jones <davej@redhat.com>
Reported-by: Steve Hodgson <steve@purestorage.com>
Tested-by: Steve Hodgson <steve@purestorage.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
A perf event can be used without forcing the tick to
stay alive if it doesn't use a frequency but a sample
period and if it doesn't throttle (raise storm of events).
Since the lockup detector neither use a perf event frequency
nor should ever throttle due to its high period, it can now
run concurrently with the full dynticks feature.
So remove the hack that disabled the watchdog.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Anish Singh <anish198519851985@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1374539466-4799-9-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently the full dynticks subsystem keep the
tick alive as long as there are perf events running.
This prevents the tick from being stopped as long as features
such that the lockup detectors are running. As a temporary fix,
the lockup detector is disabled by default when full dynticks
is built but this is not a long term viable solution.
To fix this, only keep the tick alive when an event configured
with a frequency rather than a period is running on the CPU,
or when an event throttles on the CPU.
These are the only purposes of the perf tick, especially now that
the rotation of flexible events is handled from a seperate hrtimer.
The tick can be shutdown the rest of the time.
Original-patch-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1374539466-4799-8-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This is going to be used by the full dynticks subsystem
as a finer-grained information to know when to keep and
when to stop the tick.
Original-patch-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1374539466-4799-7-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When an event is migrated, move the event per-cpu
accounting accordingly so that branch stack and cgroup
events work correctly on the new CPU.
Original-patch-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1374539466-4799-6-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This way we can use the per-cpu handling seperately.
This is going to be used by to fix the event migration
code accounting.
Original-patch-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1374539466-4799-5-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Gather all the event accounting code to a single place,
once all the prerequisites are completed. This simplifies
the refcounting.
Original-patch-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1374539466-4799-4-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In case of allocation failure, get_callchain_buffer() keeps the
refcount incremented for the current event.
As a result, when get_callchain_buffers() returns an error,
we must cleanup what it did by cancelling its last refcount
with a call to put_callchain_buffers().
This is a hack in order to be able to call free_event()
after that failure.
The original purpose of that was to simplify the failure
path. But this error handling is actually counter intuitive,
ugly and not very easy to follow because one expect to
see the resources used to perform a service to be cleaned
by the callee if case of failure, not by the caller.
So lets clean this up by cancelling the refcount from
get_callchain_buffer() in case of failure. And correctly free
the event accordingly in perf_event_alloc().
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1374539466-4799-3-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
On callchain buffers allocation failure, free_event() is
called and all the accounting performed in perf_event_alloc()
for that event is cancelled.
But if the event has branch stack sampling, it is unaccounted
as well from the branch stack sampling events refcounts.
This is a bug because this accounting is performed after the
callchain buffer allocation. As a result, the branch stack sampling
events refcount can become negative.
To fix this, move the branch stack event accounting before the
callchain buffer allocation.
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1374539466-4799-2-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
After commit 8969a5ede0
("generic-ipi: remove kmalloc()"), wait = 0 can be guaranteed,
and all callsites of generic_exec_single() do an unconditional
csd_lock() now.
So csd_flags is unnecessary now. Remove it.
Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Link: http://lkml.kernel.org/r/51F72DA1.7010401@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The check needs to be for > 1, because ctx->acquired is already incremented.
This will prevent ww_mutex_lock_slow from returning -EDEADLK and not locking
the mutex. It caused a lot of false gpu lockups on radeon with
CONFIG_DEBUG_WW_MUTEX_SLOWPATH=y because a function that shouldn't be able
to return -EDEADLK did.
Signed-off-by: Maarten Lankhorst <maarten.lankhorst@canonical.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/51F775B5.201@canonical.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We typically update a task_group's shares within the dequeue/enqueue
path. However, continuously running tasks sharing a CPU are not
subject to these updates as they are only put/picked. Unfortunately,
when we reverted f269ae046 (in 17bc14b7), we lost the augmenting
periodic update that was supposed to account for this; resulting in a
potential loss of fairness.
To fix this, re-introduce the explicit update in
update_cfs_rq_blocked_load() [called via entity_tick()].
Reported-by: Max Hailperin <max@gustavus.edu>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Paul Turner <pjt@google.com>
Link: http://lkml.kernel.org/n/tip-9545m3apw5d93ubyrotrj31y@git.kernel.org
Cc: <stable@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Calling freeze_processes sets a global flag that will cause any
process that calls try_to_freeze to enter the refrigerator. It
skips sending a signal to the current task, but if the current
task ever hits try_to_freeze, all threads will be frozen and the
system will deadlock.
Set a new flag, PF_SUSPEND_TASK, on the task that calls
freeze_processes. The flag notifies the freezer that the thread
is involved in suspend and should not be frozen. Also add a
WARN_ON in thaw_processes if the caller does not have the
PF_SUSPEND_TASK flag set to catch if a different task calls
thaw_processes than the one that called freeze_processes, leaving
a task with PF_SUSPEND_TASK permanently set on it.
Threads that spawn off a task with PF_SUSPEND_TASK set (which
swsusp does) will also have PF_SUSPEND_TASK set, preventing them
from freezing while they are helping with suspend, but they need
to be dead by the time suspend is triggered, otherwise they may
run when userspace is expected to be frozen. Add a WARN_ON in
thaw_processes if more than one thread has the PF_SUSPEND_TASK
flag set.
Reported-and-tested-by: Michael Leun <lkml20130126@newton.leun.net>
Signed-off-by: Colin Cross <ccross@android.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
When ftrace ops modifies the functions that it will trace, the update
to the function mcount callers may need to be modified. Consolidate
the two places that do the checks to see if an update is required
with a wrapper function for those checks.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Change remove_event_file_dir() to clear ->i_private for every
file we are going to remove.
We need to check file->dir != NULL because event_create_dir()
can fail. debugfs_remove_recursive(NULL) is fine but the patch
moves it under the same check anyway for readability.
spin_lock(d_lock) and "d_inode != NULL" check are not needed
afaics, but I do not understand this code enough.
tracing_open_generic_file() and tracing_release_generic_file()
can go away, ftrace_enable_fops and ftrace_event_filter_fops()
use tracing_open_generic() but only to check tracing_disabled.
This fixes all races with event_remove() or instance_delete().
f_op->read/write/whatever can never use the freed file/call,
all event/* files were changed to check and use ->i_private
under event_mutex.
Note: this doesn't not fix other problems, event_remove() can
destroy the active ftrace_event_call, we need more changes but
those changes are completely orthogonal.
Link: http://lkml.kernel.org/r/20130728183527.GB16723@redhat.com
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Preparation for the next patch. Extract the common code from
remove_event_from_tracers() and __trace_remove_event_dirs()
into the new helper, remove_event_file_dir().
The patch looks more complicated than it actually is, it also
moves remove_subsystem() up to avoid the forward declaration.
Link: http://lkml.kernel.org/r/20130726172547.GA3629@redhat.com
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
trace_format_open() and trace_format_seq_ops are racy, nothing
protects ftrace_event_call from trace_remove_event_call().
Change f_start() to take event_mutex and verify i_private != NULL,
change f_stop() to drop this lock.
This fixes nothing, but now we can change debugfs_remove("format")
callers to nullify ->i_private and fix the the problem.
Note: the usage of event_mutex is sub-optimal but simple, we can
change this later.
Link: http://lkml.kernel.org/r/20130726172543.GA3622@redhat.com
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
event_filter_read/write() are racy, ftrace_event_call can be already
freed by trace_remove_event_call() callers.
1. Shift mutex_lock(event_mutex) from print/apply_event_filter to
the callers.
2. Change the callers, event_filter_read() and event_filter_write()
to read i_private under this mutex and abort if it is NULL.
This fixes nothing, but now we can change debugfs_remove("filter")
callers to nullify ->i_private and fix the the problem.
Link: http://lkml.kernel.org/r/20130726172540.GA3619@redhat.com
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
tracing_open_generic_file() is racy, ftrace_event_file can be
already freed by rmdir or trace_remove_event_call().
Change event_enable_read() and event_disable_read() to read and
verify "file = i_private" under event_mutex.
This fixes nothing, but now we can change debugfs_remove("enable")
callers to nullify ->i_private and fix the the problem.
Link: http://lkml.kernel.org/r/20130726172536.GA3612@redhat.com
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
event_id_read() is racy, ftrace_event_call can be already freed
by trace_remove_event_call() callers.
Change event_create_dir() to pass "data = call->event.type", this
is all event_id_read() needs. ftrace_event_id_fops no longer needs
tracing_open_generic().
We add the new helper, event_file_data(), to read ->i_private, it
will have more users.
Note: currently ACCESS_ONCE() and "id != 0" check are not needed,
but we are going to change event_remove/rmdir to clear ->i_private.
Link: http://lkml.kernel.org/r/20130726172532.GA3605@redhat.com
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently, RCU tracepoints save only a pointer to strings in the
ring buffer. When displayed via the /sys/kernel/debug/tracing/trace file
they are referenced like the printf "%s" that looks at the address
in the ring buffer and prints out the string it points too. This requires
that the strings are constant and persistent in the kernel.
The problem with this is for tools like trace-cmd and perf that read the
binary data from the buffers but have no access to the kernel memory to
find out what string is represented by the address in the buffer.
By using the tracepoint_string infrastructure, the RCU tracepoint strings
can be exported such that userspace tools can map the addresses to
the strings.
# cat /sys/kernel/debug/tracing/printk_formats
0xffffffff81a4a0e8 : "rcu_preempt"
0xffffffff81a4a0f4 : "rcu_bh"
0xffffffff81a4a100 : "rcu_sched"
0xffffffff818437a0 : "cpuqs"
0xffffffff818437a6 : "rcu_sched"
0xffffffff818437a0 : "cpuqs"
0xffffffff818437b0 : "rcu_bh"
0xffffffff818437b7 : "Start context switch"
0xffffffff818437cc : "End context switch"
0xffffffff818437a0 : "cpuqs"
[...]
Now userspaces tools can display:
rcu_utilization: Start context switch
rcu_dyntick: Start 1 0
rcu_utilization: End context switch
rcu_batch_start: rcu_preempt CBs=0/5 bl=10
rcu_dyntick: End 0 140000000000000
rcu_invoke_callback: rcu_preempt rhp=0xffff880071c0d600 func=proc_i_callback
rcu_invoke_callback: rcu_preempt rhp=0xffff880077b5b230 func=__d_free
rcu_dyntick: Start 140000000000000 0
rcu_invoke_callback: rcu_preempt rhp=0xffff880077563980 func=file_free_rcu
rcu_batch_end: rcu_preempt CBs-invoked=3 idle=>c<>c<>c<>c<
rcu_utilization: End RCU core
rcu_grace_period: rcu_preempt 9741 start
rcu_dyntick: Start 1 0
rcu_dyntick: End 0 140000000000000
rcu_dyntick: Start 140000000000000 0
Instead of:
rcu_utilization: ffffffff81843110
rcu_future_grace_period: ffffffff81842f1d 9939 9939 9940 0 0 3 ffffffff81842f32
rcu_batch_start: ffffffff81842f1d CBs=0/4 bl=10
rcu_future_grace_period: ffffffff81842f1d 9939 9939 9940 0 0 3 ffffffff81842f3c
rcu_grace_period: ffffffff81842f1d 9939 ffffffff81842f80
rcu_invoke_callback: ffffffff81842f1d rhp=0xffff88007888aac0 func=file_free_rcu
rcu_grace_period: ffffffff81842f1d 9939 ffffffff81842f95
rcu_invoke_callback: ffffffff81842f1d rhp=0xffff88006aeb4600 func=proc_i_callback
rcu_future_grace_period: ffffffff81842f1d 9939 9939 9940 0 0 3 ffffffff81842f32
rcu_future_grace_period: ffffffff81842f1d 9939 9939 9940 0 0 3 ffffffff81842f3c
rcu_invoke_callback: ffffffff81842f1d rhp=0xffff880071cb9fc0 func=__d_free
rcu_grace_period: ffffffff81842f1d 9939 ffffffff81842f80
rcu_invoke_callback: ffffffff81842f1d rhp=0xffff88007888ae80 func=file_free_rcu
rcu_batch_end: ffffffff81842f1d CBs-invoked=4 idle=>c<>c<>c<>c<
rcu_utilization: ffffffff8184311f
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The RCU_STATE_INITIALIZER() macro is used only in the rcutree.c file
as well as the rcutree_plugin.h file. It is passed as a rvalue to
a variable of a similar name. A per_cpu variable is also created
with a similar name as well.
The uses of RCU_STATE_INITIALIZER() can be simplified to remove some
of the duplicate code that is done. Currently the three users of this
macro has this format:
struct rcu_state rcu_sched_state =
RCU_STATE_INITIALIZER(rcu_sched, call_rcu_sched);
DEFINE_PER_CPU(struct rcu_data, rcu_sched_data);
Notice that "rcu_sched" is called three times. This is the same with
the other two users. This can be condensed to just:
RCU_STATE_INITIALIZER(rcu_sched, call_rcu_sched);
by moving the rest into the macro itself.
This also opens the door to allow the RCU tracepoint strings and
their addresses to be exported so that userspace tracing tools can
translate the contents of the pointers of the RCU tracepoints.
The change will allow for helper code to be placed in the
RCU_STATE_INITIALIZER() macro to export the name that is used.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
All the RCU tracepoints and functions that reference char pointers do
so with just 'char *' even though they do not modify the contents of
the string itself. This will cause warnings if a const char * is used
in one of these functions.
The RCU tracepoints store the pointer to the string to refer back to them
when the trace output is displayed. As this can be minutes, hours or
even days later, those strings had better be constant.
This change also opens the door to allow the RCU tracepoint strings and
their addresses to be exported so that userspace tracing tools can
translate the contents of the pointers of the RCU tracepoints.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Comment for cpuset_css_offline() was on top of cpuset_css_free().
Move it.
Signed-off-by: Zhao Hongjiang <zhaohongjiang@huawei.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
get rid of the useless forward declaration of the struct cpuset cause the
below define it.
Signed-off-by: Zhao Hongjiang <zhaohongjiang@huawei.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
and deleting that event at the same time (both must be done as root).
I also found a bug while testing Oleg's patches which has to do with
a race with kprobes using the function tracer.
There's also a deadlock fix that was introduced with the previous fixes.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQEcBAABAgAGBQJR8nMiAAoJEOdOSU1xswtMXMMH/jNuG1vDcTjo7WXu3kYHJNWc
u1Z3YPXunFozz4DofzXPCuSkSgRqSR4cVeGOAv3oEfwQv07jJdTAor8/FuVTNXW9
yiGMUvth0nLVZQEmsh3VvVztqFy8FxhpnQHhIa6pillPUdROKgE/A7/q4wT37lxT
qniM1QJTK9fIf2t6suMwSsBD7fepiDkdHwVbs5NvN7qj62/QSPN+pHLAOBl90AGp
r5eUSj8cDHxmzlV+GAJxeqI7KH+P2PGts9USI+s5EX8mODci620jz1HkKab6XRpz
ggYdNzJ21z1fO9vfnpGCF0d03sKdnbJoIQjkyD4AJHlojmLe3s6l/nxS6jg3wH8=
=VrR4
-----END PGP SIGNATURE-----
Merge tag 'trace-fixes-3.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"Oleg is working on fixing a very tight race between opening a event
file and deleting that event at the same time (both must be done as
root).
I also found a bug while testing Oleg's patches which has to do with a
race with kprobes using the function tracer.
There's also a deadlock fix that was introduced with the previous
fixes"
* tag 'trace-fixes-3.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Remove locking trace_types_lock from tracing_reset_all_online_cpus()
ftrace: Add check for NULL regs if ops has SAVE_REGS set
tracing: Kill trace_cpu struct/members
tracing: Change tracing_fops/snapshot_fops to rely on tracing_get_cpu()
tracing: Change tracing_entries_fops to rely on tracing_get_cpu()
tracing: Change tracing_stats_fops to rely on tracing_get_cpu()
tracing: Change tracing_buffers_fops to rely on tracing_get_cpu()
tracing: Change tracing_pipe_fops() to rely on tracing_get_cpu()
tracing: Introduce trace_create_cpu_file() and tracing_get_cpu()
When (integer) sysctl values are expressed in ms and have to be
represented internally as jiffies. The msecs_to_jiffies function
returns an unsigned long, which gets assigned to the integer.
This patch prevents the value to be assigned if bigger than
INT_MAX, done in a similar way as in cba9f3 ("Range checking in
do_proc_dointvec_(userhz_)jiffies_conv").
Signed-off-by: Francesco Fusco <ffusco@redhat.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: linux-kernel@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
There are several tracepoints (mostly in RCU), that reference a string
pointer and uses the print format of "%s" to display the string that
exists in the kernel, instead of copying the actual string to the
ring buffer (saves time and ring buffer space).
But this has an issue with userspace tools that read the binary buffers
that has the address of the string but has no access to what the string
itself is. The end result is just output that looks like:
rcu_dyntick: ffffffff818adeaa 1 0
rcu_dyntick: ffffffff818adeb5 0 140000000000000
rcu_dyntick: ffffffff818adeb5 0 140000000000000
rcu_utilization: ffffffff8184333b
rcu_utilization: ffffffff8184333b
The above is pretty useless when read by the userspace tools. Ideally
we would want something that looks like this:
rcu_dyntick: Start 1 0
rcu_dyntick: End 0 140000000000000
rcu_dyntick: Start 140000000000000 0
rcu_callback: rcu_preempt rhp=0xffff880037aff710 func=put_cred_rcu 0/4
rcu_callback: rcu_preempt rhp=0xffff880078961980 func=file_free_rcu 0/5
rcu_dyntick: End 0 1
The trace_printk() which also only stores the address of the string
format instead of recording the string into the buffer itself, exports
the mapping of kernel addresses to format strings via the printk_format
file in the debugfs tracing directory.
The tracepoint strings can use this same method and output the format
to the same file and the userspace tools will be able to decipher
the address without any modification.
The tracepoint strings need its own section to save the strings because
the trace_printk section will cause the trace_printk() buffers to be
allocated if anything exists within the section. trace_printk() is only
used for debugging and should never exist in the kernel, we can not use
the trace_printk sections.
Add a new tracepoint_str section that will also be examined by the output
of the printk_format file.
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Commit a82274151a "tracing: Protect ftrace_trace_arrays list in trace_events.c"
added taking the trace_types_lock mutex in trace_events.c as there were
several locations that needed it for protection. Unfortunately, it also
encapsulated a call to tracing_reset_all_online_cpus() which also takes
the trace_types_lock, causing a deadlock.
This happens when a module has tracepoints and has been traced. When the
module is removed, the trace events module notifier will grab the
trace_types_lock, do a bunch of clean ups, and also clears the buffer
by calling tracing_reset_all_online_cpus. This doesn't happen often
which explains why it wasn't caught right away.
Commit a82274151a was marked for stable, which means this must be
sent to stable too.
Link: http://lkml.kernel.org/r/51EEC646.7070306@broadcom.com
Reported-by: Arend van Spril <arend@broadcom.com>
Tested-by: Arend van Spriel <arend@broadcom.com>
Cc: Alexander Z Lam <azl@google.com>
Cc: Vaibhav Nagarnaik <vnagarnaik@google.com>
Cc: David Sharp <dhsharp@google.com>
Cc: stable@vger.kernel.org # 3.10
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Change where ftrace is disabled and re-enabled during system
suspend/resume to allow tracing of device driver pm callbacks.
Ftrace will now be turned off when suspend reaches
disable_nonboot_cpus() instead of at the very beginning of system
suspend.
Ftrace was disabled during suspend/resume back in 2008 by
Steven Rostedt as he discovered there was a conflict in the
enable_nonboot_cpus() call (see commit f42ac38 "ftrace: disable
tracing for suspend to ram"). This change preserves his fix by
disabling ftrace, but only at the function where it is known
to cause problems.
The new change allows tracing of the device level code for better
debug.
[rjw: Changelog]
Signed-off-by: Todd Brandt <todd.e.brandt@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Fengguang reported the following warning when optimistic
spinning is disabled (ie: make allnoconfig):
kernel/mutex.c:599:1: warning: label 'done' defined but not used
Remove the 'done' label altogether.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
cpu is not used after commit 5b8621a68f
Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Kevin Hilman <khilman@linaro.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
If the user enables CONFIG_NO_HZ_FULL and runs the kernel on a machine
with an unstable TSC, it will produce a WARN_ON dump as well as taint
the kernel. This is a bit extreme for a kernel that just enables a
feature but doesn't use it.
The warning should only happen if the user tries to use the feature by
either adding nohz_full to the kernel command line, or by enabling
CONFIG_NO_HZ_FULL_ALL that makes nohz used on all CPUs at boot up. Note,
this second feature should not (yet) be used by distros or anyone that
doesn't care if NO_HZ is used or not.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Kevin Hilman <khilman@linaro.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
If the @fn call work_on_cpu() again, the lockdep will complain:
> [ INFO: possible recursive locking detected ]
> 3.11.0-rc1-lockdep-fix-a #6 Not tainted
> ---------------------------------------------
> kworker/0:1/142 is trying to acquire lock:
> ((&wfc.work)){+.+.+.}, at: [<ffffffff81077100>] flush_work+0x0/0xb0
>
> but task is already holding lock:
> ((&wfc.work)){+.+.+.}, at: [<ffffffff81075dd9>] process_one_work+0x169/0x610
>
> other info that might help us debug this:
> Possible unsafe locking scenario:
>
> CPU0
> ----
> lock((&wfc.work));
> lock((&wfc.work));
>
> *** DEADLOCK ***
It is false-positive lockdep report. In this sutiation,
the two "wfc"s of the two work_on_cpu() are different,
they are both on stack. flush_work() can't be deadlock.
To fix this, we need to avoid the lockdep checking in this case,
thus we instroduce a internal __flush_work() which skip the lockdep.
tj: Minor comment adjustment.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Reported-by: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com>
Reported-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
If a ftrace ops is registered with the SAVE_REGS flag set, and there's
already a ops registered to one of its functions but without the
SAVE_REGS flag, there's a small race window where the SAVE_REGS ops gets
added to the list of callbacks to call for that function before the
callback trampoline gets set to save the regs.
The problem is, the function is not currently saving regs, which opens
a small race window where the ops that is expecting regs to be passed
to it, wont. This can cause a crash if the callback were to reference
the regs, as the SAVE_REGS guarantees that regs will be set.
To fix this, we add a check in the loop case where it checks if the ops
has the SAVE_REGS flag set, and if so, it will ignore it if regs is
not set.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
After the previous changes trace_array_cpu->trace_cpu and
trace_array->trace_cpu becomes write-only. Remove these members
and kill "struct trace_cpu" as well.
As a side effect this also removes memset(per_cpu_memory, 0).
It was not needed, alloc_percpu() returns zero-filled memory.
Link: http://lkml.kernel.org/r/20130723152613.GA23741@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
tracing_open() and tracing_snapshot_open() are racy, the memory
inode->i_private points to can be already freed.
Convert these last users of "inode->i_private == trace_cpu" to
use "i_private = trace_array" and rely on tracing_get_cpu().
v2: incorporate the fix from Steven, tracing_release() must not
blindly dereference file->private_data unless we know that
the file was opened for reading.
Link: http://lkml.kernel.org/r/20130723152610.GA23737@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
tracing_open_generic_tc() is racy, the memory inode->i_private
points to can be already freed.
1. Change its last user, tracing_entries_fops, to use
tracing_*_generic_tr() instead.
2. Change debugfs_create_file("buffer_size_kb", data) callers
to pass "data = tr".
3. Change tracing_entries_read() and tracing_entries_write() to
use tracing_get_cpu().
4. Kill the no longer used tracing_open_generic_tc() and
tracing_release_generic_tc().
Link: http://lkml.kernel.org/r/20130723152606.GA23730@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>