Commit graph

14909 commits

Author SHA1 Message Date
Tejun Heo
38db41d984 workqueue: replace for_each_worker_pool() with for_each_std_worker_pool()
for_each_std_worker_pool() takes @cpu instead of @gcwq.

This is part of an effort to remove global_cwq and make worker_pool
the top level abstraction, which in turn will help implementing worker
pools with user-specified attributes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-01-24 11:01:34 -08:00
Tejun Heo
a1056305fa workqueue: make freezing/thawing per-pool
Instead of holding locks from both pools and then processing the pools
together, make freezing/thwaing per-pool - grab locks of one pool,
process it, release it and then proceed to the next pool.

While this patch changes processing order across pools, order within
each pool remains the same.  As each pool is independent, this
shouldn't break anything.

This is part of an effort to remove global_cwq and make worker_pool
the top level abstraction, which in turn will help implementing worker
pools with user-specified attributes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-01-24 11:01:33 -08:00
Tejun Heo
94cf58bb29 workqueue: make hotplug processing per-pool
Instead of holding locks from both pools and then processing the pools
together, make hotplug processing per-pool - grab locks of one pool,
process it, release it and then proceed to the next pool.

rebind_workers() is updated to take and process @pool instead of @gcwq
which results in a lot of de-indentation.  gcwq_claim_assoc_and_lock()
and its counterpart are replaced with in-line per-pool locking.

While this patch changes processing order across pools, order within
each pool remains the same.  As each pool is independent, this
shouldn't break anything.

This is part of an effort to remove global_cwq and make worker_pool
the top level abstraction, which in turn will help implementing worker
pools with user-specified attributes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-01-24 11:01:33 -08:00
Tejun Heo
d565ed6309 workqueue: move global_cwq->lock to worker_pool
Move gcwq->lock to pool->lock.  The conversion is mostly
straight-forward.  Things worth noting are

* In many places, this removes the need to use gcwq completely.  pool
  is used directly instead.  get_std_worker_pool() is added to help
  some of these conversions.  This also leaves get_work_gcwq() without
  any user.  Removed.

* In hotplug and freezer paths, the pools belonging to a CPU are often
  processed together.  This patch makes those paths hold locks of all
  pools, with highpri lock nested inside, to keep the conversion
  straight-forward.  These nested lockings will be removed by
  following patches.

This is part of an effort to remove global_cwq and make worker_pool
the top level abstraction, which in turn will help implementing worker
pools with user-specified attributes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-01-24 11:01:33 -08:00
Tejun Heo
ec22ca5eab workqueue: move global_cwq->cpu to worker_pool
Move gcwq->cpu to pool->cpu.  This introduces a couple places where
gcwq->pools[0].cpu is used.  These will soon go away as gcwq is
further reduced.

This is part of an effort to remove global_cwq and make worker_pool
the top level abstraction, which in turn will help implementing worker
pools with user-specified attributes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-01-24 11:01:33 -08:00
Tejun Heo
c9e7cf273f workqueue: move busy_hash from global_cwq to worker_pool
There's no functional necessity for the two pools on the same CPU to
share the busy hash table.  It's also likely to be a bottleneck when
implementing pools with user-specified attributes.

This patch makes busy_hash per-pool.  The conversion is mostly
straight-forward.  Changes worth noting are,

* Large block of changes in rebind_workers() is moving the block
  inside for_each_worker_pool() as now there are separate hash tables
  for each pool.  This changes the order of operations but doesn't
  break anything.

* Thre for_each_worker_pool() loops in gcwq_unbind_fn() are combined
  into one.  This again changes the order of operaitons but doesn't
  break anything.

This is part of an effort to remove global_cwq and make worker_pool
the top level abstraction, which in turn will help implementing worker
pools with user-specified attributes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-01-24 11:01:33 -08:00
Tejun Heo
7c3eed5cd6 workqueue: record pool ID instead of CPU in work->data when off-queue
Currently, when a work item is off-queue, work->data records the CPU
it was last on, which is used to locate the last executing instance
for non-reentrance, flushing, etc.

We're in the process of removing global_cwq and making worker_pool the
top level abstraction.  This patch makes work->data point to the pool
it was last associated with instead of CPU.

After the previous WORK_OFFQ_POOL_CPU and worker_poo->id additions,
the conversion is fairly straight-forward.  WORK_OFFQ constants and
functions are modified to record and read back pool ID instead.
worker_pool_by_id() is added to allow looking up pool from ID.
get_work_pool() replaces get_work_gcwq(), which is reimplemented using
get_work_pool().  get_work_pool_id() replaces work_cpu().

This patch shouldn't introduce any observable behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-01-24 11:01:33 -08:00
Tejun Heo
9daf9e678d workqueue: add worker_pool->id
Add worker_pool->id which is allocated from worker_pool_idr.  This
will be used to record the last associated worker_pool in work->data.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-01-24 11:01:33 -08:00
Tejun Heo
715b06b864 workqueue: introduce WORK_OFFQ_CPU_NONE
Currently, when a work item is off queue, high bits of its data
encodes the last CPU it was on.  This is scheduled to be changed to
pool ID, which will make it impossible to use WORK_CPU_NONE to
indicate no association.

This patch limits the number of bits which are used for off-queue cpu
number to 31 (so that the max fits in an int) and uses the highest
possible value - WORK_OFFQ_CPU_NONE - to indicate no association.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-01-24 11:01:33 -08:00
Tejun Heo
35b6bb63b8 workqueue: make GCWQ_FREEZING a pool flag
Make GCWQ_FREEZING a pool flag POOL_FREEZING.  This patch doesn't
change locking - FREEZING on both pools of a CPU are set or clear
together while holding gcwq->lock.  It shouldn't cause any functional
difference.

This leaves gcwq->flags w/o any flags.  Removed.

While at it, convert BUG_ON()s in freeze_workqueue_begin() and
thaw_workqueues() to WARN_ON_ONCE().

This is part of an effort to remove global_cwq and make worker_pool
the top level abstraction, which in turn will help implementing worker
pools with user-specified attributes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-01-24 11:01:33 -08:00
Tejun Heo
2464757086 workqueue: make GCWQ_DISASSOCIATED a pool flag
Make GCWQ_DISASSOCIATED a pool flag POOL_DISASSOCIATED.  This patch
doesn't change locking - DISASSOCIATED on both pools of a CPU are set
or clear together while holding gcwq->lock.  It shouldn't cause any
functional difference.

This is part of an effort to remove global_cwq and make worker_pool
the top level abstraction, which in turn will help implementing worker
pools with user-specified attributes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-01-24 11:01:33 -08:00
Tejun Heo
e34cdddb03 workqueue: use std_ prefix for the standard per-cpu pools
There are currently two worker pools per cpu (including the unbound
cpu) and they are the only pools in use.  New class of pools are
scheduled to be added and some pool related APIs will be added
inbetween.  Call the existing pools the standard pools and prefix them
with std_.  Do this early so that new APIs can use std_ prefix from
the beginning.

This patch doesn't introduce any functional difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-01-24 11:01:33 -08:00
Tejun Heo
e2905b2912 workqueue: unexport work_cpu()
This function no longer has any external users.  Unexport it.  It will
be removed later on.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-01-24 11:01:32 -08:00
Viresh Kumar
16c8f1c72e sched/fair: Set se->vruntime directly in place_entity()
We are first storing the new vruntime in a variable and then
storing it in se->vruntime. Simply update se->vruntime directly.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: linaro-dev@lists.linaro.org
Cc: patches@linaro.org
Cc: peterz@infradead.org
Link: http://lkml.kernel.org/r/ae59db1945518d6f6250920d46eb1f1a9cc0024e.1352361704.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-01-24 18:06:11 +01:00
Alexander Gordeev
51906e779f x86/MSI: Support multiple MSIs in presense of IRQ remapping
The MSI specification has several constraints in comparison with
MSI-X, most notable of them is the inability to configure MSIs
independently. As a result, it is impossible to dispatch
interrupts from different queues to different CPUs. This is
largely devalues the support of multiple MSIs in SMP systems.

Also, a necessity to allocate a contiguous block of vector
numbers for devices capable of multiple MSIs might cause a
considerable pressure on x86 interrupt vector allocator and
could lead to fragmentation of the interrupt vectors space.

This patch overcomes both drawbacks in presense of IRQ remapping
and lets devices take advantage of multiple queues and per-IRQ
affinity assignments.

Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Jeff Garzik <jgarzik@pobox.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/c8bd86ff56b5fc118257436768aaa04489ac0a4c.1353324359.git.agordeev@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-01-24 17:25:12 +01:00
Kirill Tkhai
1158ddb554 sched/rt: Add reschedule check to switched_from_rt()
Reschedule rq->curr if the first RT task has just been
pulled to the rq.

Signed-off-by: Kirill V Tkhai <tkhai@yandex.ru>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tkhai Kirill <tkhai@yandex.ru>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/118761353614535@web28f.yandex.ru
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-01-24 17:14:30 +01:00
Frederic Weisbecker
ba6fdda46b profiling: Remove unused timer hook
The last remaining user was oprofile and its use has been
removed a while ago in commit bc078e4eab
("oprofile: convert oprofile from timer_hook to hrtimer").

There doesn't seem to be any upstream user of this hook
for about two years now. And I'm not even aware of any out of
tree user.

Let's remove it.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/1356191991-2251-1-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-01-24 15:37:26 +01:00
Zhu Yanhai
a59f4e079d sched: Fix the broken sched_rr_get_interval()
The caller of sched_sliced() should pass se.cfs_rq and se as the
arguments, however in sched_rr_get_interval() we gave it
rq.cfs_rq and se, which made the following computation obviously
wrong.

The change was introduced by commit:

  77034937dc sched: fix crash in sys_sched_rr_get_interval()

... 5 years ago, while it had been the correct 'cfs_rq_of' before
the commit. The change seems to be irrelevant to the commit
msg, which was to return a 0 timeslice for tasks that are on an
idle runqueue. So I believe that was just a plain typo.

Signed-off-by: Zhu Yanhai <gaoyang.zyh@taobao.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Turner <pjt@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1357621012-15039-1-git-send-email-gaoyang.zyh@taobao.com
[ Since this is an ABI and an old bug, we'll test this via a
  slow upstream route, to hopefully discover any app breakage. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-01-24 14:41:00 +01:00
Ingo Molnar
2a1337599b Merge branch 'tip/perf/core' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace into perf/core
Pull small function-tracing smatch fixlet from Steve Rostedt.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-01-24 14:31:53 +01:00
Steven Rostedt
d41032a83b tracing: Fix unsigned int compare of zero in recursion check
Dan's smatch found a compare bug with the result of the
trace_test_and_set_recursion() and comparing to less than
zero. If the function fails, it returns -1, but was saved in
an unsigned int, which will never be less than zero and will
ignore the result of the test if a recursion did happen.

Luckily this is the last of the recursion tests, as the
infrastructure of ftrace would catch recursions before it
got here, except for some few exceptions.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-01-24 07:52:34 -05:00
Ingo Molnar
4913ae3991 Merge branch 'tip/perf/core' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace into perf/core
Pull tracing updates from Steve Rostedt.

This commit:

      tracing: Remove the extra 4 bytes of padding in events

changes the ABI. All involved parties seem to agree that it's safe to
do now, but the devil is in the details ...

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-01-24 13:39:31 +01:00
Ingo Molnar
d36b7b9643 Merge branch 'rcu/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/urgent
Pull RCU fixes from Paul E. McKenney.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-01-24 13:28:24 +01:00
Ingo Molnar
786133f6e8 Merge branch 'core/irq_work' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks into irq/core
irq_work fixes and cleanups, in preparation for full dyntics support.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-01-24 12:48:41 +01:00
Ingo Molnar
befddb21c8 Linux 3.8-rc4
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.19 (GNU/Linux)
 
 iQEcBAABAgAGBQJQ+MDOAAoJEHm+PkMAQRiGFL8H/39bXl3xs/ZkBct3ciICtaAt
 UeRcdjvBdUrObQmmw8FEFJA62Az74g1UYyNyc0yVJoB6DwL2oawXhGXQXqDXMBMs
 xJa/dpx41gmcrvsLOJ1v2Tmokk32hgdpLNWwzcn36xZMsrhYUxjxPhVNK4aDbwrW
 3uzBj83dK1rvi2PEz3hgHizRhmdD1cNgLXKrczNbT6y6rLp6Jkng2q5UhNfaZUvE
 A7Ub5/nW+X6gb1PctAjulz2CoySq0FbH4viUc+zLj3NN0L9ImeJvpREauE2QPVgJ
 a8Sg3E6/cL4wIXBEV1E1KapVTd1XIEGa8yB5X6q8HWNKI8nf7Kwuj65VgvwJIMc=
 =+VR8
 -----END PGP SIGNATURE-----

Merge tag 'v3.8-rc4' into irq/core

Merge Linux 3.8-rc4 before pulling in new commits - we were on an old v3.7 base.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-01-24 12:47:48 +01:00
Tejun Heo
9fdb04cdc5 async: replace list of active domains with global list of pending items
Global synchronization - async_synchronize_full() - is currently
implemented by keeping a list of all active registered domains and
syncing them one by one until no domain is active.

While this isn't necessarily a complex scheme, it can easily be
simplified by keeping global list of the pending items of all
registered active domains instead of list of domains and simply using
the globl pending list the same way as domain syncing.

This patch replaces async_domains with async_global_pending and update
lowest_in_progress() to use the global pending list if @domain is
%NULL.  async_synchronize_full_domain(NULL) is now allowed and
equivalent to async_synchronize_full().  As no one is calling with
NULL domain, this doesn't affect any existing users.

async_register_mutex is no longer necessary and dropped.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Dan Williams <djbw@fb.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
2013-01-23 09:32:30 -08:00
Tejun Heo
52722794d6 async: keep pending tasks on async_domain and remove async_pending
Async kept single global pending list and per-domain running lists.
When an async item is queued, it's put on the global pending list.
The item is moved to the per-domain running list when its execution
starts.

At this point, this design complicates execution and synchronization
without bringing any benefit.  The list only matters for
synchronization which doesn't care whether a given async item is
pending or executing.  Also, global synchronization is done by
iterating through all active registered async_domains, so the global
async_pending list doesn't help anything either.

Rename async_domain->running to async_domain->pending and put async
items directly there and remove when execution completes.  This
simplifies lowest_in_progress() a lot - the first item on the pending
list is the one with the lowest cookie, and async_run_entry_fn()
doesn't have to mess with moving the item from pending to running.

After the change, whether a domain is empty or not can be trivially
determined by looking at async_domain->pending.  Remove
async_domain->count and use list_empty() on pending instead.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Dan Williams <djbw@fb.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
2013-01-23 09:32:30 -08:00
Tejun Heo
c68eee14ec async: use ULLONG_MAX for infinity cookie value
Currently, next_cookie is used as the infinity value.  In most cases,
this should work fine but it theoretically could bring subtle behavior
difference between async_synchronize_full() and
async_synchronize_full_domain().

async_synchronize_full() keeps waiting until there's no registered
async_entry left regardless of what next_cookie was when the function
was called.  It guarantees that the queue is completely drained at
least once before returning.

However, async_synchronize_full_domain() doesn't.  It synchronizes
upto next_cookie and if further async jobs are queued after the
next_cookie value to synchronize is decided, they won't be waited for.

For unrelated async jobs, the behavior difference doesn't matter;
however, if async jobs which are related (nested or otherwise) to the
executing ones are queued while sychronization is in progress, the
resulting behavior difference could be problematic.

This can be easily fixed by using ULLONG_MAX as the infinity value
instead.  Define ASYNC_COOKIE_MAX as ULLONG_MAX and use it as the
infinity value for synchronization.  This makes
async_synchronize_full_domain() fully drain the domain at least once
before returning, making its behavior match async_synchronize_full().

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Dan Williams <djbw@fb.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
2013-01-23 09:32:30 -08:00
Tejun Heo
8723d5037c async: bring sanity to the use of words domain and running
In the beginning, running lists were literal struct list_heads.  Later
on, struct async_domain was added.  For some reason, while the
conversion substituted list_heads with async_domains, the variable
names weren't fully converted.  In more places, "running" was used for
struct async_domain while other places adopted new "domain" name.

The situation is made much worse by having async_domain's running list
named "domain" and async_entry's field pointing to async_domain named
"running".

So, we end up with mix of "running" and "domain" for variable names
for async_domain, with the field names of async_domain and async_entry
swapped between "running" and "domain".

It feels almost intentionally made to be as confusing as possible.
Bring some sanity by

* Renaming all async_domain variables "domain".

* s/async_running/async_dfl_domain/

* s/async_domain->domain/async_domain->running/

* s/async_entry->running/async_entry->domain/

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Dan Williams <djbw@fb.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
2013-01-23 09:32:30 -08:00
Tejun Heo
c14afb82ff Merge branch 'master' into for-3.9-async
To receive f56c3196f2 ("async: fix
__lowest_in_progress()").

Signed-off-by: Tejun Heo <tj@kernel.org>
2013-01-23 09:31:01 -08:00
Steven Rostedt
0b07436d95 ring-buffer: Remove trace.h from ring_buffer.c
ring_buffer.c use to require declarations from trace.h, but
these have moved to the generic header files. There's nothing
in trace.h that ring_buffer.c requires.

There's some headers that trace.h included that ring_buffer.c
needs, but it's best that it includes them directly, and not
include trace.h.

Also, some things may use ring_buffer.c without having tracing
configured. This removes the dependency that may come in the
future.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-01-22 23:38:03 -05:00
Steven Rostedt
567cd4da54 ring-buffer: User context bit recursion checking
Using context bit recursion checking, we can help increase the
performance of the ring buffer.

Before this patch:

 # echo function > /debug/tracing/current_tracer
 # for i in `seq 10`; do ./hackbench 50; done
Time: 10.285
Time: 10.407
Time: 10.243
Time: 10.372
Time: 10.380
Time: 10.198
Time: 10.272
Time: 10.354
Time: 10.248
Time: 10.253

(average: 10.3012)

Now we have:

 # echo function > /debug/tracing/current_tracer
 # for i in `seq 10`; do ./hackbench 50; done
Time: 9.712
Time: 9.824
Time: 9.861
Time: 9.827
Time: 9.962
Time: 9.905
Time: 9.886
Time: 10.088
Time: 9.861
Time: 9.834

(average: 9.876)

 a 4% savings!

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-01-22 23:38:03 -05:00
Steven Rostedt
897f68a48b ftrace: Use only the preempt version of function tracing
The function tracer had two different versions of function tracing.

The disabling of irqs version and the preempt disable version.

As function tracing in very intrusive and can cause nasty recursion
issues, it has its own recursion protection. But the old method to
do this was a flat layer. If it detected that a recursion was happening
then it would just return without recording.

This made the preempt version (much faster than the irq disabling one)
not very useful, because if an interrupt were to occur after the
recursion flag was set, the interrupt would not be traced at all,
because every function that was traced would think it recursed on
itself (due to the context it preempted setting the recursive flag).

Now that we have a recursion flag for every context level, we
no longer need to worry about that. We can disable preemption,
set the current context recursion check bit, and go on. If an
interrupt were to come along, it would check its own context bit
and happily continue to trace.

As the preempt version is faster than the irq disable version,
there's no more reason to keep the preempt version around.
And the irq disable version still had an issue with missing
out on tracing NMI code.

Remove the irq disable function tracer version and have the
preempt disable version be the default (and only version).

Before this patch we had from running:

 # echo function > /debug/tracing/current_tracer
 # for i in `seq 10`; do ./hackbench 50; done
Time: 12.028
Time: 11.945
Time: 11.925
Time: 11.964
Time: 12.002
Time: 11.910
Time: 11.944
Time: 11.929
Time: 11.941
Time: 11.924

(average: 11.9512)

Now we have:

 # echo function > /debug/tracing/current_tracer
 # for i in `seq 10`; do ./hackbench 50; done
Time: 10.285
Time: 10.407
Time: 10.243
Time: 10.372
Time: 10.380
Time: 10.198
Time: 10.272
Time: 10.354
Time: 10.248
Time: 10.253

(average: 10.3012)

 a 13.8% savings!

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-01-22 23:38:02 -05:00
Steven Rostedt
edc15cafcb tracing: Avoid unnecessary multiple recursion checks
When function tracing occurs, the following steps are made:
  If arch does not support a ftrace feature:
   call internal function (uses INTERNAL bits) which calls...
  If callback is registered to the "global" list, the list
   function is called and recursion checks the GLOBAL bits.
   then this function calls...
  The function callback, which can use the FTRACE bits to
   check for recursion.

Now if the arch does not suppport a feature, and it calls
the global list function which calls the ftrace callback
all three of these steps will do a recursion protection.
There's no reason to do one if the previous caller already
did. The recursion that we are protecting against will
go through the same steps again.

To prevent the multiple recursion checks, if a recursion
bit is set that is higher than the MAX bit of the current
check, then we know that the check was made by the previous
caller, and we can skip the current check.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-01-22 23:38:01 -05:00
Steven Rostedt
e46cbf75c6 tracing: Make the trace recursion bits into enums
Convert the bits into enums which makes the code a little easier
to maintain.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-01-22 23:38:00 -05:00
Steven Rostedt
c29f122cd7 ftrace: Add context level recursion bit checking
Currently for recursion checking in the function tracer, ftrace
tests a task_struct bit to determine if the function tracer had
recursed or not. If it has, then it will will return without going
further.

But this leads to races. If an interrupt came in after the bit
was set, the functions being traced would see that bit set and
think that the function tracer recursed on itself, and would return.

Instead add a bit for each context (normal, softirq, irq and nmi).

A check of which context the task is in is made before testing the
associated bit. Now if an interrupt preempts the function tracer
after the previous context has been set, the interrupt functions
can still be traced.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-01-22 23:38:00 -05:00
Steven Rostedt
0a016409e4 ftrace: Optimize the function tracer list loop
There is lots of places that perform:

       op = rcu_dereference_raw(ftrace_control_list);
       while (op != &ftrace_list_end) {

Add a helper macro to do this, and also optimize for a single
entity. That is, gcc will optimize a loop for either no iterations
or more than one iteration. But usually only a single callback
is registered to the function tracer, thus the optimized case
should be a single pass. to do this we now do:

	op = rcu_dereference_raw(list);
	do {
		[...]
	} while (likely(op = rcu_dereference_raw((op)->next)) &&
	       unlikely((op) != &ftrace_list_end));

An op is always registered (ftrace_list_end when no callbacks is
registered), thus when a single callback is registered, the link
list looks like:

 top => callback => ftrace_list_end => NULL.

The likely(op = op->next) still must be performed due to the race
of removing the callback, where the first op assignment could
equal ftrace_list_end. In that case, the op->next would be NULL.
But this is unlikely (only happens in a race condition when
removing the callback).

But it is very likely that the next op would be ftrace_list_end,
unless more than one callback has been registered. This tells
gcc what the most common case is and makes the fast path with
the least amount of branches.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-01-22 23:37:59 -05:00
Steven Rostedt
9640388b63 ftrace: Fix function tracing recursion self test
The function tracing recursion self test should not crash
the machine if the resursion test fails. If it detects that
the function tracing is recursing when it should not be, then
bail, don't go into an infinite recursive loop.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-01-22 23:37:58 -05:00
Steven Rostedt
6350379452 ftrace: Fix global function tracers that are not recursion safe
If one of the function tracers set by the global ops is not recursion
safe, it can still be called directly without the added recursion
supplied by the ftrace infrastructure.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-01-22 23:37:57 -05:00
Steven Rostedt
05cbbf643b tracing: Fix selftest function recursion accounting
The test that checks function recursion does things differently
if the arch does not support all ftrace features. But that really
doesn't make a difference with how the test runs, and either way
the count variable should be 2 at the end.

Currently the test wrongly fails for archs that don't support all
the ftrace features.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-01-22 23:35:11 -05:00
Steven Rostedt
34600f0e9c tracing: Fix race with max_tr and changing tracers
There's a race condition between the setting of a new tracer and
the update of the max trace buffers (the swap). When a new tracer
is added, it sets current_trace to nop_trace before disabling
the old tracer. At this moment, if the old tracer uses update_max_tr(),
the update may trigger the warning against !current_trace->use_max-tr,
as nop_trace doesn't have that set.

As update_max_tr() requires that interrupts be disabled, we can
add a check to see if current_trace == nop_trace and bail if it
does. Then when disabling the current_trace, set it to nop_trace
and run synchronize_sched(). This will make sure all calls to
update_max_tr() have completed (it was called with interrupts disabled).

As a clean up, this commit also removes shrinking and recreating
the max_tr buffer if the old and new tracers both have use_max_tr set.
The old way use to always shrink the buffer, and then expand it
for the next tracer. This is a waste of time.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-01-22 23:33:07 -05:00
Tejun Heo
0fdff3ec6d async, kmod: warn on synchronous request_module() from async workers
Synchronous requet_module() from an async worker can lead to deadlock
because module init path may invoke async_synchronize_full().  The
async worker waits for request_module() to complete and the module
loading waits for the async task to finish.  This bug happened in the
block layer because of default elevator auto-loading.

Block layer has been updated not to do default elevator auto-loading
and it has been decided to disallow synchronous request_module() from
async workers.

Trigger WARN_ON_ONCE() on synchronous request_module() from async
workers.

For more details, please refer to the following thread.

  http://thread.gmane.org/gmane.linux.kernel/1420814

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Alex Riesen <raa.lkml@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
2013-01-22 16:48:03 -08:00
Tejun Heo
f56c3196f2 async: fix __lowest_in_progress()
Commit 083b804c4d ("async: use workqueue for worker pool") made it
possible that async jobs are moved from pending to running out-of-order.
While pending async jobs will be queued and dispatched for execution in
the same order, nothing guarantees they'll enter "1) move self to the
running queue" of async_run_entry_fn() in the same order.

Before the conversion, async implemented its own worker pool.  An async
worker, upon being woken up, fetches the first item from the pending
list, which kept the executing lists sorted.  The conversion to
workqueue was done by adding work_struct to each async_entry and async
just schedules the work item.  The queueing and dispatching of such work
items are still in order but now each worker thread is associated with a
specific async_entry and moves that specific async_entry to the
executing list.  So, depending on which worker reaches that point
earlier, which is non-deterministic, we may end up moving an async_entry
with larger cookie before one with smaller one.

This broke __lowest_in_progress().  running->domain may not be properly
sorted and is not guaranteed to contain lower cookies than pending list
when not empty.  Fix it by ensuring sort-inserting to the running list
and always looking at both pending and running when trying to determine
the lowest cookie.

Over time, the async synchronization implementation became quite messy.
We better restructure it such that each async_entry is linked to two
lists - one global and one per domain - and not move it when execution
starts.  There's no reason to distinguish pending and running.  They
behave the same for synchronization purposes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-01-22 16:21:24 -08:00
Linus Torvalds
d26d45253b Kprobes now uses the function tracer if it can. That is, if a probe
is placed on a function mcount/nop location, and the arch supports it,
 instead of adding a breakpoint, kprobes will register a function callback
 as that is much more efficient.
 
 The function tracer requires to update modules before they run, and
 uses the module notifier to do so. But if something else in the module
 notifiers registers a kprobe at one of these locations, before ftrace
 can get to it, then the system could fail.
 
 The function tracer must be initialized early, otherwise module notifiers
 that probe will only work by chance.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQEcBAABAgAGBQJQ/ZaaAAoJEOdOSU1xswtMQsoH/02S5ngidMgIXwYCGzwxR6Ym
 f2krSz76rBkFn9ivFPJFfq2Wi05UhhVfFr4gpKzoK+/rBqwVkV5tRIQ4AhJ1/Q3d
 wCXB2mcRI6ky/kB/ts8q6gjj+rlJ18NgZf79TA5y5q7FEYc6DvB2dZ+vE+rvCnXk
 q/Bx6q4phEdLbVqet+Ga36qzi9u+pW0P+ntZmi0EQLVnz9p+mtdVaRz32qKp0FYi
 XhHtPMHQDZoOu8utPNtPcfSOZp1sNN8tXqjEAJ4/Ba54badUfg4WJ+RXGPsYH+4T
 lOwpVv7Xp0Sp2b1HivEVfCs1bx2RNzluZisPahBEwdBv+XssvqMbAfG+X3EVk3w=
 =6EYI
 -----END PGP SIGNATURE-----

Merge tag 'trace-3.8-rc4-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull ftrace fix from Steven Rostedt:
 "Kprobes now uses the function tracer if it can.  That is, if a probe
  is placed on a function mcount/nop location, and the arch supports it,
  instead of adding a breakpoint, kprobes will register a function
  callback as that is much more efficient.

  The function tracer requires to update modules before they run, and
  uses the module notifier to do so.  But if something else in the
  module notifiers registers a kprobe at one of these locations, before
  ftrace can get to it, then the system could fail.

  The function tracer must be initialized early, otherwise module
  notifiers that probe will only work by chance."

* tag 'trace-3.8-rc4-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  ftrace: Be first to run code modification on modules
2013-01-22 10:30:49 -08:00
Oleg Nesterov
9067ac85d5 wake_up_process() should be never used to wakeup a TASK_STOPPED/TRACED task
wake_up_process() should never wakeup a TASK_STOPPED/TRACED task.
Change it to use TASK_NORMAL and add the WARN_ON().

TASK_ALL has no other users, probably can be killed.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-01-22 10:08:17 -08:00
Oleg Nesterov
9899d11f65 ptrace: ensure arch_ptrace/ptrace_request can never race with SIGKILL
putreg() assumes that the tracee is not running and pt_regs_access() can
safely play with its stack.  However a killed tracee can return from
ptrace_stop() to the low-level asm code and do RESTORE_REST, this means
that debugger can actually read/modify the kernel stack until the tracee
does SAVE_REST again.

set_task_blockstep() can race with SIGKILL too and in some sense this
race is even worse, the very fact the tracee can be woken up breaks the
logic.

As Linus suggested we can clear TASK_WAKEKILL around the arch_ptrace()
call, this ensures that nobody can ever wakeup the tracee while the
debugger looks at it.  Not only this fixes the mentioned problems, we
can do some cleanups/simplifications in arch_ptrace() paths.

Probably ptrace_unfreeze_traced() needs more callers, for example it
makes sense to make the tracee killable for oom-killer before
access_process_vm().

While at it, add the comment into may_ptrace_stop() to explain why
ptrace_stop() still can't rely on SIGKILL and signal_pending_state().

Reported-by: Salman Qazi <sqazi@google.com>
Reported-by: Suleiman Souhlal <suleiman@google.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-01-22 10:08:00 -08:00
Steven Rostedt
0a71e4c6d7 tracing: Remove trace.h header from trace_clock.c
As trace_clock is used by other things besides tracing, and it
does not require anything from trace.h, it is best not to include
the header file in trace_clock.c.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-01-22 12:06:56 -05:00
Oleg Nesterov
910ffdb18a ptrace: introduce signal_wake_up_state() and ptrace_signal_wake_up()
Cleanup and preparation for the next change.

signal_wake_up(resume => true) is overused. None of ptrace/jctl callers
actually want to wakeup a TASK_WAKEKILL task, but they can't specify the
necessary mask.

Turn signal_wake_up() into signal_wake_up_state(state), reintroduce
signal_wake_up() as a trivial helper, and add ptrace_signal_wake_up()
which adds __TASK_TRACED.

This way ptrace_signal_wake_up() can work "inside" ptrace_request()
even if the tracee doesn't have the TASK_WAKEKILL bit set.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-01-22 08:50:08 -08:00
Steven Rostedt
b000c8065a tracing: Remove the extra 4 bytes of padding in events
Due to a userspace issue with PowerTop v2beta, which hardcoded
the offset of event fields that it was using, it broke when
we removed the Big Kernel Lock counter from the event header.

 (commit e6e1e2593 "tracing: Remove lock_depth from event entry")

Because this broke userspace, it was determined that we must
keep those 4 bytes around.

 (commit a3a4a5acd "Regression: partial revert "tracing: Remove lock_depth from event entry"")

This unfortunately wastes space in the ring buffer. 4 bytes per
event, where a lot of events are just 24 bytes. That's 16% of the
buffer wasted. A million events will add 4 megs of white space
into the buffer.

It was later noticed that PowerTop v2beta could not work on systems
where the kernel was 64 bit but the userspace was 32 bits.
The reason was because the offsets are different between the
two and the hard coded offset of one would not work with the other.

With PowerTop v2 final, it implemented the same interface that both
perf and trace-cmd use. That is, it reads the format file of
the event to find the offsets of the fields it needs. This fixes
the problem with running powertop on a 32 bit userspace running
on a 64 bit kernel. It also no longer requires the 4 byte padding.

As PowerTop v2 has been out for a while, and is included in all
major distributions, it is time that we can safely remove the
4 bytes of padding. Users of PowerTop v2beta should upgrade to
PowerTop v2 final.

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-01-21 21:05:41 -05:00
Masami Hiramatsu
e7dbfe349d kprobes/x86: Move ftrace-based kprobe code into kprobes-ftrace.c
Split ftrace-based kprobes code from kprobes, and introduce
CONFIG_(HAVE_)KPROBES_ON_FTRACE Kconfig flags.
For the cleanup reason, this also moves kprobe_ftrace check
into skip_singlestep.

Link: http://lkml.kernel.org/r/20120928081520.3560.25624.stgit@ltc138.sdl.hitachi.co.jp

Cc: Ingo Molnar <mingo@elte.hu>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-01-21 13:22:36 -05:00
Masami Hiramatsu
06aeaaeabf ftrace: Move ARCH_SUPPORTS_FTRACE_SAVE_REGS in Kconfig
Move SAVE_REGS support flag into Kconfig and rename
it to CONFIG_DYNAMIC_FTRACE_WITH_REGS. This also introduces
CONFIG_HAVE_DYNAMIC_FTRACE_WITH_REGS which indicates
the architecture depending part of ftrace has a code
that saves full registers.
On the other hand, CONFIG_DYNAMIC_FTRACE_WITH_REGS indicates
the code is enabled.

Link: http://lkml.kernel.org/r/20120928081516.3560.72534.stgit@ltc138.sdl.hitachi.co.jp

Cc: Ingo Molnar <mingo@elte.hu>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-01-21 13:22:35 -05:00