The dl_runtime_exceeded() function is supposed to ckeck if
a SCHED_DEADLINE task must be throttled, by checking if its
current runtime is <= 0. However, it also checks if the
scheduling deadline has been missed (the current time is
larger than the current scheduling deadline), further
decreasing the runtime if this happens.
This "double accounting" is wrong:
- In case of partitioned scheduling (or single CPU), this
happens if task_tick_dl() has been called later than expected
(due to small HZ values). In this case, the current runtime is
also negative, and replenish_dl_entity() can take care of the
deadline miss by recharging the current runtime to a value smaller
than dl_runtime
- In case of global scheduling on multiple CPUs, scheduling
deadlines can be missed even if the task did not consume more
runtime than expected, hence penalizing the task is wrong
This patch fix this problem by throttling a SCHED_DEADLINE task
only when its runtime becomes negative, and not modifying the runtime
Signed-off-by: Luca Abeni <luca.abeni@unitn.it>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@gmail.com>
Cc: <stable@vger.kernel.org>
Cc: Dario Faggioli <raistlin@linux.it>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1418813432-20797-3-git-send-email-luca.abeni@unitn.it
Signed-off-by: Ingo Molnar <mingo@kernel.org>
According to global EDF, tasks should be migrated between runqueues
without checking if their scheduling deadlines and runtimes are valid.
However, SCHED_DEADLINE currently performs such a check:
a migration happens doing:
deactivate_task(rq, next_task, 0);
set_task_cpu(next_task, later_rq->cpu);
activate_task(later_rq, next_task, 0);
which ends up calling dequeue_task_dl(), setting the new CPU, and then
calling enqueue_task_dl().
enqueue_task_dl() then calls enqueue_dl_entity(), which calls
update_dl_entity(), which can modify scheduling deadline and runtime,
breaking global EDF scheduling.
As a result, some of the properties of global EDF are not respected:
for example, a taskset {(30, 80), (40, 80), (120, 170)} scheduled on
two cores can have unbounded response times for the third task even
if 30/80+40/80+120/170 = 1.5809 < 2
This can be fixed by invoking update_dl_entity() only in case of
wakeup, or if this is a new SCHED_DEADLINE task.
Signed-off-by: Luca Abeni <luca.abeni@unitn.it>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@gmail.com>
Cc: <stable@vger.kernel.org>
Cc: Dario Faggioli <raistlin@linux.it>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1418813432-20797-2-git-send-email-luca.abeni@unitn.it
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In effective_load, we have (long w * unsigned long tg->shares) / long W,
when w is negative, it is cast to unsigned long and hence the product is
insanely large. Fix this by casting tg->shares to long.
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Yuyang Du <yuyang.du@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Dave Jones <davej@redhat.com>
Cc: Andrey Ryabinin <a.ryabinin@samsung.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20141219002956.GA25405@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
On x86_64, at least, task_pt_regs may be only partially initialized
in many contexts, so x86_64 should not use it without extra care
from interrupt context, let alone NMI context.
This will allow x86_64 to override the logic and will supply some
scratch space to use to make a cleaner copy of user regs.
Tested-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: chenggang.qcg@taobao.com
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Namhyung Kim <namhyung@gmail.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Jean Pihet <jean.pihet@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Salter <msalter@redhat.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Will Deacon <will.deacon@arm.com>
Cc: linux-arm-kernel@lists.infradead.org
Link: http://lkml.kernel.org/r/e431cd4c18c2e1c44c774f10758527fb2d1025c4.1420396372.git.luto@amacapital.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We are aborting a build in case when gcc doesn't support fentry on x86_64
(regs->ip modification can't really reliably work with mcount).
This however breaks allmodconfig for people with older gccs that don't
support -mfentry.
Turn the build-time failure into runtime failure, resulting in the whole
infrastructure not being initialized if CC_USING_FENTRY is unset.
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
wait_consider_task() checks EXIT_ZOMBIE after EXIT_DEAD/EXIT_TRACE and
both checks can fail if we race with EXIT_ZOMBIE -> EXIT_DEAD/EXIT_TRACE
change in between, gcc needs to reload p->exit_state after
security_task_wait(). In this case ->notask_error will be wrongly
cleared and do_wait() can hang forever if it was the last eligible
child.
Many thanks to Arne who carefully investigated the problem.
Note: this bug is very old but it was pure theoretical until commit
b3ab03160d ("wait: completely ignore the EXIT_DEAD tasks"). Before
this commit "-O2" was probably enough to guarantee that compiler won't
read ->exit_state twice.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Arne Goedeke <el@laramies.com>
Tested-by: Arne Goedeke <el@laramies.com>
Cc: <stable@vger.kernel.org> [3.15+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Verify that the frequency value from userspace is valid and makes sense.
Unverified values can cause overflows later on.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: stable <stable@vger.kernel.org>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
[jstultz: Fix up bug for negative values and drop redunent cap check]
Signed-off-by: John Stultz <john.stultz@linaro.org>
An unvalidated user input is multiplied by a constant, which can result in
an undefined behaviour for large values. While this is validated later,
we should avoid triggering undefined behaviour.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: stable <stable@vger.kernel.org>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
[jstultz: include trivial milisecond->microsecond correction noticed
by Andy]
Signed-off-by: John Stultz <john.stultz@linaro.org>
Subtle race conditions can result if a CPU stays in dyntick-idle mode
long enough for the ->gpnum and ->completed fields to wrap. For
example, consider the following sequence of events:
o CPU 1 encounters a quiescent state while waiting for grace period
5 to complete, but then enters dyntick-idle mode.
o While CPU 1 is in dyntick-idle mode, the grace-period counters
wrap around so that the grace period number is now 4.
o Just as CPU 1 exits dyntick-idle mode, grace period 4 completes
and grace period 5 begins.
o The quiescent state that CPU 1 passed through during the old
grace period 5 looks like it applies to the new grace period
5. Therefore, the new grace period 5 completes without CPU 1
having passed through a quiescent state.
This could clearly be a fatal surprise to any long-running RCU read-side
critical section that happened to be running on CPU 1 at the time. At one
time, this was not a problem, given that it takes significant time for
the grace-period counters to overflow even on 32-bit systems. However,
with the advent of NO_HZ_FULL and SMP embedded systems, arbitrarily long
idle periods are now becoming quite feasible. It is therefore time to
close this race.
This commit therefore avoids this race condition by having the
quiescent-state forcing code detect when a CPU is falling too far
behind, and setting a new rcu_data field ->gpwrap when this happens.
Whenever this new ->gpwrap field is set, the CPU's ->gpnum and ->completed
fields are known to be untrustworthy, and can be ignored, along with
any associated quiescent states.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The current RCU CPU stall warning code will print "Stall ended before
state dump start" any time that the stall-warning code is triggered on
a CPU that has already reported a quiescent state for the current grace
period and if all quiescent states have been reported for the current
grace period. However, a true stall can result in these symptoms, for
example, by preventing RCU's grace-period kthreads from ever running
This commit therefore checks for this condition, reporting the end of
the stall only if one of the grace-period counters has actually advanced.
Otherwise, it reports the last time that the grace-period kthread made
meaningful progress. (In normal situations, the grace-period kthread
should make meaningful progress at least every jiffies_till_next_fqs
jiffies.)
Reported-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Miroslav Benes <mbenes@suse.cz>
One way that an RCU CPU stall warning can happen is if the grace-period
kthread is not allowed to execute. One proxy for this kthread's
forward progress is the number of force-quiescent-state (fqs) scans.
This commit therefore adds the number of fqs scans to the RCU CPU stall
warning printouts when CONFIG_RCU_CPU_STALL_INFO=y.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
SRCU is not necessary to be compiled by default in all cases. For tinification
efforts not compiling SRCU unless necessary is desirable.
The current patch tries to make compiling SRCU optional by introducing a new
Kconfig option CONFIG_SRCU which is selected when any of the components making
use of SRCU are selected.
If we do not select CONFIG_SRCU, srcu.o will not be compiled at all.
text data bss dec hex filename
2007 0 0 2007 7d7 kernel/rcu/srcu.o
Size of arch/powerpc/boot/zImage changes from
text data bss dec hex filename
831552 64180 23944 919676 e087c arch/powerpc/boot/zImage : before
829504 64180 23952 917636 e0084 arch/powerpc/boot/zImage : after
so the savings are about ~2000 bytes.
Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
CC: Josh Triplett <josh@joshtriplett.org>
CC: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: resolve conflict due to removal of arch/ia64/kvm/Kconfig. ]
When rcutorture used only the low-order 32 bits of the grace-period
number, it was not a problem for SRCU to use a 32-bit completed field.
However, rcutorture now uses the full 64 bits on 64-bit systems, so
this commit converts SRCU's ->completed field to unsigned long so as to
provide 64 bits on 64-bit systems.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The RCU callback lists are initialized in both rcu_boot_init_percpu_data()
and rcu_init_percpu_data(). The former is intended for initializing
immutable data, so this commit removes the initialization from
rcu_boot_init_percpu_data() and leaves it in rcu_init_percpu_data().
This change prepares for permitting callbacks to be queued very early
in boot.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Now that blocked tasks are no longer migrated to the root rcu_node
structure, there is no need to scan the root rcu_node structure for
blocked tasks stalling the current grace period. This commit therefore
removes this scan.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The patch dfeb9765ce ("Allow post-unlock reference for rt_mutex")
ensured rcu-boost safe even the rt_mutex has post-unlock reference.
But rt_mutex allowing post-unlock reference is definitely a bug and it was
fixed by the commit 27e35715df ("rtmutex: Plug slow unlock race").
This fix made the previous patch (dfeb9765ce) useless.
And even worse, the priority-inversion introduced by the the previous
patch still exists.
rcu_read_unlock_special() {
rt_mutex_unlock(&rnp->boost_mtx);
/* Priority-Inversion:
* the current task had been deboosted and preempted as a low
* priority task immediately, it could wait long before reschedule in,
* and the rcu-booster also waits on this low priority task and sleeps.
* This priority-inversion makes rcu-booster can't work
* as expected.
*/
complete(&rnp->boost_completion);
}
Just revert the patch to avoid it.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The rcu_cleanup_dead_cpu() function (called after a CPU has gone
completely offline) has not reported a quiescent state because there
was probably at least one synchronize_rcu() between the time the CPU
went offline and the CPU_DEAD notifier, and this would have detected
the CPU's offline state via quiescent-state forcing. However, the plan
is for CPUs to take themselves offline, at which point it makes sense
for them to report their own quiescent state. This commit makes this
change in preparation for the new CPU-hotplug setup.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
When rcu_boost_kthread_setaffinity() sees that all CPUs for a given
rcu_node structure are now offline, it affinities the corresponding
RCU-boost ("rcub") kthread away from those CPUs. This is pointless
because the kthread cannot run on those offline CPUs in any case.
This commit therefore removes this unneeded code.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Because there is no longer any preempted tasks on the root rcu_node, and
because there is no longer ever an rcub kthread for the root rcu_node,
this commit drops the code in force_qs_rnp() that attempts to awaken
the non-existent root rcub kthread. This is strictly a performance
enhancement, removing a root rcu_node ->lock acquisition and release
along with some tests in rcu_initiate_boost(), ending with the test that
notes that there is no rcub kthread.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Now that offlining CPUs no longer moves leaf rcu_node structures'
->blkd_tasks lists to the root, there is no way for the root rcu_node
structure's ->blkd_task list to be nonempty, unless the root node is also
the sole leaf node. This commit therefore refrains from creating an rcub
kthread for the root rcu_node structure unless it is also the sole leaf.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Given that there is now arcu_preempt_has_tasks() function that checks
to see if the ->blkd_tasks list is non-empty, this commit makes use of it.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Now that we are not migrating callbacks, there is no need to hold the
->orphan_lock across the the ->qsmaskinit bit-clearing process.
This commit therefore releases ->orphan_lock immediately after adopting
the orphaned RCU callbacks.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
When the last CPU associated with a given leaf rcu_node structure
goes offline, something must be done about the tasks queued on that
rcu_node structure. Each of these tasks has been preempted on one of
the leaf rcu_node structure's CPUs while in an RCU read-side critical
section that it have not yet exited. Handling these tasks is the job of
rcu_preempt_offline_tasks(), which migrates them from the leaf rcu_node
structure to the root rcu_node structure.
Unfortunately, this migration has to be done one task at a time because
each tasks allegiance must be shifted from the original leaf rcu_node to
the root, so that future attempts to deal with these tasks will acquire
the root rcu_node structure's ->lock rather than that of the leaf.
Worse yet, this migration must be done with interrupts disabled, which
is not so good for realtime response, especially given that there is
no bound on the number of tasks on a given rcu_node structure's list.
(OK, OK, there is a bound, it is just that it is unreasonably large,
especially on 64-bit systems.) This was not considered a problem back
when rcu_preempt_offline_tasks() was first written because realtime
systems were assumed not to do CPU-hotplug operations while real-time
applications were running. This assumption has proved of dubious validity
given that people are starting to run multiple realtime applications
on a single SMP system and that it is common practice to offline then
online a CPU before starting its real-time application in order to clear
extraneous processing off of that CPU. So we now need CPU hotplug
operations to avoid undue latencies.
This commit therefore avoids migrating these tasks, instead letting
them be dequeued one by one from the original leaf rcu_node structure
by rcu_read_unlock_special(). This means that the clearing of bits
from the upper-level rcu_node structures must be deferred until the
last such task has been dequeued, because otherwise subsequent grace
periods won't wait on them. This commit has the beneficial side effect
of simplifying the CPU-hotplug code for TREE_PREEMPT_RCU, especially in
CONFIG_RCU_BOOST builds.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit causes rcu_read_unlock_special() to propagate ->qsmaskinit
bit clearing up the rcu_node tree once a given rcu_node structure's
blkd_tasks list becomes empty. This is the final commit in preparation
for the rework of RCU priority boosting: It enables preempted tasks to
remain queued on their rcu_node structure even after all of that rcu_node
structure's CPUs have gone offline.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit abstracts rcu_cleanup_dead_rnp() from rcu_cleanup_dead_cpu()
in preparation for the rework of RCU priority boosting. This new function
will be invoked from rcu_read_unlock_special() in the reworked scheme,
which is why rcu_cleanup_dead_rnp() assumes that the leaf rcu_node
structure's ->qsmaskinit field has already been updated.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit undertakes a simple variable renaming to make way for
some rework of RCU priority boosting.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit prevents random compiler optimizations by applying
ACCESS_ONCE() to lockless accesses.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The 48a7639ce8 ("rcu: Make callers awaken grace-period kthread")
removed the irq_work_queue(), so the TREE_RCU doesn't need
irq work any more. This commit therefore updates RCU's Kconfig and
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The rcu_barrier() no-callbacks check for no-CBs CPUs has race conditions.
It checks a given CPU's lists of callbacks, and if all three no-CBs lists
are empty, ignores that CPU. However, these three lists could potentially
be empty even when callbacks are present if the check executed just as
the callbacks were being moved from one list to another. It turns out
that recent versions of rcutorture can spot this race.
This commit plugs this hole by consolidating the per-list counts of
no-CBs callbacks into a single count, which is incremented before
the corresponding callback is posted and after it is invoked. Then
rcu_barrier() checks this single count to reliably determine whether
the corresponding CPU has no-CBs callbacks.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Commit b2c4623dcd ("rcu: More on deadlock between CPU hotplug and expedited
grace periods") introduced another problem that can easily be reproduced by
starting/stopping cpus in a loop.
E.g.:
for i in `seq 5000`; do
echo 1 > /sys/devices/system/cpu/cpu1/online
echo 0 > /sys/devices/system/cpu/cpu1/online
done
Will result in:
INFO: task /cpu_start_stop:1 blocked for more than 120 seconds.
Call Trace:
([<00000000006a028e>] __schedule+0x406/0x91c)
[<0000000000130f60>] cpu_hotplug_begin+0xd0/0xd4
[<0000000000130ff6>] _cpu_up+0x3e/0x1c4
[<0000000000131232>] cpu_up+0xb6/0xd4
[<00000000004a5720>] device_online+0x80/0xc0
[<00000000004a57f0>] online_store+0x90/0xb0
...
And a deadlock.
Problem is that if the last ref in put_online_cpus() can't get the
cpu_hotplug.lock the puts_pending count is incremented, but a sleeping
active_writer might never be woken up, therefore never exiting the loop in
cpu_hotplug_begin().
This fix removes puts_pending and turns refcount into an atomic variable. We
also introduce a wait queue for the active_writer, to avoid possible races and
use-after-free. There is no need to take the lock in put_online_cpus() anymore.
Can't reproduce it with this fix.
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
For RCU in UP, context-switch = QS = GP, thus we can force a
context-switch when any call_rcu_[bh|sched]() is happened on idle_task.
After doing so, rcu_idle/irq_enter/exit() are useless, so we can simply
make these functions empty.
More important, this change does not change the functionality logically.
Note: raise_softirq(RCU_SOFTIRQ)/rcu_sched_qs() in rcu_idle_enter() and
outmost rcu_irq_exit() will have to wake up the ksoftirqd
(due to in_interrupt() == 0).
Before this patch After this patch:
call_rcu_sched() in idle; call_rcu_sched() in idle
set resched
do other stuffs; do other stuffs
outmost rcu_irq_exit() outmost rcu_irq_exit() (empty function)
(or rcu_idle_enter()) (or rcu_idle_enter(), also empty function)
start to resched. (see above)
rcu_sched_qs() rcu_sched_qs()
QS,and GP and advance cb QS,and GP and advance cb
wake up the ksoftirqd wake up the ksoftirqd
set resched
resched to ksoftirqd (or other) resched to ksoftirqd (or other)
These two code patches are almost the same.
Size changed after patched:
size kernel/rcu/tiny-old.o kernel/rcu/tiny-patched.o
text data bss dec hex filename
3449 206 8 3663 e4f kernel/rcu/tiny-old.o
2406 144 8 2558 9fe kernel/rcu/tiny-patched.o
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Pull audit fix from Paul Moore:
"One audit patch to resolve a panic/oops when recording filenames in
the audit log, see the mail archive link below.
The fix isn't as nice as I would like, as it involves an allocate/copy
of the filename, but it solves the problem and the overhead should
only affect users who have configured audit rules involving file
names.
We'll revisit this issue with future kernels in an attempt to make
this suck less, but in the meantime I think this fix should go into
the next release of v3.19-rcX.
[ https://marc.info/?t=141986927600001&r=1&w=2 ]"
* 'upstream' of git://git.infradead.org/users/pcmoore/audit:
audit: create private file name copies when auditing inodes
Despite what the comment says, it is only softirqs that are disabled,
not interrupts. This commit therefore fixes the comment.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Let's start assuming that something in the idle loop posts a callback,
and scheduling-clock interrupt occurs:
1. The system is idle and stays that way, no runnable tasks.
2. Scheduling-clock interrupt occurs, rcu_check_callbacks() is called
as result, which in turn calls rcu_is_cpu_rrupt_from_idle().
3. rcu_is_cpu_rrupt_from_idle() reports the CPU was interrupted from
idle, which results in rcu_sched_qs() call, which does a
raise_softirq(RCU_SOFTIRQ).
4. Upon return from interrupt, rcu_irq_exit() is invoked, which calls
rcu_idle_enter_common(), which in turn calls rcu_sched_qs() again,
which does another raise_softirq(RCU_SOFTIRQ).
5. The softirq happens shortly and invokes rcu_process_callbacks(),
which invokes __rcu_process_callbacks().
6. So now callbacks can be invoked. At least they can be if
->donetail has been updated. Which it will have been because
rcu_sched_qs() invokes rcu_qsctr_help().
In the described scenario rcu_sched_qs() and raise_softirq(RCU_SOFTIRQ)
get called twice in steps 3 and 4. This redundancy could be eliminated
by removing rcu_is_cpu_rrupt_from_idle() function.
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The x86 architecture has multiple types of NMI-like interrupts: real
NMIs, machine checks, and, for some values of NMI-like, debugging
and breakpoint interrupts. These interrupts can nest inside each
other. Andy Lutomirski is adding RCU support to these interrupts,
so rcu_nmi_enter() and rcu_nmi_exit() must now correctly handle nesting.
This commit therefore introduces nesting, using a clever NMI-coordination
algorithm suggested by Andy. The trick is to atomically increment
->dynticks (if needed) before manipulating ->dynticks_nmi_nesting on entry
(and, accordingly, after on exit). In addition, ->dynticks_nmi_nesting
is incremented by one if ->dynticks was incremented and by two otherwise.
This means that when rcu_nmi_exit() sees ->dynticks_nmi_nesting equal
to one, it knows that ->dynticks must be atomically incremented.
This NMI-coordination algorithms has been validated by the following
Promela model:
------------------------------------------------------------------------
/*
* Promela model for Andy Lutomirski's suggested change to rcu_nmi_enter()
* that allows nesting.
*
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation; either version 2 of the License, or
* (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with this program; if not, you can access it online at
* http://www.gnu.org/licenses/gpl-2.0.html.
*
* Copyright IBM Corporation, 2014
*
* Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
*/
byte dynticks_nmi_nesting = 0;
byte dynticks = 0;
/*
* Promela verision of rcu_nmi_enter().
*/
inline rcu_nmi_enter()
{
byte incby;
byte tmp;
incby = BUSY_INCBY;
assert(dynticks_nmi_nesting >= 0);
if
:: (dynticks & 1) == 0 ->
atomic {
dynticks = dynticks + 1;
}
assert((dynticks & 1) == 1);
incby = 1;
:: else ->
skip;
fi;
tmp = dynticks_nmi_nesting;
tmp = tmp + incby;
dynticks_nmi_nesting = tmp;
assert(dynticks_nmi_nesting >= 1);
}
/*
* Promela verision of rcu_nmi_exit().
*/
inline rcu_nmi_exit()
{
byte tmp;
assert(dynticks_nmi_nesting > 0);
assert((dynticks & 1) != 0);
if
:: dynticks_nmi_nesting != 1 ->
tmp = dynticks_nmi_nesting;
tmp = tmp - BUSY_INCBY;
dynticks_nmi_nesting = tmp;
:: else ->
dynticks_nmi_nesting = 0;
atomic {
dynticks = dynticks + 1;
}
assert((dynticks & 1) == 0);
fi;
}
/*
* Base-level NMI runs non-atomically. Crudely emulates process-level
* dynticks-idle entry/exit.
*/
proctype base_NMI()
{
byte busy;
busy = 0;
do
:: /* Emulate base-level dynticks and not. */
if
:: 1 -> atomic {
dynticks = dynticks + 1;
}
busy = 1;
:: 1 -> skip;
fi;
/* Verify that we only sometimes have base-level dynticks. */
if
:: busy == 0 -> skip;
:: busy == 1 -> skip;
fi;
/* Model RCU's NMI entry and exit actions. */
rcu_nmi_enter();
assert((dynticks & 1) == 1);
rcu_nmi_exit();
/* Emulated re-entering base-level dynticks and not. */
if
:: !busy -> skip;
:: busy ->
atomic {
dynticks = dynticks + 1;
}
busy = 0;
fi;
/* We had better now be in dyntick-idle mode. */
assert((dynticks & 1) == 0);
od;
}
/*
* Nested NMI runs atomically to emulate interrupting base_level().
*/
proctype nested_NMI()
{
do
:: /*
* Use an atomic section to model a nested NMI. This is
* guaranteed to interleave into base_NMI() between a pair
* of base_NMI() statements, just as a nested NMI would.
*/
atomic {
/* Verify that we only sometimes are in dynticks. */
if
:: (dynticks & 1) == 0 -> skip;
:: (dynticks & 1) == 1 -> skip;
fi;
/* Model RCU's NMI entry and exit actions. */
rcu_nmi_enter();
assert((dynticks & 1) == 1);
rcu_nmi_exit();
}
od;
}
init {
run base_NMI();
run nested_NMI();
}
------------------------------------------------------------------------
The following script can be used to run this model if placed in
rcu_nmi.spin:
------------------------------------------------------------------------
if ! spin -a rcu_nmi.spin
then
echo Spin errors!!!
exit 1
fi
if ! cc -DSAFETY -o pan pan.c
then
echo Compilation errors!!!
exit 1
fi
./pan -m100000
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
The current timecounter implementation will drop a variable amount
of resolution, depending on the magnitude of the time delta. In
other words, reading the clock too often or too close to a time
stamp conversion will introduce errors into the time values. This
patch fixes the issue by introducing a fractional nanosecond field
that accumulates the low order bits.
Reported-by: Janusz Użycki <j.uzycki@elproma.com.pl>
Signed-off-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The timecounter code has almost nothing to do with the clocksource
code. Let it live in its own file. This will help isolate the
timecounter users from the clocksource users in the source tree.
Signed-off-by: Richard Cochran <richardcochran@gmail.com>
Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Fix double SKB free in bluetooth 6lowpan layer, from Jukka Rissanen.
2) Fix receive checksum handling in enic driver, from Govindarajulu
Varadarajan.
3) Fix NAPI poll list corruption in virtio_net and caif_virtio, from
Herbert Xu. Also, add code to detect drivers that have this mistake
in the future.
4) Fix doorbell endianness handling in mlx4 driver, from Amir Vadai.
5) Don't clobber IP6CB() before xfrm6_policy_check() is called in TCP
input path,f rom Nicolas Dichtel.
6) Fix MPLS action validation in openvswitch, from Pravin B Shelar.
7) Fix double SKB free in vxlan driver, also from Pravin.
8) When we scrub a packet, which happens when we are switching the
context of the packet (namespace, etc.), we should reset the
secmark. From Thomas Graf.
9) ->ndo_gso_check() needs to do more than return true/false, it also
has to allow the driver to clear netdev feature bits in order for
the caller to be able to proceed properly. From Jesse Gross.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (62 commits)
genetlink: A genl_bind() to an out-of-range multicast group should not WARN().
netlink/genetlink: pass network namespace to bind/unbind
ne2k-pci: Add pci_disable_device in error handling
bonding: change error message to debug message in __bond_release_one()
genetlink: pass multicast bind/unbind to families
netlink: call unbind when releasing socket
netlink: update listeners directly when removing socket
genetlink: pass only network namespace to genl_has_listeners()
netlink: rename netlink_unbind() to netlink_undo_bind()
net: Generalize ndo_gso_check to ndo_features_check
net: incorrect use of init_completion fixup
neigh: remove next ptr from struct neigh_table
net: xilinx: Remove unnecessary temac_property in the driver
net: phy: micrel: use generic config_init for KSZ8021/KSZ8031
net/core: Handle csum for CHECKSUM_COMPLETE VXLAN forwarding
openvswitch: fix odd_ptr_err.cocci warnings
Bluetooth: Fix accepting connections when not using mgmt
Bluetooth: Fix controller configuration with HCI_QUIRK_INVALID_BDADDR
brcmfmac: Do not crash if platform data is not populated
ipw2200: select CFG80211_WEXT
...
Unfortunately, while commit 4a928436 ("audit: correctly record file
names with different path name types") fixed a problem where we were
not recording filenames, it created a new problem by attempting to use
these file names after they had been freed. This patch resolves the
issue by creating a copy of the filename which the audit subsystem
frees after it is done with the string.
At some point it would be nice to resolve this issue with refcounts,
or something similar, instead of having to allocate/copy strings, but
that is almost surely beyond the scope of a -rcX patch so we'll defer
that for later. On the plus side, only audit users should be impacted
by the string copying.
Reported-by: Toralf Foerster <toralf.foerster@gmx.de>
Signed-off-by: Paul Moore <pmoore@redhat.com>
Netlink families can exist in multiple namespaces, and for the most
part multicast subscriptions are per network namespace. Thus it only
makes sense to have bind/unbind notifications per network namespace.
To achieve this, pass the network namespace of a given client socket
to the bind/unbind functions.
Also do this in generic netlink, and there also make sure that any
bind for multicast groups that only exist in init_net is rejected.
This isn't really a problem if it is accepted since a client in a
different namespace will never receive any notifications from such
a group, but it can confuse the family if not rejected (it's also
possible to silently (without telling the family) accept it, but it
would also have to be ignored on unbind so families that take any
kind of action on bind/unbind won't do unnecessary work for invalid
clients like that.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull audit fixes from Paul Moore:
"Four patches to fix various problems with the audit subsystem, all are
fairly small and straightforward.
One patch fixes a problem where we weren't using the correct gfp
allocation flags (GFP_KERNEL regardless of context, oops), one patch
fixes a problem with old userspace tools (this was broken for a
while), one patch fixes a problem where we weren't recording pathnames
correctly, and one fixes a problem with PID based filters.
In general I don't think there is anything controversial with this
patchset, and it fixes some rather unfortunate bugs; the allocation
flag one can be particularly scary looking for users"
* 'upstream' of git://git.infradead.org/users/pcmoore/audit:
audit: restore AUDIT_LOGINUID unset ABI
audit: correctly record file names with different path name types
audit: use supplied gfp_mask from audit_buffer in kauditd_send_multicast_skb
audit: don't attempt to lookup PIDs when changing PID filtering audit rules
A regression was caused by commit 780a7654ce:
audit: Make testing for a valid loginuid explicit.
(which in turn attempted to fix a regression caused by e1760bd)
When audit_krule_to_data() fills in the rules to get a listing, there was a
missing clause to convert back from AUDIT_LOGINUID_SET to AUDIT_LOGINUID.
This broke userspace by not returning the same information that was sent and
expected.
The rule:
auditctl -a exit,never -F auid=-1
gives:
auditctl -l
LIST_RULES: exit,never f24=0 syscall=all
when it should give:
LIST_RULES: exit,never auid=-1 (0xffffffff) syscall=all
Tag it so that it is reported the same way it was set. Create a new
private flags audit_krule field (pflags) to store it that won't interact with
the public one from the API.
Cc: stable@vger.kernel.org # v3.10-rc1+
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <pmoore@redhat.com>
When allocating space for load_balance_mask, in sched_init, when
CPUMASK_OFFSTACK is set, we've managed to spill over
KMALLOC_MAX_SIZE on our 6144 core machine. The patch below
breaks up the allocations so that they don't overflow the max
alloc size. It also allocates the masks on the the node from
which they'll most commonly be accessed, to minimize remote
accesses on NUMA machines.
Suggested-by: George Beshers <gbeshers@sgi.com>
Signed-off-by: Alex Thorlton <athorlton@sgi.com>
Cc: George Beshers <gbeshers@sgi.com>
Cc: Russ Anderson <rja@sgi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1418928270-148543-1-git-send-email-athorlton@sgi.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Taking the global mutex "trace_types_lock" in the trace_pipe files
causes a bottle neck as most the pipe files can be read per cpu
and there's no reason to serialize them.
The current_trace variable was given a ref count and it can not
change when the ref count is not zero. Opening the trace_pipe
files will up the ref count (and decremented on close), so that
the lock no longer needs to be taken when accessing the
current_trace variable.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
I rebased Kees' 'param: do not set store func without write perm'
on top of my 'params: cleanup sysfs allocation'. However, my patch
uses krealloc which doesn't zero memory, leaving .store unset.
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
When one of the trace pipe files are being read (by either the trace_pipe
or trace_pipe_raw), do not allow the current_trace to change. By adding
a ref count that is incremented when the pipe files are opened, will
prevent the current_trace from being changed.
This will allow for the removal of the global trace_types_lock from
reading the pipe buffers (which is currently a bottle neck).
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Use the FTRACE_OPS_FL_IPMODIFY flag to prevent conflicts with other
ftrace users who also modify regs->ip.
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>