Add appropriate #ifdef guards to ensure the smp-only easstats structs
are not used when smp is not enabled. Arnd got a report from buildbot,
analysed it, and pointed out exactly what the issue was.
Reported-by: "Arnd Bergmann" <arnd@arndb.de>
Suggested-by: "Arnd Bergmann" <arnd@arndb.de>
Fixes: 4b85765a3d ("sched/fair: Add eas (& cas)
specific rq, sd and task stats")
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Change-Id: I60554dea20137f6774db3f59b4afd40a06554cfc
When EAS is enabled during boot, we have to be careful not to use
schedtune from fair.c before it is ready or it will warn us and we'll
get a traceback in the console.
Change-Id: I1a5cf29b18af626545c636c51219f9ed497c19fa
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
sched_energy_diff tracepoint is in a place where it can never trace
payoff or nrg.delta. If CONFIG_SCHED_TUNE is enabled, put it in
a place where those values exist. If it is not enabled, trace from
the current location
Change-Id: Id5442f2b34ec76625491d27c0f4285433ca12699
Reported-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
We use 5 groups everywhere else, this should default to the same.
Change-Id: I05a20bdcf8046ea90a2e36979940cef11246e735
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
When using WALT we always used boosted cpu util for OPP selection.
This is the primary purpose for boosted cpu util, but we hadn't
changed the PELT utilization check to do the same thing.
Fix that here.
Change-Id: Id5ffb26eac23b25fe754255221f6d21b8cededfd
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
sched_group_energy() was supposed to support per-cpu capacity states
(DVFS), however, while fixing a hotplug issue this was broken as we bail
out if there is no SD_SHARE_CAP_STATES flag set.
This patch implements the hotplug race check differently and should
therefore reinstate support for per-cpu capacity states.
Change-Id: I5b865666c9ce833dcfa6514c574580d75aa0a195
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
In some cases, the new_util of a task can be the same on several
CPUs. This causes an issue because the target_util is only updated
if the current new_util is strictly smaller than target_util.
To fix that, the cpu_util_wake() return value is used alongside the
new_util value. If two CPUs compute the same new_util value,
we'll now also look at their cpu_util_wake() return value. In this
case, the CPU that last ran the task will be chosen in priority.
Change-Id: Ia1ea2c4b3ec39621372c2f748862317d5b497723
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
wake_cap performs task and cpu utilization synchronization which is
what allows us to subtract current task util from prev_cpu util and
have a sensible number to work with.
It looks as though if wake_wide returns 0, we could potentially not
execute wake_cap, which would result in unsynced signals we then use
for energy calculations.
This is not necessarily an issue we've seen in traces, but it looks
as though it should be changed.
Change-Id: Ic54a3cba2a10d946ea20113a04371dea04115e82
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
When we place a waking task with find_best_target, we calculate the
existing and new utilisation of each candidate cpu. However, we do
not remove any blocked load resulting from the waking task on the
previous cpu which might cause unnecessary migrations.
Switch to using cpu_util_wake which does this for us, which requires
moving cpu_util_wake a few functions earlier.
Also, we have multiple potential cpu utilization signals here, so
update the necessary bits to allow WALT to work properly (including
not subtracting task util for WALT).
When WALT is in use, cpu utilization is the utilization
in the previous completed window, whilst the task utilization
ignores fully idle windows. There seems to be no way to have a
decently accurate estimate of how much (if any) utilization from
this task remains on the prev cpu.
Instead, just return cpu_util when we're using WALT.
Change-Id: I448203ab98ffb5c020dfb6b218581eef1f5601f7
Reported-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
With the ability to choose between WALT and PELT for utilisation tracking
we can have the situation where we're using WALT to make all the
decisions and reporting PELT figures in the sched_load_avg_(cpu|task)
trace points. This is not too much of an issue, but when analysing trace
it is nice to see numbers representing what the scheduler is using rather
than needing to add in additional sched_walt_* traces to figure it out.
Add reporting for both types, and make the util_avg member reflect what
will be seen from cpu or task_util functions in the scheduler.
Change-Id: I2abbd2c5fa70822096d0f3372b4c12b1c6af1590
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
For Energy-Aware Scheduling (EAS) to work properly, even in the
case that there is only one cpu per cluster or that cpus are hot-plugged
out, the Energy Model (EM) data on all energy-aware sched domains (sd)
has to be present for all online cpus.
Mainline sd hierarchy setup code will remove sd's which are not useful
for task scheduling e.g. in the following situations:
1. Only 1 cpu is/remains in one cluster of a multi cluster system.
This remaining cpu only has DIE and no MC sd.
2. A complete cluster in a two cluster system is hot-plugged out.
The cpus of the remaining cluster only have MC and no DIE sd.
To make sure that all online cpus keep all their energy-aware sd's,
the sd degenerate functionality has been changed to not free a sd if
its first sched group (sg) contains EM data in case:
1. There is only 1 cpu left in the sd.
2. There have to be at least 2 sg's if certain sd flags are set.
Instead of freeing such a sd it now clears only its SD_LOAD_BALANCE
flag. This will make sure that the EAS functionality will always see
all energy-aware sd's for all online cpus.
It will introduce a tiny performance degradation for operations on
affected cpus since the hot-path macro for_each_domain() has to deal
with sd's not contributing to task scheduling at all now.
In most cases the exisiting code makes sure that task scheduling is not
invoked on a sd with !SD_LOAD_BALANCE.
However, a small change is necessary in update_sd_lb_stats() to make
sure that sd->parent is only initialized to !NULL in case the parent sd
contains more than 1 sg.
The handling of newidle decay values before the SD_LOAD_BALANCE check in
rebalance_domains() stays unchanged.
Test (w/ CONFIG_SCHED_DEBUG):
JUNO r0 default system:
$ cat /proc/cpuinfo | grep "^CPU part"
CPU part : 0xd03
CPU part : 0xd07
CPU part : 0xd07
CPU part : 0xd03
CPU part : 0xd03
CPU part : 0xd03
SD names and flags:
$ cat /proc/sys/kernel/sched_domain/cpu*/domain*/name
MC
DIE
MC
DIE
MC
DIE
MC
DIE
MC
DIE
MC
DIE
$ printf "%x\n" `cat /proc/sys/kernel/sched_domain/cpu*/domain*/flags`
832f
102f
832f
102f
832f
102f
832f
102f
832f
102f
832f
102f
Test 1: Hotplug-out one A57 (CPU part 0xd07) cpu:
$ echo 0 > /sys/devices/system/cpu/cpu1/online
$ cat /proc/cpuinfo | grep "^CPU part"
CPU part : 0xd03
CPU part : 0xd07
CPU part : 0xd03
CPU part : 0xd03
CPU part : 0xd03
SD names and flags for remaining A57 (cpu2) cpu:
$ cat /proc/sys/kernel/sched_domain/cpu2/domain*/name
MC
DIE
$ printf "%x\n" `cat /proc/sys/kernel/sched_domain/cpu2/domain*/flags`
832e <-- MC SD with !SD_LOAD_BALANCE
102f
Test 2: Hotplug-out the entire A57 cluster:
$ echo 0 > /sys/devices/system/cpu/cpu1/online
$ echo 0 > /sys/devices/system/cpu/cpu2/online
$ cat /proc/cpuinfo | grep "^CPU part"
CPU part : 0xd03
CPU part : 0xd03
CPU part : 0xd03
CPU part : 0xd03
SD names and flags for the remaining A53 (CPU part 0xd03) cluster:
$ cat /proc/sys/kernel/sched_domain/cpu*/domain*/name
MC
DIE
MC
DIE
MC
DIE
MC
DIE
$ printf "%x\n" `cat /proc/sys/kernel/sched_domain/cpu*/domain*/flags`
832f
102e <-- DIE SD with !SD_LOAD_BALANCE
832f
102e
832f
102e
832f
102e
Change-Id: If24aa2b2628f334abbf0207d39e2a86168d9d673
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
The update of the share of a cfs_rq is done when its load_avg is updated
but before the group_entity's load_avg has been updated for the past time
slot. This generates wrong load_avg accounting which can be significant
when small tasks are involved in the scheduling.
Let take the example of a task a that is dequeued of its task group A:
root
(cfs_rq)
\
(se)
A
(cfs_rq)
\
(se)
a
Task "a" was the only task in task group A which becomes idle when a is
dequeued.
We have the sequence:
- dequeue_entity a->se
- update_load_avg(a->se)
- dequeue_entity_load_avg(A->cfs_rq, a->se)
- update_cfs_shares(A->cfs_rq)
A->cfs_rq->load.weight == 0
A->se->load.weight is updated with the new share (0 in this case)
- dequeue_entity A->se
- update_load_avg(A->se) but its weight is now null so the last time
slot (up to a tick) will be accounted with a weight of 0 instead of
its real weight during the time slot. The last time slot will be
accounted as an idle one whereas it was a running one.
If the running time of task a is short enough that no tick happens when it
runs, all running time of group entity A->se will be accounted as idle
time.
Instead, we should update the share of a cfs_rq (in fact the weight of its
group entity) only after having updated the load_avg of the group_entity.
update_cfs_shares() now takes the sched_entity as a parameter instead of the
cfs_rq, and the weight of the group_entity is updated only once its load_avg
has been synced with current time.
Change-Id: Id6ce3be1767b44b444ce2a77ed1ba063e57c0664
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: pjt@google.com
Link: http://lkml.kernel.org/r/1482335426-7664-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 89ee048f3cc796db6f26906c6bef4edf0bee70fd)
[minor cherry pick stuff]
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Commit:
fde7d22e01 ("sched/fair: Fix overly small weight for interactive group entities")
did something non-obvious but also did it buggy yet latent.
The problem was exposed for real by a later commit in the v4.7 merge window:
2159197d6677 ("sched/core: Enable increased load resolution on 64-bit kernels")
... after which tg->load_avg and cfs_rq->load.weight had different
units (10 bit fixed point and 20 bit fixed point resp.).
Add a comment to explain the use of cfs_rq->load.weight over the
'natural' cfs_rq->avg.load_avg and add scale_load_down() to correct
for the difference in unit.
Since this is (now, as per a previous commit) the only user of
calc_tg_weight(), collapse it.
The effects of this bug should be randomly inconsistent SMP-balancing
of cgroups workloads.
Change-Id: If1e565662ea163485edd94a12aef644d0e0dfe7a
Reported-by: Jirka Hladky <jhladky@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 2159197d6677 ("sched/core: Enable increased load resolution on 64-bit kernels")
Fixes: fde7d22e01 ("sched/fair: Fix overly small weight for interactive group entities")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit ea1dc6fc6242f991656e35e2ed3d90ec1cd13418)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
A scheduler performance regression has been reported by Joseph Salisbury,
which he bisected back to:
3d30544f0212 ("sched/fair: Apply more PELT fixes)
The regression triggers when several levels of task groups are involved
(read: SystemD) and cpu_possible_mask != cpu_present_mask.
The root cause is that group entity's load (tg_child->se[i]->avg.load_avg)
is initialized to scale_load_down(se->load.weight). During the creation of
a child task group, its group entities on possible CPUs are attached to
parent's cfs_rq (tg_parent) and their loads are added to the parent's load
(tg_parent->load_avg) with update_tg_load_avg().
But only the load on online CPUs will then be updated to reflect real load,
whereas load on other CPUs will stay at the initial value.
The result is a tg_parent->load_avg that is higher than the real load, the
weight of group entities (tg_parent->se[i]->load.weight) on online CPUs is
smaller than it should be, and the task group gets a less running time than
what it could expect.
( This situation can be detected with /proc/sched_debug. The ".tg_load_avg"
of the task group will be much higher than sum of ".tg_load_avg_contrib"
of online cfs_rqs of the task group. )
The load of group entities don't have to be intialized to something else
than 0 because their load will increase when an entity is attached.
Change-Id: Ie55021ff98ba49016adfddb2444e9c9709939226
Reported-by: Joseph Salisbury <joseph.salisbury@canonical.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: <stable@vger.kernel.org> # 4.8.x
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: joonwoop@codeaurora.org
Fixes: 3d30544f0212 ("sched/fair: Apply more PELT fixes)
Link: http://lkml.kernel.org/r/1476881123-10159-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit b5a9b340789b2b24c6896bcf7a065c31a4db671c)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Starting with the following commit:
fde7d22e01 ("sched/fair: Fix overly small weight for interactive group entities")
calc_tg_weight() doesn't compute the right value as expected by effective_load().
The difference is in the 'correction' term. In order to ensure \Sum
rw_j >= rw_i we cannot use tg->load_avg directly, since that might be
lagging a correction on the current cfs_rq->avg.load_avg value.
Therefore we use tg->load_avg - cfs_rq->tg_load_avg_contrib +
cfs_rq->avg.load_avg.
Now, per the referenced commit, calc_tg_weight() doesn't use
cfs_rq->avg.load_avg, as is later used in @w, but uses
cfs_rq->load.weight instead.
So stop using calc_tg_weight() and do it explicitly.
The effects of this bug are wake_affine() making randomly
poor choices in cgroup-intense workloads.
Change-Id: I1c0058ff674650cf295c8dc3b88a5a3de4bddab0
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@vger.kernel.org> # v4.3+
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: fde7d22e01 ("sched/fair: Fix overly small weight for interactive group entities")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 7dd4912594daf769a46744848b05bd5bc6d62469)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
A task can be asynchronously detached from cfs_rq when migrating
between CPUs. The load of the migrated task is then removed from
source cfs_rq during its next update. We use this event to set
propagation flag.
During the load balance, we take advantage of the update of blocked
load to propagate any pending changes.
The propagation relies on patch:
"sched: Fix hierarchical order in rq->leaf_cfs_rq_list"
... which orders children and parents, to ensure that it's done in one pass.
Change-Id: I33782e35fc4711f5901e8c23d6aa7ec5f2ff7ee5
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: kernellwp@gmail.com
Cc: pjt@google.com
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1478598827-32372-6-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 4e5160766fcc9f41bbd38bac11f92dce993644aa)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
When a task moves from/to a cfs_rq, we set a flag which is then used to
propagate the change at parent level (sched_entity and cfs_rq) during
next update. If the cfs_rq is throttled, the flag will stay pending until
the cfs_rq is unthrottled.
For propagating the utilization, we copy the utilization of group cfs_rq to
the sched_entity.
For propagating the load, we have to take into account the load of the
whole task group in order to evaluate the load of the sched_entity.
Similarly to what was done before the rewrite of PELT, we add a correction
factor in case the task group's load is greater than its share so it will
contribute the same load of a task of equal weight.
Change-Id: Id34a9888484716961c9027299c0b4d82881a39d1
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: kernellwp@gmail.com
Cc: pjt@google.com
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1478598827-32372-5-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 09a43ace1f986b003c118fdf6ddf1fd685692d49)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Fix the insertion of cfs_rq in rq->leaf_cfs_rq_list to ensure that a
child will always be called before its parent.
The hierarchical order in shares update list has been introduced by
commit:
67e86250f8 ("sched: Introduce hierarchal order on shares update list")
With the current implementation a child can be still put after its
parent.
Lets take the example of:
root
\
b
/\
c d*
|
e*
with root -> b -> c already enqueued but not d -> e so the
leaf_cfs_rq_list looks like: head -> c -> b -> root -> tail
The branch d -> e will be added the first time that they are enqueued,
starting with e then d.
When e is added, its parents is not already on the list so e is put at
the tail : head -> c -> b -> root -> e -> tail
Then, d is added at the head because its parent is already on the
list: head -> d -> c -> b -> root -> e -> tail
e is not placed at the right position and will be called the last
whereas it should be called at the beginning.
Because it follows the bottom-up enqueue sequence, we are sure that we
will finished to add either a cfs_rq without parent or a cfs_rq with a
parent that is already on the list. We can use this event to detect
when we have finished to add a new branch. For the others, whose
parents are not already added, we have to ensure that they will be
added after their children that have just been inserted the steps
before, and after any potential parents that are already in the list.
The easiest way is to put the cfs_rq just after the last inserted one
and to keep track of it untl the branch is fully added.
Change-Id: I4fe0b8502ea628c13d14e8e5c5279bce67fb8845
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: kernellwp@gmail.com
Cc: pjt@google.com
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1478598827-32372-3-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 9c2791f936ef5fd04a118b5c284f2c9a95f4a647)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Every time we modify load/utilization of sched_entity, we start to
sync it with its cfs_rq. This update is done in different ways:
- when attaching/detaching a sched_entity, we update cfs_rq and then
we sync the entity with the cfs_rq.
- when enqueueing/dequeuing the sched_entity, we update both
sched_entity and cfs_rq metrics to now.
Use update_load_avg() everytime we have to update and sync cfs_rq and
sched_entity before changing the state of a sched_enity.
Change-Id: Ibde9a7e07ac80e9d5753bb4a0c30dfb3643cc666
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: kernellwp@gmail.com
Cc: pjt@google.com
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1478598827-32372-4-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
[backported FROMLIST]
Signed-off-by: Andres Oportus <andresoportus@google.com>
(cherry picked from commit d31b1a66cbe0931733583ad9d9e8c6cfd710907d)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Vincent noted that the update_tg_load_avg() usage in commit:
3d30544f0212 ("sched/fair: Apply more PELT fixes")
isn't entirely sufficient. We need to call this function every time
cfs_rq->avg.load changes, this includes when update_cfs_rq_load_avg()
returns true, but {attach,detach}_entity_load_avg() themselves also
change it. This means we need to unconditionally call
update_tg_load_avg().
Also, add more comments.
Change-Id: I7e55fceb587601f73c760c8b0d47a7ef2b777b9e
Reported-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 7c3edd2c300b7ef2005a69dc727692ee07434aa5)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
One additional 'rule' for using update_cfs_rq_load_avg() is that one
should call update_tg_load_avg() if it returns true.
Add a bunch of comments to hopefully clarify some of the rules:
o You need to update cfs_rq _before_ any entity attach/detach,
this is important, because while for mathmatical consisency this
isn't strictly needed, it is required for the physical
interpretation of the model, you attach/detach _now_.
o When you modify the cfs_rq avg, you have to then call
update_tg_load_avg() in order to propagate changes upwards.
o (Fair) entities are always attached, switched_{to,from}_fair()
deal with !fair. This directly follows from the definition of the
cfs_rq averages, namely that they are a direct sum of all
(runnable or blocked) entities on that rq.
It is the second rule that this patch enforces, but it adds comments
pertaining to all of them.
Change-Id: Icdc906e98c67b84cb9582c893bc761a9886be57a
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 3d30544f02120b884bba2a9466c87dba980e3be5)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
A new task's util_avg is set to full utilization of a CPU (100% time
running). This accelerates a new task's utilization ramp-up, useful to
boost its execution in early time. However, it may result in
(insanely) high utilization for a transient time period when a flood
of tasks are spawned. Importantly, it violates the "fundamentally
bounded" CPU utilization, and its side effect is negative if we don't
take any measure to bound it.
This patch proposes an algorithm to address this issue. It has
two methods to approach a sensible initial util_avg:
(1) An expected (or average) util_avg based on its cfs_rq's util_avg:
util_avg = cfs_rq->util_avg / (cfs_rq->load_avg + 1) * se.load.weight
(2) A trajectory of how successive new tasks' util develops, which
gives 1/2 of the left utilization budget to a new task such that
the additional util is noticeably large (when overall util is low) or
unnoticeably small (when overall util is high enough). In the meantime,
the aggregate utilization is well bounded:
util_avg_cap = (1024 - cfs_rq->avg.util_avg) / 2^n
where n denotes the nth task.
If util_avg is larger than util_avg_cap, then the effective util is
clamped to the util_avg_cap.
Change-Id: Idafe989b24d9e70911666f09800bf1d5a011e1f4
Reported-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Yuyang Du <yuyang.du@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: morten.rasmussen@arm.com
Cc: pjt@google.com
Cc: steve.muckle@linaro.org
Link: http://lkml.kernel.org/r/1459283456-21682-1-git-send-email-yuyang.du@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 2b8c41daba327c633228169e8bd8ec067ab443f8)
[integrate with schedfreq - schedfreq has a tuneable for init task util
but this commit removes the use of the tuneable since we have a new
algorithm for calculating an initial utilisation. I've left the tuneable
in place, but it is no longer used even when schedfreq is the CPUFreq
governor]
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Rename best_idle to best_idle_cpu so the same name is used like in
find_best_target().
Fix if (best_idle > 0) since best_idle_cpu = 0 is a valid target.
Use 'unsigned long' data type for best_idle_capacity.
Since we're looking for the shallowest best_idle_cstate initialize
best_idle_cstate = INT_MAX. For cpus which are not idle (idle_idx = -1)
the condition 'if (idle_idx < best_idle_cstate && ...)' is never
executed.
Change-Id: Ic5b63d58478696b3d1ec6253cf739a69a574cf99
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
(cherry picked from commit 8bff5e9c0968108d465e1f2a4624fc5ec2f00849)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Simplify backup_capacity handling and use 'unsigned long'
data type for cpu capacity, simplify target_util handling,
simplify idle_idx handling & refactor min_util, new_util.
Also return first idle cpu for prefer_idle task immediately.
Change-Id: Ic89e140f7b369f3965703fdc8463013d16e9b94a
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
The schedtune task parameter 'boosted' is mapped into the cpu iteration
order. Currently for 'boosted' equal true the iteration starts at the
last cpu (NR_CPUS-1) whereas for 'boosted' equal false it starts at the
first cpu (0).
This only has the desired effect if the cpu topology oerdering matches
the underlying assumption. This e.g. is the case for the
Qc snapdragon 821 with its [L0 L1 b0 b1] cpu topology layout
(L=lower max freq, b=higher max freq). This results in cpus with higher
maximum capacity being given the highest logical cpu ids. However not
all big.LITTLE systems enumerate their cpus in the same way. For example,
the ARM Versatile Express Juno board has 6 cpus for which the default
configuration has topology [L0 b0 b1 L1 L2 L3].
To make this approach independent from the cpu topology layout it now
iterates over the cpus in the order of the sched_groups of the EAS
sched_domain (sd_ea). The order of cpu iteration is different for the
different cpu types in case the cpu is used to dereference sd_ea.
Considering the Qc snapdragon 821 again, for cpu L0 and L1 the order is
'b0->b1->L0->L1' whereas for b0 and b1 the order is 'b0->b1->L0->L1'.
This approach does not allow the exact same iteration order as with the
currently used flat iteration over [0 .. NR_CPUS-1] but the cpus
are ordered by the original cpu capacity.
The cpu iteration is now done in the sd_ea sched_group order required by
the 'boosted' value ['L0->L1->b0->b1'/'b0->b1->L0->L1'] rather than
forward/backward over the flat cpu space ['L0->L1->b0->b1'/
'b1->b0->L1->L0'].
Change-Id: I8fbe2073dedd2ecb1c750620c6000c11a5ff4358
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
(cherry picked from commit a0c6a4272c3968c0ff50d3fed65f5865b72d777b)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
This will allow to start iterating from a cpu with max or min original
capacity in the wakeup path regardless on which cpu the scheduler is
currently running (smp_processor_id()) or the previous cpu of the task
(task_cpu(p)). This iteration has to happen on a sched_domain spanning
all cpus in the order of the sched_groups of this sched_domain seen by
the starting cpu.
In case of an SMP system the first cpu with max orig capacity and the
the one with min orig capacity is the same. This can temporally happen
on a big.LITTLE system with hotplug as well.
E.g. the different order of cpu iteration can be used to map schedtune
task parameter 'boosted' into the cpu iteration order in
find_best_target().
Use of READ_ONCE()/WRITE_ONCE() to avoid load/store tearing.
Change-Id: I812fbd9c7e5f506617e456c0eec3edcd2c016e92
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
(cherry picked from commit fd6e9543c1fd8971a5e2e68e39b2f6e591d46114)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Commit fd5c98da1a42 "WIP: sched: Store system-wide maximum cpu capacity
in root domain" was repalced by commit 8148bdfff4f5 "WIP: sched: Update
max cpu capacity in case of max frequency constraints" which didn't
remove all the now unused bits.
Change-Id: I067f6366431f43337cffa7a2a8e0de32dd33d2f9
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
(cherry picked from commit 6d284a607cec51bcafca313bc396bc3103b1e876)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
With the new wakeup approach this sysctl is not necessary any more.
Change-Id: I52114b3c918791f6a4f9f30f50002919ccbc1a9c
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
(cherry picked from commit 885c0d503bcdf0ef4e9b46822496f16b20aa3bbd)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
This patch replaces the existing EAS upstream implementation of
select_energy_cpu_brute() with the one of find_best_target() used
in Android previously.
It also removes the cpumask 'and' from select_energy_cpu_brute,
see the existing use of 'cpu = smp_processor_id()' in
select_task_rq_fair().
Change-Id: If678c002efaa87d1ba3ec9989a4e9f8df98b83ec
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
[ added guarding for non-schedtune builds ]
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
This patch re-integrates the part which was initially provided by
3b9d7554aeec ("EAS: sched/fair: tunable to honor sync wakeups") into
energy_aware_wake_cpu() into select_energy_cpu_brute().
Change-Id: I748fde3ecdeb44651179bce0a5bb8dd82d1903f6
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
(cherry picked from commit b75b7286cb068d5761621ea134c23dd131db953f)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
This has to be done in the caller function of energy_diff() version of
SchedTune to avoid Null pointer dereference in energy_diff().
Change-Id: I3f0f68dbd11efb15bbb3b1832f8294419ed85241
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
(cherry picked from commit 14531d4e245d063f713ee5ed835df958e6c7838f)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
When the systems is not overutilized, place waking tasks on the most
energy efficient cpu. Previous attempts reduced the search space by
matching task utilization to cpu capacity before consulting the energy
model as this is an expensive operation. The search heuristics didn't
work very well and lacking any better alternatives this patch takes the
brute-force route and tries all potential targets.
This approach doesn't scale, but it might be sufficient for many
embedded applications while work is continuing on a heuristic that can
minimize the necessary computations. The heuristic must be derrived from
the platform energy model rather than make additional assumptions, such
lower capacity implies better energy efficiency. PeterZ mentioned in the
past that we might be able to derrive some simpler deciding functions
using mathematical (modal?) analysis.
Change-Id: I772bacb4c8fd599f8006fa422f842e66377a9c6c
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
[rebase: on top of msm-google/android-msm-marlin-3.18]
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
(cherry picked from commit a894422dbdb7b77ea2acfe7ff909ccb5ded23514)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
It is not worth the overhead to migrate tasks for tiny insignificant
energy savings. To prevent this, an energy margin is introduced in
energy_diff() which effectively adds a dead-zone that rounds tiny energy
differences to zero. Since no scale is enforced for energy model data
the margin can't be absolute. Instead it is defined as +/-1.56% energy
saving compared to the current total estimated energy consumption.
Change-Id: I6be069c752c701fb825430896b3b768a7ab2fee4
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
[rebase: on top of msm-google/android-msm-marlin-3.18,
massage original patch which changes code in energy_diff()
into __energy_diff() introduced by SchedTune]
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
(cherry picked from commit 780cb5a5fa47adf13d4fc2b77e8e94448cd56098)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
The EAS functionality in the wakeup path will be brought back by the
following patch ("sched/fair: Energy-aware wake-up task placement")
providing the function select_energy_cpu_brute().
Change-Id: I927fb9e8261cfacfe404695f853941c7959aa146
[ Trivial merge conflicts resolved. ]
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
(cherry picked from commit 80aee424fb7765a777267e144037642625a71304)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
This lets us use Capacity-Aware Scheduling (CAS) if EAS is enabled.
Change-Id: I2e647a201ea0b733d1487c3e153047a49fb22847
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
(cherry picked from commit 00b7da2ae58bf568529e67614980f77e275b8d29)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
For asymmetric CPU capacity systems it is counter-productive for
throughput if low capacity CPUs are pulling tasks from non-overloaded
CPUs with higher capacity. The assumption is that higher CPU capacity is
preferred over running alone in a group with lower CPU capacity.
This patch rejects higher CPU capacity groups with one or less task per
CPU as potential busiest group which could otherwise lead to a series of
failing load-balancing attempts leading to a force-migration.
Change-Id: I428875bb6267c780026ef75e2882300738d016e7
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: freedom.tan@mediatek.com
Cc: keita.kobayashi.ym@renesas.com
Cc: mgalbraith@suse.de
Cc: sgurrappadi@nvidia.com
Cc: vincent.guittot@linaro.org
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1476452472-24740-5-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 9e0994c0a1c1f82c705f1f66388e1bcffcee8bb9)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
struct sched_group_capacity currently represents the compute capacity
sum of all CPUs in the sched_group.
Unless it is divided by the group_weight to get the average capacity
per CPU, it hides differences in CPU capacity for mixed capacity systems
(e.g. high RT/IRQ utilization or ARM big.LITTLE).
But even the average may not be sufficient if the group covers CPUs of
different capacities.
Instead, by extending struct sched_group_capacity to indicate min per-CPU
capacity in the group a suitable group for a given task utilization can
more easily be found such that CPUs with reduced capacity can be avoided
for tasks with high utilization (not implemented by this patch).
Change-Id: If3cae1be62d01a199e752bca5abb45357d5d0fbd
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: freedom.tan@mediatek.com
Cc: keita.kobayashi.ym@renesas.com
Cc: mgalbraith@suse.de
Cc: sgurrappadi@nvidia.com
Cc: vincent.guittot@linaro.org
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1476452472-24740-4-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit bf475ce0a3dd75b5d1df6c6c14ae25168caa15ac)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
In low-utilization scenarios comparing relative loads in
find_idlest_group() doesn't always lead to the most optimum choice.
Systems with groups containing different numbers of cpus and/or cpus of
different compute capacity are significantly better off when considering
spare capacity rather than relative load in those scenarios.
In addition to existing load based search an alternative spare capacity
based candidate sched_group is found and selected instead if sufficient
spare capacity exists. If not, existing behaviour is preserved.
Change-Id: I6097af76c302a5a12e240ca24c70f707ad118242
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: freedom.tan@mediatek.com
Cc: keita.kobayashi.ym@renesas.com
Cc: mgalbraith@suse.de
Cc: sgurrappadi@nvidia.com
Cc: vincent.guittot@linaro.org
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1476452472-24740-3-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 6a0b19c0f39a7a7b7fb77d3867a733136ff059a3)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>