2368 commits

Author SHA1 Message Date
Linux Build Service Account
1d5844ba9d Merge "sched: hmp: Optimize cycle counter reads" 2017-06-06 13:21:50 -07:00
Linux Build Service Account
6ed51e0bab Merge "sched: Don't active migrate tasks to CPUs in the same cluster" 2017-06-06 13:21:48 -07:00
Linux Build Service Account
0d1b465cb8 Merge "sched: Fix load tracking bug to avoid adding phantom task demand" 2017-06-06 13:21:39 -07:00
Chris Redpath
fce0ecf04a schedstats/eas: guard properly to avoid breaking non-smp schedstats users
Add appropriate #ifdef guards to ensure the smp-only easstats structs
are not used when smp is not enabled. Arnd got a report from buildbot,
analysed it, and pointed out exactly what the issue was.

Reported-by: "Arnd Bergmann" <arnd@arndb.de>
Suggested-by: "Arnd Bergmann" <arnd@arndb.de>
Fixes: 4b85765a3d ("sched/fair: Add eas (& cas) specific rq, sd and task stats")
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Change-Id: I60554dea20137f6774db3f59b4afd40a06554cfc
2017-06-03 15:03:03 +01:00
Chris Redpath
c47d00b57b sched/tune: don't use schedtune before it is ready
When EAS is enabled during boot, we have to be careful not to use
schedtune from fair.c before it is ready or it will warn us and we'll
get a traceback in the console.

Change-Id: I1a5cf29b18af626545c636c51219f9ed497c19fa
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2017-06-02 08:01:55 -07:00
Patrick Bellasi
9e3c04bef7 sched/fair: use SCHED_CAPACITY_SCALE for energy normalization
Change-Id: I686d26975f4a7dd830ff8441ff986e35461a7d55
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Srinath Sridharan <srinathsr@google.com>
2017-06-02 08:01:55 -07:00
Patrick Bellasi
7b8577d94c sched/{fair,tune}: use reciprocal_value to compute boost margin
Change-Id: I493b07360c46eee0b72c2a046dab9ec6cb3427ef
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Srinath Sridharan <srinathsr@google.com>
2017-06-02 08:01:55 -07:00
Srinath Sridharan
41d9288e3e sched/tune: Initialize raw_spin_lock in boosted_groups
bug: 32668852
Change-Id: Ice96230d88939d5973b1b6310085d1b3df9c47d9
Signed-off-by: Srinath Sridharan <srinathsr@google.com>
2017-06-02 08:01:55 -07:00
Patrick Bellasi
3757f95741 sched/tune: report when SchedTune has not been initialized
Change-Id: Iba4e5e3d220451f04272d555e6b8e0af83a7f09d
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Srinath Sridharan <srinathsr@google.com>
2017-06-02 08:01:55 -07:00
Chris Redpath
f9b83b3e6e sched/tune: fix sched_energy_diff tracepoint
sched_energy_diff tracepoint is in a place where it can never trace
payoff or nrg.delta. If CONFIG_SCHED_TUNE is enabled, put it in
a place where those values exist. If it is not enabled, trace from
the current location.

Change-Id: Id5442f2b34ec76625491d27c0f4285433ca12699
Reported-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2017-06-02 08:01:55 -07:00
Chris Redpath
2e829cf17f sched/tune: increase group count to 5
We use 5 groups everywhere else, so this should default to the same.

Change-Id: I05a20bdcf8046ea90a2e36979940cef11246e735
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
2017-06-02 08:01:55 -07:00
Chris Redpath
4c031f0e6f cpufreq/schedutil: use boosted_cpu_util for PELT to match WALT
When using WALT we always used boosted cpu util for OPP selection.
This is the primary purpose for boosted cpu util, but we hadn't
changed the PELT utilization check to do the same thing.

Fix that here.

Change-Id: Id5ffb26eac23b25fe754255221f6d21b8cededfd
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2017-06-02 08:01:55 -07:00
Morten Rasmussen
fc969e3bfa sched/fair: Fix sched_group_energy() to support per-cpu capacity states
sched_group_energy() was supposed to support per-cpu capacity states
(DVFS), however, while fixing a hotplug issue this was broken as we bail
out if there is no SD_SHARE_CAP_STATES flag set.

This patch implements the hotplug race check differently and should
therefore reinstate support for per-cpu capacity states.

Change-Id: I5b865666c9ce833dcfa6514c574580d75aa0a195
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
2017-06-02 08:01:55 -07:00
Valentin Schneider
fef0112a63 sched/fair: discount task contribution to find CPU with lowest utilization
In some cases, the new_util of a task can be the same on several
CPUs. This causes an issue because the target_util is only updated
if the current new_util is strictly smaller than target_util.

To fix that, the cpu_util_wake() return value is used alongside the
new_util value. If two CPUs compute the same new_util value,
we'll now also look at their cpu_util_wake() return value. In this
case, the CPU that last ran the task will be preferred.
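
The tie-break described above can be sketched as follows. This is only an
illustration: pick_target_cpu(), the tuple layout and the utilization numbers
are hypothetical, and how the two signals come to differ is not modeled.

```python
# Illustrative sketch; names and numbers are assumptions, not kernel code.

def pick_target_cpu(candidates):
    """candidates: iterable of (cpu, new_util, wake_util) tuples.

    Returns the cpu with the smallest new_util. When two CPUs have the same
    new_util, the one with the smaller wake_util (the value cpu_util_wake()
    would return for it) wins.
    """
    best_cpu = None
    best_key = None
    for cpu, new_util, wake_util in candidates:
        key = (new_util, wake_util)
        # Strict comparison on the pair: a tie on new_util alone no longer
        # leaves the first candidate in place.
        if best_key is None or key < best_key:
            best_cpu, best_key = cpu, key
    return best_cpu


# cpu1 and cpu2 tie on new_util; cpu2 has the lower wake_util.
candidates = [(0, 420, 320), (1, 300, 200), (2, 300, 150)]
print(pick_target_cpu(candidates))
```

Comparing the pair (new_util, wake_util) keeps the original ordering on
new_util and only consults the second value on an exact tie.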

Change-Id: Ia1ea2c4b3ec39621372c2f748862317d5b497723
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
2017-06-02 08:01:54 -07:00
Chris Redpath
83f462daa3 sched/fair: ensure utilization signals are synchronized before use
wake_cap performs task and cpu utilization synchronization which is
what allows us to subtract current task util from prev_cpu util and
have a sensible number to work with.

It looks as though if wake_wide returns 0, we could potentially not
execute wake_cap, which would result in unsynced signals we then use
for energy calculations.

This is not necessarily an issue we've seen in traces, but it looks
as though it should be changed.

Change-Id: Ic54a3cba2a10d946ea20113a04371dea04115e82
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2017-06-02 08:01:54 -07:00
Chris Redpath
8865f07600 sched/fair: remove task util from own cpu when placing waking task
When we place a waking task with find_best_target, we calculate the
existing and new utilisation of each candidate cpu. However, we do
not remove any blocked load resulting from the waking task on the
previous cpu which might cause unnecessary migrations.

Switch to using cpu_util_wake which does this for us, which requires
moving cpu_util_wake a few functions earlier.

Also, we have multiple potential cpu utilization signals here, so
update the necessary bits to allow WALT to work properly (including
not subtracting task util for WALT).

When WALT is in use, cpu utilization is the utilization
in the previous completed window, whilst the task utilization
ignores fully idle windows. There seems to be no way to have a
decently accurate estimate of how much (if any) utilization from
this task remains on the prev cpu.

Instead, just return cpu_util when we're using WALT.
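
A minimal sketch of the behaviour described above, assuming plain integer
utilization values. The parameter names and the clamp at zero are
assumptions made for the example, not the kernel implementation.

```python
# Illustrative sketch; parameter names and the clamp are assumptions.

def cpu_util_wake(cpu, prev_cpu, cpu_util, task_util, use_walt=False):
    """Utilization of `cpu` as seen when placing a waking task.

    With WALT, or for any cpu other than the task's previous one, the cpu
    utilization is returned unchanged. For the previous cpu under PELT, the
    task's own (blocked) contribution is removed first.
    """
    if use_walt or cpu != prev_cpu:
        return cpu_util
    return max(cpu_util - task_util, 0)


print(cpu_util_wake(cpu=1, prev_cpu=1, cpu_util=400, task_util=150))
print(cpu_util_wake(cpu=1, prev_cpu=1, cpu_util=400, task_util=150, use_walt=True))
```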

Change-Id: I448203ab98ffb5c020dfb6b218581eef1f5601f7
Reported-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2017-06-02 08:01:54 -07:00
Chris Redpath
8ac52cbaf4 trace:sched: Make util_avg in load_avg trace reflect PELT/WALT as used
With the ability to choose between WALT and PELT for utilisation tracking
we can have the situation where we're using WALT to make all the
decisions and reporting PELT figures in the sched_load_avg_(cpu|task)
trace points. This is not too much of an issue, but when analysing trace
it is nice to see numbers representing what the scheduler is using rather
than needing to add in additional sched_walt_* traces to figure it out.

Add reporting for both types, and make the util_avg member reflect what
will be seen from cpu or task_util functions in the scheduler.

Change-Id: I2abbd2c5fa70822096d0f3372b4c12b1c6af1590
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2017-06-02 08:01:54 -07:00
Dietmar Eggemann
4b85765a3d sched/fair: Add eas (& cas) specific rq, sd and task stats
The statistics counters are placed in the eas (& cas) wakeup path. Each
of them has one representation for the runqueue (rq), the sched_domain
(sd) and the task.
A task counter is always incremented. A rq counter is always
incremented for the rq the scheduler is currently running on. A sd
counter is only incremented if a relation to a sd exists.

The counters are exposed:

(1) In /proc/schedstat for rq's and sd's:

$ cat /proc/schedstat
...
cpu0 71422 0 2321254 ...
eas  44144 0 0 19446 0 24698 568435 51621 156932 133 222011 17459 120279 516814 83 0 156962 359235 176439 139981  <- runqueue for cpu0
...
domain0 3 42430 42331 ...
eas 0 0 0 14200 0 0 0 0 0 0 0 0 0 0 0 0 0 0 66355 0  <- MC sched domain for cpu0
...

The per-cpu eas vector has the following elements:

sis_attempts  sis_idle   sis_cache_affine sis_suff_cap    sis_idle_cpu    sis_count               ||
secb_attempts secb_sync  secb_idle_bt     secb_insuff_cap secb_no_nrg_sav secb_nrg_sav secb_count ||
fbt_attempts  fbt_no_cpu fbt_no_sd        fbt_pref_idle   fbt_count                               ||
cas_attempts  cas_count

The following relations exist between these counters (from cpu0 eas
vector above):

sis_attempts = sis_idle + sis_cache_affine + sis_suff_cap + sis_idle_cpu + sis_count

44144        = 0        + 0                + 19446        + 0            + 24698

secb_attempts = secb_sync + secb_idle_bt + secb_insuff_cap + secb_no_nrg_sav + secb_nrg_sav + secb_count

568435        = 51621     + 156932       + 133             + 222011          + 17459        + 120279

fbt_attempts = fbt_no_cpu + fbt_no_sd + fbt_pref_idle + fbt_count + (return -1)

516814       = 83         + 0         + 156962        + 359235    + (534)

cas_attempts = cas_count + (return -1 or smp_processor_id())

176439       = 139981    + (36458)

(2) In /proc/$PROCESS_PID/task/$TASK_PID/sched for a task.

example: main thread of system_server

$ cat /proc/1083/task/1083/sched

...
se.statistics.nr_wakeups_sis_attempts        :                  945
se.statistics.nr_wakeups_sis_idle            :                    0
se.statistics.nr_wakeups_sis_cache_affine    :                    0
se.statistics.nr_wakeups_sis_suff_cap        :                  219
se.statistics.nr_wakeups_sis_idle_cpu        :                    0
se.statistics.nr_wakeups_sis_count           :                  726
se.statistics.nr_wakeups_secb_attempts       :                10376
se.statistics.nr_wakeups_secb_sync           :                 1462
se.statistics.nr_wakeups_secb_idle_bt        :                 6984
se.statistics.nr_wakeups_secb_insuff_cap     :                    3
se.statistics.nr_wakeups_secb_no_nrg_sav     :                  927
se.statistics.nr_wakeups_secb_nrg_sav        :                  206
se.statistics.nr_wakeups_secb_count          :                  794
se.statistics.nr_wakeups_fbt_attempts        :                 8914
se.statistics.nr_wakeups_fbt_no_cpu          :                    0
se.statistics.nr_wakeups_fbt_no_sd           :                    0
se.statistics.nr_wakeups_fbt_pref_idle       :                 6987
se.statistics.nr_wakeups_fbt_count           :                 1554
se.statistics.nr_wakeups_cas_attempts        :                 3107
se.statistics.nr_wakeups_cas_count           :                 1195
...

The same relations between the counters as in the per-cpu case apply.
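
The relations above can be checked mechanically. The sketch below parses the
cpu0 eas line from the example and reports, per group, how many attempts are
not covered by the listed counters. The helper names (parse_eas_line,
unaccounted) are hypothetical; the field order is the one given above.

```python
# Illustrative sketch; helper names are assumptions, field order is from the text.

GROUPS = [
    ("sis", ["sis_attempts", "sis_idle", "sis_cache_affine", "sis_suff_cap",
             "sis_idle_cpu", "sis_count"]),
    ("secb", ["secb_attempts", "secb_sync", "secb_idle_bt", "secb_insuff_cap",
              "secb_no_nrg_sav", "secb_nrg_sav", "secb_count"]),
    ("fbt", ["fbt_attempts", "fbt_no_cpu", "fbt_no_sd", "fbt_pref_idle",
             "fbt_count"]),
    ("cas", ["cas_attempts", "cas_count"]),
]
FIELDS = [name for _, names in GROUPS for name in names]


def parse_eas_line(line):
    """Map the numbers of one 'eas ...' line onto the field names."""
    tokens = line.split()
    if not tokens or tokens[0] != "eas":
        raise ValueError("not an eas line")
    values = [int(t) for t in tokens[1:1 + len(FIELDS)]]
    if len(values) != len(FIELDS):
        raise ValueError("expected %d counters" % len(FIELDS))
    return dict(zip(FIELDS, values))


def unaccounted(stats):
    """Per group: attempts minus the sum of the remaining counters."""
    result = {}
    for group, names in GROUPS:
        attempts, rest = names[0], names[1:]
        result[group] = stats[attempts] - sum(stats[n] for n in rest)
    return result


CPU0_EAS = ("eas  44144 0 0 19446 0 24698 568435 51621 156932 133 222011 "
            "17459 120279 516814 83 0 156962 359235 176439 139981")
print(unaccounted(parse_eas_line(CPU0_EAS)))
```

For the cpu0 vector the sis and secb groups balance exactly, while fbt and
cas leave 534 and 36458 attempts, matching the bracketed terms above.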

Change-Id: Ie7d01267c78a3f41f60a3ef52917d5a5d463f195
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2017-06-02 08:01:54 -07:00
Andres Oportus
aa8882923a sched/core: Fix PELT jump to max OPP upon util increase
Change-Id: Ic80b588ec466ef707f658dcea039fd0d6b384b63
Signed-off-by: Andres Oportus <andresoportus@google.com>
2017-06-02 08:01:54 -07:00
Dietmar Eggemann
55af384815 sched: EAS & 'single cpu per cluster'/cpu hotplug interoperability
For Energy-Aware Scheduling (EAS) to work properly, even in the
case that there is only one cpu per cluster or that cpus are hot-plugged
out, the Energy Model (EM) data on all energy-aware sched domains (sd)
has to be present for all online cpus.

Mainline sd hierarchy setup code will remove sd's which are not useful
for task scheduling e.g. in the following situations:

1. Only 1 cpu is/remains in one cluster of a multi cluster system.

   This remaining cpu only has DIE and no MC sd.

2. A complete cluster in a two cluster system is hot-plugged out.

   The cpus of the remaining cluster only have MC and no DIE sd.

To make sure that all online cpus keep all their energy-aware sd's,
the sd degenerate functionality has been changed to not free a sd if
its first sched group (sg) contains EM data in case:

1. There is only 1 cpu left in the sd.

2. There have to be at least 2 sg's if certain sd flags are set.

Instead of freeing such a sd it now clears only its SD_LOAD_BALANCE
flag. This will make sure that the EAS functionality will always see
all energy-aware sd's for all online cpus.

It will introduce a tiny performance degradation for operations on
affected cpus since the hot-path macro for_each_domain() has to deal
with sd's not contributing to task scheduling at all now.

In most cases the existing code makes sure that task scheduling is not
invoked on a sd with !SD_LOAD_BALANCE.

However, a small change is necessary in update_sd_lb_stats() to make
sure that sd->parent is only initialized to !NULL in case the parent sd
contains more than 1 sg.

The handling of newidle decay values before the SD_LOAD_BALANCE check in
rebalance_domains() stays unchanged.

Test (w/ CONFIG_SCHED_DEBUG):

JUNO r0 default system:

$ cat /proc/cpuinfo | grep "^CPU part"
CPU part        : 0xd03
CPU part        : 0xd07
CPU part        : 0xd07
CPU part        : 0xd03
CPU part        : 0xd03
CPU part        : 0xd03

SD names and flags:

$ cat /proc/sys/kernel/sched_domain/cpu*/domain*/name
MC
DIE
MC
DIE
MC
DIE
MC
DIE
MC
DIE
MC
DIE

$ printf "%x\n" `cat /proc/sys/kernel/sched_domain/cpu*/domain*/flags`
832f
102f
832f
102f
832f
102f
832f
102f
832f
102f
832f
102f

Test 1: Hotplug-out one A57 (CPU part 0xd07) cpu:

$ echo 0 > /sys/devices/system/cpu/cpu1/online

$ cat /proc/cpuinfo | grep "^CPU part"
CPU part        : 0xd03
CPU part        : 0xd07
CPU part        : 0xd03
CPU part        : 0xd03
CPU part        : 0xd03

SD names and flags for remaining A57 (cpu2) cpu:

$ cat /proc/sys/kernel/sched_domain/cpu2/domain*/name
MC
DIE

$ printf "%x\n" `cat /proc/sys/kernel/sched_domain/cpu2/domain*/flags`
832e <-- MC SD with !SD_LOAD_BALANCE
102f

Test 2: Hotplug-out the entire A57 cluster:

$ echo 0 > /sys/devices/system/cpu/cpu1/online
$ echo 0 > /sys/devices/system/cpu/cpu2/online

$ cat /proc/cpuinfo | grep "^CPU part"
CPU part        : 0xd03
CPU part        : 0xd03
CPU part        : 0xd03
CPU part        : 0xd03

SD names and flags for the remaining A53 (CPU part 0xd03) cluster:

$ cat /proc/sys/kernel/sched_domain/cpu*/domain*/name
MC
DIE
MC
DIE
MC
DIE
MC
DIE

$ printf "%x\n" `cat /proc/sys/kernel/sched_domain/cpu*/domain*/flags`
832f
102e <-- DIE SD with !SD_LOAD_BALANCE
832f
102e
832f
102e
832f
102e
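
The flag values printed in the tests can be decoded with a short script. The
sketch assumes that SD_LOAD_BALANCE is bit 0 (0x0001) of the sched domain
flags, which is consistent with 832f/832e and 102f/102e differing only in
that bit; the helper name is hypothetical.

```python
# Illustrative sketch; SD_LOAD_BALANCE = 0x0001 is an assumption about the
# flag layout of this kernel version.

SD_LOAD_BALANCE = 0x0001


def load_balance_enabled(flags_hex):
    """Return True if the hex flags string has SD_LOAD_BALANCE set."""
    return bool(int(flags_hex, 16) & SD_LOAD_BALANCE)


for flags in ("832f", "832e", "102f", "102e"):
    state = "SD_LOAD_BALANCE" if load_balance_enabled(flags) else "!SD_LOAD_BALANCE"
    print(flags, state)
```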

Change-Id: If24aa2b2628f334abbf0207d39e2a86168d9d673
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
2017-06-02 08:01:54 -07:00
Vincent Guittot
e62a1ca36b UPSTREAM: sched/core: Fix group_entity's share update
The update of the share of a cfs_rq is done when its load_avg is updated
but before the group_entity's load_avg has been updated for the past time
slot. This generates wrong load_avg accounting which can be significant
when small tasks are involved in the scheduling.

Let's take the example of a task a that is dequeued from its task group A:
   root
  (cfs_rq)
    \
    (se)
     A
    (cfs_rq)
      \
      (se)
       a

Task "a" was the only task in task group A which becomes idle when a is
dequeued.

We have the sequence:

- dequeue_entity a->se
    - update_load_avg(a->se)
    - dequeue_entity_load_avg(A->cfs_rq, a->se)
    - update_cfs_shares(A->cfs_rq)
        A->cfs_rq->load.weight == 0
        A->se->load.weight is updated with the new share (0 in this case)
- dequeue_entity A->se
    - update_load_avg(A->se) but its weight is now null so the last time
      slot (up to a tick) will be accounted with a weight of 0 instead of
      its real weight during the time slot. The last time slot will be
      accounted as an idle one whereas it was a running one.

If the running time of task a is short enough that no tick happens when it
runs, all running time of group entity A->se will be accounted as idle
time.

Instead, we should update the share of a cfs_rq (in fact the weight of its
group entity) only after having updated the load_avg of the group_entity.

update_cfs_shares() now takes the sched_entity as a parameter instead of the
cfs_rq, and the weight of the group_entity is updated only once its load_avg
has been synced with current time.
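
The effect of the ordering can be shown with a toy model. It is not PELT:
there is no decay, and load is just weight multiplied by elapsed time. The
class and function names are hypothetical and only mirror the two call
orders described above.

```python
# Toy model; no PELT decay, all names are assumptions made for illustration.

class GroupEntity:
    def __init__(self, weight):
        self.weight = weight
        self.last_update = 0
        self.load_sum = 0

    def update_load_avg(self, now):
        # Account the time since the last update with the current weight.
        self.load_sum += self.weight * (now - self.last_update)
        self.last_update = now


def dequeue_old_order(ge, now, new_weight):
    ge.weight = new_weight       # share updated first
    ge.update_load_avg(now)      # last slot accounted with the new weight


def dequeue_new_order(ge, now, new_weight):
    ge.update_load_avg(now)      # last slot accounted with the real weight
    ge.weight = new_weight       # share updated afterwards


old = GroupEntity(weight=1024)
new = GroupEntity(weight=1024)
dequeue_old_order(old, now=3, new_weight=0)
dequeue_new_order(new, now=3, new_weight=0)
print(old.load_sum, new.load_sum)
```

In the old order the three time units in which the group was running are
accounted with weight 0; in the new order they are accounted with 1024.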

Change-Id: Id6ce3be1767b44b444ce2a77ed1ba063e57c0664
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: pjt@google.com
Link: http://lkml.kernel.org/r/1482335426-7664-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 89ee048f3cc796db6f26906c6bef4edf0bee70fd)
[minor cherry pick stuff]
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2017-06-02 08:01:54 -07:00
Peter Zijlstra
baaa21b59b UPSTREAM: sched/fair: Fix calc_cfs_shares() fixed point arithmetics width confusion
Commit:

  fde7d22e01 ("sched/fair: Fix overly small weight for interactive group entities")

did something non-obvious, and its implementation contained a latent bug.

The problem was exposed for real by a later commit in the v4.7 merge window:

  2159197d6677 ("sched/core: Enable increased load resolution on 64-bit kernels")

... after which tg->load_avg and cfs_rq->load.weight had different
units (10 bit fixed point and 20 bit fixed point resp.).

Add a comment to explain the use of cfs_rq->load.weight over the
'natural' cfs_rq->avg.load_avg and add scale_load_down() to correct
for the difference in unit.

Since this is (now, as per a previous commit) the only user of
calc_tg_weight(), collapse it.

The effects of this bug should be randomly inconsistent SMP-balancing
of cgroups workloads.
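
A small numeric sketch of the unit mismatch, assuming a 10 bit shift between
the two resolutions. calc_shares() here is a simplified stand-in, and the
task group shares are kept in 10 bit units purely to keep the numbers small.

```python
# Illustrative sketch; the simplified calc_shares() and the numbers are
# assumptions, not the kernel implementation.

SCHED_FIXEDPOINT_SHIFT = 10


def scale_load(w):
    return w << SCHED_FIXEDPOINT_SHIFT


def scale_load_down(w):
    return w >> SCHED_FIXEDPOINT_SHIFT


def calc_shares(tg_shares, tg_load_avg, tg_load_avg_contrib, load):
    """Share of this cfs_rq: tg_shares * load / total group weight."""
    tg_weight = tg_load_avg - tg_load_avg_contrib + load
    return tg_shares * load // tg_weight if tg_weight else tg_shares


# Two CPUs, each contributing 1024 (10 bit units) to tg->load_avg.
tg_load_avg = 2048
contrib = 1024
load_weight = scale_load(1024)   # cfs_rq->load.weight in 20 bit units

mixed = calc_shares(1024, tg_load_avg, contrib, load_weight)
fixed = calc_shares(1024, tg_load_avg, contrib, scale_load_down(load_weight))
print(mixed, fixed)
```

With mixed units the local cfs_rq appears to own nearly the whole group
weight; after scale_load_down() it gets the expected half.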

Change-Id: If1e565662ea163485edd94a12aef644d0e0dfe7a
Reported-by: Jirka Hladky <jhladky@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 2159197d6677 ("sched/core: Enable increased load resolution on 64-bit kernels")
Fixes: fde7d22e01 ("sched/fair: Fix overly small weight for interactive group entities")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit ea1dc6fc6242f991656e35e2ed3d90ec1cd13418)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2017-06-02 08:01:54 -07:00
Vincent Guittot
20bbd92679 UPSTREAM: sched/fair: Fix incorrect task group ->load_avg
A scheduler performance regression has been reported by Joseph Salisbury,
which he bisected back to:

  3d30544f0212 ("sched/fair: Apply more PELT fixes")

The regression triggers when several levels of task groups are involved
(read: SystemD) and cpu_possible_mask != cpu_present_mask.

The root cause is that group entity's load (tg_child->se[i]->avg.load_avg)
is initialized to scale_load_down(se->load.weight). During the creation of
a child task group, its group entities on possible CPUs are attached to
parent's cfs_rq (tg_parent) and their loads are added to the parent's load
(tg_parent->load_avg) with update_tg_load_avg().

But only the load on online CPUs will then be updated to reflect real load,
whereas load on other CPUs will stay at the initial value.

The result is a tg_parent->load_avg that is higher than the real load, the
weight of group entities (tg_parent->se[i]->load.weight) on online CPUs is
smaller than it should be, and the task group gets less running time than
it could expect.

( This situation can be detected with /proc/sched_debug. The ".tg_load_avg"
  of the task group will be much higher than sum of ".tg_load_avg_contrib"
  of online cfs_rqs of the task group. )

The load of group entities does not have to be initialized to anything other
than 0 because their load will increase when an entity is attached.
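
The inflation can be illustrated with a toy calculation, assuming 8 possible
CPUs of which 4 are online, a real load of 100 per online CPU and a task
group share of 1024. All names and numbers are made up for the example.

```python
# Toy calculation; names and numbers are assumptions made for illustration.

def tg_load_avg(initial_load, possible_cpus, online_cpus, real_load):
    """Sum of per-cpu contributions: only online CPUs reflect the real load,
    the others keep the value they were initialized with."""
    return sum(real_load if cpu in online_cpus else initial_load
               for cpu in possible_cpus)


def share_on_cpu(tg_shares, load, total):
    return tg_shares * load // total


possible = range(8)
online = {0, 1, 2, 3}

inflated = tg_load_avg(1024, possible, online, real_load=100)
correct = tg_load_avg(0, possible, online, real_load=100)
print(inflated, share_on_cpu(1024, 100, inflated))
print(correct, share_on_cpu(1024, 100, correct))
```

With a non-zero initial value the offline CPUs keep contributing to the
total, so each online group entity ends up with a much smaller weight.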

Change-Id: Ie55021ff98ba49016adfddb2444e9c9709939226
Reported-by: Joseph Salisbury <joseph.salisbury@canonical.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: <stable@vger.kernel.org> # 4.8.x
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: joonwoop@codeaurora.org
Fixes: 3d30544f0212 ("sched/fair: Apply more PELT fixes")
Link: http://lkml.kernel.org/r/1476881123-10159-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit b5a9b340789b2b24c6896bcf7a065c31a4db671c)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2017-06-02 08:01:54 -07:00
Peter Zijlstra
640c909c34 UPSTREAM: sched/fair: Fix effective_load() to consistently use smoothed load
Starting with the following commit:

  fde7d22e01 ("sched/fair: Fix overly small weight for interactive group entities")

calc_tg_weight() doesn't compute the right value as expected by effective_load().

The difference is in the 'correction' term. In order to ensure \Sum
rw_j >= rw_i we cannot use tg->load_avg directly, since that might be
lagging a correction on the current cfs_rq->avg.load_avg value.
Therefore we use tg->load_avg - cfs_rq->tg_load_avg_contrib +
cfs_rq->avg.load_avg.

Now, per the referenced commit, calc_tg_weight() doesn't use
cfs_rq->avg.load_avg, as is later used in @w, but uses
cfs_rq->load.weight instead.

So stop using calc_tg_weight() and do it explicitly.

The effects of this bug are wake_affine() making randomly
poor choices in cgroup-intense workloads.
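
The correction term can be written out as a one-line helper. The function
name and the numbers are hypothetical; the formula is the one given above.

```python
# Illustrative sketch; the function name and numbers are assumptions.

def group_weight_total(tg_load_avg, tg_load_avg_contrib, cfs_rq_load_avg):
    """tg->load_avg with this cfs_rq's stale contribution replaced by its
    current load_avg, so the total is never below the local load (as long
    as the stale contribution does not exceed tg->load_avg)."""
    return tg_load_avg - tg_load_avg_contrib + cfs_rq_load_avg


# tg->load_avg lags: it still contains a contribution of 100 from this
# cfs_rq, whose current load_avg is already 900.
stale_total = 600
corrected = group_weight_total(stale_total, 100, 900)
print(stale_total, corrected)
```

Using the stale total directly would give a local load (900) larger than the
group total (600); the corrected total of 1400 keeps the ratio below one.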

Change-Id: I1c0058ff674650cf295c8dc3b88a5a3de4bddab0
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@vger.kernel.org> # v4.3+
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: fde7d22e01 ("sched/fair: Fix overly small weight for interactive group entities")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 7dd4912594daf769a46744848b05bd5bc6d62469)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2017-06-02 08:01:54 -07:00
Vincent Guittot
89e4d18a67 UPSTREAM: sched/fair: Propagate asynchrous detach
A task can be asynchronously detached from cfs_rq when migrating
between CPUs. The load of the migrated task is then removed from
source cfs_rq during its next update. We use this event to set the
propagation flag.

During the load balance, we take advantage of the update of blocked
load to propagate any pending changes.

The propagation relies on patch:

  "sched: Fix hierarchical order in rq->leaf_cfs_rq_list"

... which orders children and parents, to ensure that it's done in one pass.

Change-Id: I33782e35fc4711f5901e8c23d6aa7ec5f2ff7ee5
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: kernellwp@gmail.com
Cc: pjt@google.com
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1478598827-32372-6-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 4e5160766fcc9f41bbd38bac11f92dce993644aa)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2017-06-02 08:01:54 -07:00
Vincent Guittot
e875665411 UPSTREAM: sched/fair: Propagate load during synchronous attach/detach
When a task moves from/to a cfs_rq, we set a flag which is then used to
propagate the change at parent level (sched_entity and cfs_rq) during
next update. If the cfs_rq is throttled, the flag will stay pending until
the cfs_rq is unthrottled.

For propagating the utilization, we copy the utilization of group cfs_rq to
the sched_entity.

For propagating the load, we have to take into account the load of the
whole task group in order to evaluate the load of the sched_entity.
Similarly to what was done before the rewrite of PELT, we add a correction
factor in case the task group's load is greater than its share so it will
contribute the same load of a task of equal weight.
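
One possible reading of the correction factor, as a sketch only. The function
name, the integer arithmetic and the choice of inputs are assumptions; the
patch itself may compute this differently.

```python
# Sketch of one interpretation; not taken from the patch.

def propagated_group_load(gcfs_rq_load, tg_load_avg, tg_load_avg_contrib,
                          tg_shares):
    """Load that the group's sched_entity takes over from its group cfs_rq.

    If the task group's total load exceeds its shares, the load is scaled
    down by shares / total so that the group weighs like a task of equal
    weight.
    """
    load = gcfs_rq_load
    if load:
        tg_load = tg_load_avg - tg_load_avg_contrib + load
        if tg_load > tg_shares:
            load = load * tg_shares // tg_load
    return load


print(propagated_group_load(3000, 3000, 3000, 1024))
print(propagated_group_load(500, 500, 500, 1024))
```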

Change-Id: Id34a9888484716961c9027299c0b4d82881a39d1
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: kernellwp@gmail.com
Cc: pjt@google.com
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1478598827-32372-5-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 09a43ace1f986b003c118fdf6ddf1fd685692d49)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2017-06-02 08:01:54 -07:00
Vincent Guittot
8370e07d82 UPSTREAM: sched/fair: Fix hierarchical order in rq->leaf_cfs_rq_list
Fix the insertion of cfs_rq in rq->leaf_cfs_rq_list to ensure that a
child will always be called before its parent.

The hierarchical order in shares update list has been introduced by
commit:

  67e86250f8 ("sched: Introduce hierarchal order on shares update list")

With the current implementation a child can be still put after its
parent.

Let's take the example of:

       root
        \
         b
         /\
         c d*
           |
           e*

with root -> b -> c already enqueued but not d -> e so the
leaf_cfs_rq_list looks like: head -> c -> b -> root -> tail

The branch d -> e will be added the first time that they are enqueued,
starting with e then d.

When e is added, its parent is not yet on the list, so e is put at
the tail: head -> c -> b -> root -> e -> tail

Then, d is added at the head because its parent is already on the
list: head -> d -> c -> b -> root -> e -> tail

e is not placed at the right position and will be called last,
whereas it should be called at the beginning.

Because we follow the bottom-up enqueue sequence, we are sure to
finish by adding either a cfs_rq without a parent or a cfs_rq whose
parent is already on the list. We can use this event to detect
when we have finished adding a new branch. For the others, whose
parents have not yet been added, we have to ensure that they are
added after their children, which were inserted in the preceding
steps, and after any potential parents that are already in the list.
The easiest way is to put the cfs_rq just after the last inserted one
and to keep track of it until the branch is fully added.
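
The insertion rule can be sketched on the example tree. This is a model with
a Python list, not the kernel's RCU list code. The class and attribute names
are hypothetical, and placing a child directly before a parent that is
already on the list is an assumption of this sketch.

```python
# Illustrative model; names and the "insert before parent" choice are
# assumptions, not the kernel implementation.

class LeafList:
    def __init__(self, parent_of):
        self.parent_of = parent_of   # child name -> parent name (None = root)
        self.order = []              # head ... tail
        self.last_inserted = None    # tracks the branch being added

    def add(self, name):
        if name in self.order:
            return
        parent = self.parent_of[name]
        if parent is None:
            # No parent: put it at the tail, the branch is complete.
            self.order.append(name)
            self.last_inserted = None
        elif parent in self.order:
            # Parent already listed: go just before it, branch complete.
            self.order.insert(self.order.index(parent), name)
            self.last_inserted = None
        else:
            # Parent not listed yet: go right after the last inserted node
            # of this branch (or at the head when starting a new branch).
            pos = 0
            if self.last_inserted is not None:
                pos = self.order.index(self.last_inserted) + 1
            self.order.insert(pos, name)
            self.last_inserted = name


parent_of = {"root": None, "b": "root", "c": "b", "d": "b", "e": "d"}
leaf = LeafList(parent_of)
for name in ("c", "b", "root"):   # root -> b -> c is enqueued bottom-up
    leaf.add(name)
before = list(leaf.order)
for name in ("e", "d"):           # then the branch d -> e, bottom-up
    leaf.add(name)
print(before, leaf.order)
```

In this model every cfs_rq ends up ahead of its parent, including e, which
the old scheme left behind root.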

Change-Id: I4fe0b8502ea628c13d14e8e5c5279bce67fb8845
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: kernellwp@gmail.com
Cc: pjt@google.com
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1478598827-32372-3-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 9c2791f936ef5fd04a118b5c284f2c9a95f4a647)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2017-06-02 08:01:54 -07:00
Vincent Guittot
723dab7871 BACKPORT: sched/fair: Factorize PELT update
Every time we modify load/utilization of sched_entity, we start to
sync it with its cfs_rq. This update is done in different ways:

 - when attaching/detaching a sched_entity, we update cfs_rq and then
   we sync the entity with the cfs_rq.

 - when enqueueing/dequeuing the sched_entity, we update both
   sched_entity and cfs_rq metrics to now.

Use update_load_avg() every time we have to update and sync cfs_rq and
sched_entity before changing the state of a sched_entity.

Change-Id: Ibde9a7e07ac80e9d5753bb4a0c30dfb3643cc666
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: kernellwp@gmail.com
Cc: pjt@google.com
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1478598827-32372-4-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
[backported FROMLIST]
Signed-off-by: Andres Oportus <andresoportus@google.com>
(cherry picked from commit d31b1a66cbe0931733583ad9d9e8c6cfd710907d)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2017-06-02 08:01:53 -07:00
Vincent Guittot
18d09a45ec UPSTREAM: sched/fair: Factorize attach/detach entity
Factorize post_init_entity_util_avg() and part of attach_task_cfs_rq()
in one function attach_entity_cfs_rq().

Create symmetric detach_entity_cfs_rq() function.

Change-Id: I44fc6bb5e71460be65f6b8928d4620c6c27a6a67
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: kernellwp@gmail.com
Cc: pjt@google.com
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1478598827-32372-2-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit df217913e72ec7e603d8b68cc4c70646cf7000db)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2017-06-02 08:01:53 -07:00
Peter Zijlstra
f9bef52c85 UPSTREAM: sched/fair: Improve PELT stuff some more
Vincent noted that the update_tg_load_avg() usage in commit:

  3d30544f0212 ("sched/fair: Apply more PELT fixes")

isn't entirely sufficient. We need to call this function every time
cfs_rq->avg.load changes, this includes when update_cfs_rq_load_avg()
returns true, but {attach,detach}_entity_load_avg() themselves also
change it. This means we need to unconditionally call
update_tg_load_avg().

Also, add more comments.

Change-Id: I7e55fceb587601f73c760c8b0d47a7ef2b777b9e
Reported-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 7c3edd2c300b7ef2005a69dc727692ee07434aa5)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2017-06-02 08:01:53 -07:00
Peter Zijlstra
dc1386b6f7 UPSTREAM: sched/fair: Apply more PELT fixes
One additional 'rule' for using update_cfs_rq_load_avg() is that one
should call update_tg_load_avg() if it returns true.

Add a bunch of comments to hopefully clarify some of the rules:

 o  You need to update cfs_rq _before_ any entity attach/detach,
    this is important, because while for mathematical consistency this
    isn't strictly needed, it is required for the physical
    interpretation of the model, you attach/detach _now_.

 o  When you modify the cfs_rq avg, you have to then call
    update_tg_load_avg() in order to propagate changes upwards.

 o  (Fair) entities are always attached, switched_{to,from}_fair()
    deal with !fair. This directly follows from the definition of the
    cfs_rq averages, namely that they are a direct sum of all
    (runnable or blocked) entities on that rq.

It is the second rule that this patch enforces, but it adds comments
pertaining to all of them.

Change-Id: Icdc906e98c67b84cb9582c893bc761a9886be57a
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 3d30544f02120b884bba2a9466c87dba980e3be5)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2017-06-02 08:01:53 -07:00
Peter Zijlstra
3fd734a8f9 UPSTREAM: sched/fair: Fix post_init_entity_util_avg() serialization
Chris Wilson reported a divide by 0 at:

 post_init_entity_util_avg():

 >    725	if (cfs_rq->avg.util_avg != 0) {
 >    726		sa->util_avg  = cfs_rq->avg.util_avg * se->load.weight;
 > -> 727		sa->util_avg /= (cfs_rq->avg.load_avg + 1);
 >    728
 >    729		if (sa->util_avg > cap)
 >    730			sa->util_avg = cap;
 >    731	} else {

Which, given the lack of serialization and the code generated from
update_cfs_rq_load_avg(), is entirely possible:

	if (atomic_long_read(&cfs_rq->removed_load_avg)) {
		s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
		sa->load_avg = max_t(long, sa->load_avg - r, 0);
		sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);
		removed_load = 1;
	}

turns into:

  ffffffff81087064:       49 8b 85 98 00 00 00    mov    0x98(%r13),%rax
  ffffffff8108706b:       48 85 c0                test   %rax,%rax
  ffffffff8108706e:       74 40                   je     ffffffff810870b0
  ffffffff81087070:       4c 89 f8                mov    %r15,%rax
  ffffffff81087073:       49 87 85 98 00 00 00    xchg   %rax,0x98(%r13)
  ffffffff8108707a:       49 29 45 70             sub    %rax,0x70(%r13)
  ffffffff8108707e:       4c 89 f9                mov    %r15,%rcx
  ffffffff81087081:       bb 01 00 00 00          mov    $0x1,%ebx
  ffffffff81087086:       49 83 7d 70 00          cmpq   $0x0,0x70(%r13)
  ffffffff8108708b:       49 0f 49 4d 70          cmovns 0x70(%r13),%rcx

Which you'll note ends up with 'sa->load_avg - r' in memory at
ffffffff8108707a.

By calling post_init_entity_util_avg() under rq->lock we're sure to be
fully serialized against PELT updates and cannot observe intermediate
state like this.
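
The window can be illustrated with a small userspace model. This is only a sketch, not kernel code: the helper names are hypothetical, and it merely replays the two steps visible in the disassembly (store the raw difference, clamp afterwards) to show that an unserialized reader can see a divisor of zero.

```c
#include <assert.h>

/*
 * Userspace model of the stores the compiler emitted for
 *   sa->load_avg = max_t(long, sa->load_avg - r, 0);
 * All names below are illustrative.
 */

/* Step 1 ("sub %rax,0x70(%r13)"): the raw difference lands in memory. */
static long load_avg_after_sub(long load_avg, long removed)
{
	assert(removed >= 0);	/* removed load is never negative */
	return load_avg - removed;
}

/* Step 2 (cmpq/cmovns): the value is clamped at zero afterwards. */
static long load_avg_after_clamp(long stored)
{
	return stored < 0 ? 0 : stored;
}

/* Divisor used by post_init_entity_util_avg() for a given load_avg. */
static long util_divisor(long load_avg)
{
	return load_avg + 1;
}
```

With load_avg = 5 and a removed load of 6, the value in memory after step 1 is -1, so a reader that samples it before step 2 divides by zero; after the clamp the divisor is 1.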

Change-Id: I56c11886102b7859df82e26c88b1b7c200a39f6e
Reported-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yuyang Du <yuyang.du@intel.com>
Cc: bsegall@google.com
Cc: morten.rasmussen@arm.com
Cc: pjt@google.com
Cc: steve.muckle@linaro.org
Fixes: 2b8c41daba32 ("sched/fair: Initiate a new task's util avg to a bounded value")
Link: http://lkml.kernel.org/r/20160609130750.GQ30909@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit b7fa30c9cc48c4f55663420472505d3b4f6e1705)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2017-06-02 08:01:53 -07:00
Yuyang Du
9de438d27c BACKPORT: sched/fair: Initiate a new task's util avg to a bounded value
A new task's util_avg is set to full utilization of a CPU (100% time
running). This accelerates a new task's utilization ramp-up, useful to
boost its execution in early time. However, it may result in
(insanely) high utilization for a transient time period when a flood
of tasks are spawned. Importantly, it violates the "fundamentally
bounded" CPU utilization, and its side effect is negative if we don't
take any measure to bound it.

This patch proposes an algorithm to address this issue. It has
two methods to approach a sensible initial util_avg:

(1) An expected (or average) util_avg based on its cfs_rq's util_avg:

  util_avg = cfs_rq->util_avg / (cfs_rq->load_avg + 1) * se.load.weight

(2) A trajectory of how successive new tasks' util develops, which
gives 1/2 of the left utilization budget to a new task such that
the additional util is noticeably large (when overall util is low) or
unnoticeably small (when overall util is high enough). In the meantime,
the aggregate utilization is well bounded:

  util_avg_cap = (1024 - cfs_rq->avg.util_avg) / 2^n

where n denotes the nth task.

If util_avg is larger than util_avg_cap, then the effective util is
clamped to the util_avg_cap.
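
A minimal userspace sketch of the two methods combined, assuming the runqueue values are passed in as plain integers. The function and parameter names are hypothetical, and returning 0 when no budget is left is an assumption of this sketch rather than something stated above.

```c
#include <assert.h>

#define SCHED_CAPACITY_SCALE 1024L

/*
 * cfs_util and cfs_load stand in for the cfs_rq's util_avg and
 * load_avg, weight for se.load.weight. Not kernel code.
 */
static long init_util_avg(long cfs_util, long cfs_load, long weight)
{
	long cap, util;

	assert(cfs_util >= 0 && cfs_load >= 0 && weight >= 0);

	/* Method (2): the new task gets half of the remaining budget. */
	cap = (SCHED_CAPACITY_SCALE - cfs_util) / 2;
	if (cap <= 0)
		return 0;	/* assumption: no budget left means no boost */

	/* Empty runqueue: nothing to average over, so take the cap. */
	if (cfs_util == 0)
		return cap;

	/* Method (1): expected util; multiply first to limit rounding loss. */
	util = cfs_util * weight / (cfs_load + 1);

	return util > cap ? cap : util;
}
```

Applied to successive tasks on an initially idle runqueue, the first task gets 512 and the second is clamped to 256, which matches the 1/2^n trajectory.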

Change-Id: Idafe989b24d9e70911666f09800bf1d5a011e1f4
Reported-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Yuyang Du <yuyang.du@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: morten.rasmussen@arm.com
Cc: pjt@google.com
Cc: steve.muckle@linaro.org
Link: http://lkml.kernel.org/r/1459283456-21682-1-git-send-email-yuyang.du@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 2b8c41daba327c633228169e8bd8ec067ab443f8)
[integrate with schedfreq - schedfreq has a tuneable for init task util
 but this commit removes the use of the tuneable since we have a new
 algorithm for calculating an initial utilisation. I've left the tuneable
 in place, but it is no longer used even when schedfreq is the CPUFreq
 governor]
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2017-06-02 08:01:53 -07:00
Dietmar Eggemann
4e18c8a10d sched/fair: Simplify idle_idx handling in select_idle_sibling()
Rename best_idle to best_idle_cpu so that the same name is used as in
find_best_target().

Fix if (best_idle > 0) since best_idle_cpu = 0 is a valid target.

Use 'unsigned long' data type for best_idle_capacity.

Since we're looking for the shallowest best_idle_cstate initialize
best_idle_cstate = INT_MAX. For cpus which are not idle (idle_idx = -1)
the condition 'if (idle_idx < best_idle_cstate && ...)' is never
executed.
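
The selection rule can be sketched in userspace as follows. This is a simplified illustration with hypothetical names: it ignores the capacity comparison and skips busy cpus explicitly, and it only shows why the result must be tested with '>= 0' and why INT_MAX is a safe starting value.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/*
 * idle_idx[cpu] is -1 for a busy cpu, otherwise the idle-state index
 * (smaller means shallower). Returns -1 if no cpu is idle.
 */
static int shallowest_idle_cpu(const int *idle_idx, size_t nr_cpus)
{
	int best_idle_cpu = -1;
	int best_idle_cstate = INT_MAX;

	assert(idle_idx != NULL);

	for (size_t cpu = 0; cpu < nr_cpus; cpu++) {
		if (idle_idx[cpu] < 0)
			continue;	/* not idle, never a candidate */
		if (idle_idx[cpu] < best_idle_cstate) {
			best_idle_cstate = idle_idx[cpu];
			best_idle_cpu = (int)cpu;
		}
	}
	return best_idle_cpu;
}
```

A caller that tested the result with '> 0' would wrongly discard cpu 0 even when it is the only idle cpu.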

Change-Id: Ic5b63d58478696b3d1ec6253cf739a69a574cf99
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
(cherry picked from commit 8bff5e9c0968108d465e1f2a4624fc5ec2f00849)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2017-06-02 08:01:53 -07:00
Dietmar Eggemann
b31ae71ef7 sched/fair: refactor find_best_target() for simplicity
Simplify backup_capacity handling and use 'unsigned long'
data type for cpu capacity, simplify target_util handling,
simplify idle_idx handling & refactor min_util, new_util.

Also return first idle cpu for prefer_idle task immediately.

Change-Id: Ic89e140f7b369f3965703fdc8463013d16e9b94a
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2017-06-02 08:01:53 -07:00
Dietmar Eggemann
d3f5e8c3e9 sched/fair: Change cpu iteration order in find_best_target()
The schedtune task parameter 'boosted' is mapped into the cpu iteration
order. Currently for 'boosted' equal true the iteration starts at the
last cpu (NR_CPUS-1) whereas for 'boosted' equal false it starts at the
first cpu (0).

This only has the desired effect if the cpu topology ordering matches
the underlying assumption. This e.g. is the case for the
Qualcomm Snapdragon 821 with its [L0 L1 b0 b1] cpu topology layout
(L=lower max freq, b=higher max freq). This results in cpus with higher
maximum capacity being given the highest logical cpu ids. However not
all big.LITTLE systems enumerate their cpus in the same way. For example,
the ARM Versatile Express Juno board has 6 cpus for which the default
configuration has topology [L0 b0 b1 L1 L2 L3].

To make this approach independent from the cpu topology layout it now
iterates over the cpus in the order of the sched_groups of the EAS
sched_domain (sd_ea). The order of cpu iteration is different for the
different cpu types in case the cpu is used to dereference sd_ea.

Considering the Qualcomm Snapdragon 821 again, for cpu L0 and L1 the order is
'L0->L1->b0->b1' whereas for b0 and b1 the order is 'b0->b1->L0->L1'.

This approach does not allow the exact same iteration order as with the
currently used flat iteration over [0 .. NR_CPUS-1] but the cpus
are ordered by the original cpu capacity.

The cpu iteration is now done in the sd_ea sched_group order required by
the 'boosted' value ['L0->L1->b0->b1'/'b0->b1->L0->L1'] rather than
forward/backward over the flat cpu space ['L0->L1->b0->b1'/
'b1->b0->L1->L0'].
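
The difference from the flat walk can be sketched with a small userspace model. The group table, sizes and function name are hypothetical and hard-code the [L0 L1 b0 b1] layout; the real code walks the sched_groups of sd_ea.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_GROUPS	2
#define CPUS_PER_GROUP	2

/* Hypothetical layout: groups sorted by ascending original capacity. */
static const int group_cpus[NR_GROUPS][CPUS_PER_GROUP] = {
	{ 0, 1 },	/* L0, L1 */
	{ 2, 3 },	/* b0, b1 */
};

/*
 * Fill 'order' with cpu ids in visiting order: start at the
 * max-capacity group for boosted tasks, at the min-capacity group
 * otherwise, then wrap around the group list.
 * Returns the number of cpus written.
 */
static size_t cpu_iteration_order(bool boosted, int *order)
{
	size_t start = boosted ? NR_GROUPS - 1 : 0;
	size_t n = 0;

	assert(order != NULL);

	for (size_t i = 0; i < NR_GROUPS; i++) {
		size_t g = (start + i) % NR_GROUPS;

		for (size_t c = 0; c < CPUS_PER_GROUP; c++)
			order[n++] = group_cpus[g][c];
	}
	return n;
}
```

Within a group the cpus are always visited in ascending order, which is why the boosted case yields b0->b1->L0->L1 and not the reversed flat order b1->b0->L1->L0.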

Change-Id: I8fbe2073dedd2ecb1c750620c6000c11a5ff4358
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
(cherry picked from commit a0c6a4272c3968c0ff50d3fed65f5865b72d777b)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2017-06-02 08:01:53 -07:00
Dietmar Eggemann
633b98b651 sched/core: Add first cpu w/ max/min orig capacity to root domain
This allows iteration to start from a cpu with max or min original
capacity in the wakeup path regardless of which cpu the scheduler is
currently running (smp_processor_id()) or the previous cpu of the task
(task_cpu(p)). This iteration has to happen on a sched_domain spanning
all cpus in the order of the sched_groups of this sched_domain seen by
the starting cpu.

In case of an SMP system the first cpu with max orig capacity and
the one with min orig capacity are the same. This can temporarily happen
on a big.LITTLE system with hotplug as well.

E.g. the different order of cpu iteration can be used to map schedtune
task parameter 'boosted' into the cpu iteration order in
find_best_target().

Use of READ_ONCE()/WRITE_ONCE() to avoid load/store tearing.

Change-Id: I812fbd9c7e5f506617e456c0eec3edcd2c016e92
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
(cherry picked from commit fd6e9543c1fd8971a5e2e68e39b2f6e591d46114)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2017-06-02 08:01:53 -07:00
Dietmar Eggemann
3e44a647c0 sched/core: Remove remnants of commit fd5c98da1a42
Commit fd5c98da1a42 "WIP: sched: Store system-wide maximum cpu capacity
in root domain" was replaced by commit 8148bdfff4f5 "WIP: sched: Update
max cpu capacity in case of max frequency constraints" which didn't
remove all the now unused bits.

Change-Id: I067f6366431f43337cffa7a2a8e0de32dd33d2f9
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
(cherry picked from commit 6d284a607cec51bcafca313bc396bc3103b1e876)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2017-06-02 08:01:53 -07:00
Dietmar Eggemann
242695407a sched: Remove sysctl_sched_is_big_little
With the new wakeup approach this sysctl is not necessary any more.

Change-Id: I52114b3c918791f6a4f9f30f50002919ccbc1a9c
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
(cherry picked from commit 885c0d503bcdf0ef4e9b46822496f16b20aa3bbd)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2017-06-02 08:01:53 -07:00
Dietmar Eggemann
9e92e8a24f sched/fair: Code !is_big_little path into select_energy_cpu_brute()
This patch replaces the existing EAS upstream implementation of
select_energy_cpu_brute() with the one of find_best_target() used
in Android previously.

It also removes the cpumask 'and' from select_energy_cpu_brute,
see the existing use of 'cpu = smp_processor_id()' in
select_task_rq_fair().

Change-Id: If678c002efaa87d1ba3ec9989a4e9f8df98b83ec
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
[ added guarding for non-schedtune builds ]
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2017-06-02 08:01:53 -07:00
Dietmar Eggemann
f6f9314893 EAS: sched/fair: Re-integrate 'honor sync wakeups' into wakeup path
This patch re-integrates into select_energy_cpu_brute() the part which
3b9d7554aeec ("EAS: sched/fair: tunable to honor sync wakeups") initially
added to energy_aware_wake_cpu().

Change-Id: I748fde3ecdeb44651179bce0a5bb8dd82d1903f6
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
(cherry picked from commit b75b7286cb068d5761621ea134c23dd131db953f)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2017-06-02 08:01:53 -07:00
Dietmar Eggemann
81bd5ed393 Fixup!: sched/fair.c: Set SchedTune specific struct energy_env.task
This has to be done in the caller of the SchedTune version of
energy_diff() to avoid a NULL pointer dereference in energy_diff().

Change-Id: I3f0f68dbd11efb15bbb3b1832f8294419ed85241
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
(cherry picked from commit 14531d4e245d063f713ee5ed835df958e6c7838f)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2017-06-02 08:01:52 -07:00
Morten Rasmussen
3935105f57 sched/fair: Energy-aware wake-up task placement
When the system is not overutilized, place waking tasks on the most
energy efficient cpu. Previous attempts reduced the search space by
matching task utilization to cpu capacity before consulting the energy
model as this is an expensive operation. The search heuristics didn't
work very well and lacking any better alternatives this patch takes the
brute-force route and tries all potential targets.

This approach doesn't scale, but it might be sufficient for many
embedded applications while work is continuing on a heuristic that can
minimize the necessary computations. The heuristic must be derived from
the platform energy model rather than make additional assumptions, such
as lower capacity implying better energy efficiency. PeterZ mentioned in
the past that we might be able to derive some simpler deciding functions
using mathematical (modal?) analysis.

Change-Id: I772bacb4c8fd599f8006fa422f842e66377a9c6c
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
[rebase: on top of msm-google/android-msm-marlin-3.18]
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
(cherry picked from commit a894422dbdb7b77ea2acfe7ff909ccb5ded23514)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2017-06-02 08:01:52 -07:00
Morten Rasmussen
02cbde61f4 sched/fair: Add energy_diff dead-zone margin
It is not worth the overhead to migrate tasks for tiny insignificant
energy savings. To prevent this, an energy margin is introduced in
energy_diff() which effectively adds a dead-zone that rounds tiny energy
differences to zero. Since no scale is enforced for energy model data
the margin can't be absolute. Instead it is defined as +/-1.56% energy
saving compared to the current total estimated energy consumption.
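
A userspace sketch of such a relative dead zone. The function and parameter names are hypothetical, and taking the margin as a right shift by 6 is an assumption based on 1/64 = 1.5625%.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Round small energy differences to zero. The margin scales with the
 * current energy estimate, so no absolute unit is needed.
 */
static long energy_diff_dead_zone(long nrg_before, long nrg_after)
{
	long margin, diff;

	assert(nrg_before >= 0 && nrg_after >= 0);

	margin = nrg_before >> 6;	/* ~1.56% of the current estimate */
	diff = nrg_after - nrg_before;

	return labs(diff) < margin ? 0 : diff;
}
```

With an estimate of 6400 the margin is 100, so a saving of 50 is reported as 0 while a saving of exactly 100 is passed through unchanged.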

Change-Id: I6be069c752c701fb825430896b3b768a7ab2fee4
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
[rebase: on top of msm-google/android-msm-marlin-3.18,
         massage original patch which changes code in energy_diff()
	 into __energy_diff() introduced by SchedTune]
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
(cherry picked from commit 780cb5a5fa47adf13d4fc2b77e8e94448cd56098)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2017-06-02 08:01:52 -07:00
Dietmar Eggemann
3b6ba235bc sched/fair: Decommission energy_aware_wake_cpu()
The EAS functionality in the wakeup path will be brought back by the
following patch ("sched/fair: Energy-aware wake-up task placement")
providing the function select_energy_cpu_brute().

Change-Id: I927fb9e8261cfacfe404695f853941c7959aa146
[ Trivial merge conflicts resolved. ]
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
(cherry picked from commit 80aee424fb7765a777267e144037642625a71304)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2017-06-02 08:01:52 -07:00
Dietmar Eggemann
168228463c sched/fair: Do not force want_affine eq. true if EAS is enabled
This lets us use Capacity-Aware Scheduling (CAS) if EAS is enabled.

Change-Id: I2e647a201ea0b733d1487c3e153047a49fb22847
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
(cherry picked from commit 00b7da2ae58bf568529e67614980f77e275b8d29)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2017-06-02 08:01:52 -07:00
Morten Rasmussen
c6cc7ca915 UPSTREAM: sched/fair: Fix incorrect comment for capacity_margin
The comment for capacity_margin introduced in:

  3273163c6775 ("sched/fair: Let asymmetric CPU configurations balance at wake-up")

... got its usage the wrong way round - fix it.

Change-Id: Ie46eac3e5ff43397b5bed61d0999d2817f1a1d96
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: freedom.tan@mediatek.com
Cc: keita.kobayashi.ym@renesas.com
Cc: mgalbraith@suse.de
Cc: sgurrappadi@nvidia.com
Cc: vincent.guittot@linaro.org
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1476452472-24740-7-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 893c5d2279041afeb593f1fa8edd9d02edf5b7cb)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2017-06-02 08:01:52 -07:00
Morten Rasmussen
adc7f08b2f UPSTREAM: sched/fair: Avoid pulling tasks from non-overloaded higher capacity groups
For asymmetric CPU capacity systems it is counter-productive for
throughput if low capacity CPUs are pulling tasks from non-overloaded
CPUs with higher capacity. The assumption is that higher CPU capacity is
preferred over running alone in a group with lower CPU capacity.

This patch rejects higher CPU capacity groups with one or fewer tasks per
CPU as potential busiest group which could otherwise lead to a series of
failing load-balancing attempts leading to a force-migration.

Change-Id: I428875bb6267c780026ef75e2882300738d016e7
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: freedom.tan@mediatek.com
Cc: keita.kobayashi.ym@renesas.com
Cc: mgalbraith@suse.de
Cc: sgurrappadi@nvidia.com
Cc: vincent.guittot@linaro.org
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1476452472-24740-5-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 9e0994c0a1c1f82c705f1f66388e1bcffcee8bb9)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2017-06-02 08:01:52 -07:00
Morten Rasmussen
60cc9f4e1e UPSTREAM: sched/fair: Add per-CPU min capacity to sched_group_capacity
struct sched_group_capacity currently represents the compute capacity
sum of all CPUs in the sched_group.

Unless it is divided by the group_weight to get the average capacity
per CPU, it hides differences in CPU capacity for mixed capacity systems
(e.g. high RT/IRQ utilization or ARM big.LITTLE).

But even the average may not be sufficient if the group covers CPUs of
different capacities.

Instead, by extending struct sched_group_capacity to indicate min per-CPU
capacity in the group a suitable group for a given task utilization can
more easily be found such that CPUs with reduced capacity can be avoided
for tasks with high utilization (not implemented by this patch).
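
To see why neither the sum nor the average is enough, consider this userspace sketch. The struct and function are illustrative stand-ins, not the kernel's struct sched_group_capacity, and the capacity values used in the note below are made up.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-in for the two values discussed above. */
struct group_capacity {
	unsigned long capacity;		/* sum over all cpus in the group */
	unsigned long min_capacity;	/* smallest per-cpu capacity */
};

static struct group_capacity
summarize_group(const unsigned long *cpu_capacity, size_t nr_cpus)
{
	struct group_capacity gc = { 0, 0 };

	assert(cpu_capacity != NULL && nr_cpus > 0);

	gc.min_capacity = cpu_capacity[0];
	for (size_t i = 0; i < nr_cpus; i++) {
		gc.capacity += cpu_capacity[i];
		if (cpu_capacity[i] < gc.min_capacity)
			gc.min_capacity = cpu_capacity[i];
	}
	return gc;
}
```

For a group with per-cpu capacities {1024, 1024, 430, 430} the sum is 2908 and the average is 727, which suggests that a task with utilization 600 fits anywhere in the group; the minimum of 430 shows that it does not fit on two of the cpus.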

Change-Id: If3cae1be62d01a199e752bca5abb45357d5d0fbd
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: freedom.tan@mediatek.com
Cc: keita.kobayashi.ym@renesas.com
Cc: mgalbraith@suse.de
Cc: sgurrappadi@nvidia.com
Cc: vincent.guittot@linaro.org
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1476452472-24740-4-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit bf475ce0a3dd75b5d1df6c6c14ae25168caa15ac)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2017-06-02 08:01:52 -07:00
Morten Rasmussen
f3f132b8e5 UPSTREAM: sched/fair: Consider spare capacity in find_idlest_group()
In low-utilization scenarios comparing relative loads in
find_idlest_group() doesn't always lead to the optimal choice.
Systems with groups containing different numbers of cpus and/or cpus of
different compute capacity are significantly better off when considering
spare capacity rather than relative load in those scenarios.

In addition to existing load based search an alternative spare capacity
based candidate sched_group is found and selected instead if sufficient
spare capacity exists. If not, existing behaviour is preserved.

Change-Id: I6097af76c302a5a12e240ca24c70f707ad118242
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: freedom.tan@mediatek.com
Cc: keita.kobayashi.ym@renesas.com
Cc: mgalbraith@suse.de
Cc: sgurrappadi@nvidia.com
Cc: vincent.guittot@linaro.org
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1476452472-24740-3-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 6a0b19c0f39a7a7b7fb77d3867a733136ff059a3)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2017-06-02 08:01:52 -07:00