Commit graph

22227 commits

Author SHA1 Message Date
Peter Zijlstra
c5174678e2 perf/core: Fix perf_event_open() vs. execve() race
commit 79c9ce57eb2d5f1497546a3946b4ae21b6fdc438 upstream.

Jann reported that the ptrace_may_access() check in
find_lively_task_by_vpid() is racy against exec().

Specifically:

  perf_event_open()		execve()

  ptrace_may_access()
				commit_creds()
  ...				if (get_dumpable() != SUID_DUMP_USER)
				  perf_event_exit_task();
  perf_install_in_context()

would result in installing a counter across the creds boundary.

Fix this by wrapping lots of perf_event_open() in cred_guard_mutex.
This should be fine as perf_event_exit_task() is already called with
cred_guard_mutex held, so all perf locks already nest inside it.

Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: He Kuang <hekuang@huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-06-01 12:15:47 -07:00
Jeff Vander Stoep
9d5f5d9346 FROMLIST: security,perf: Allow further restriction of perf_event_open
When kernel.perf_event_open is set to 3 (or greater), disallow all
access to performance events by users without CAP_SYS_ADMIN.
Add a Kconfig symbol CONFIG_SECURITY_PERF_EVENTS_RESTRICT that
makes this value the default.

This is based on a similar feature in grsecurity
(CONFIG_GRKERNSEC_PERF_HARDEN).  This version doesn't include making
the variable read-only.  It also allows enabling further restriction
at run-time regardless of whether the default is changed.

https://lkml.org/lkml/2016/1/11/587

Signed-off-by: Ben Hutchings <ben@decadent.org.uk>

Bug: 29054680
Change-Id: Iff5bff4fc1042e85866df9faa01bce8d04335ab8
2016-05-31 22:22:16 -07:00
Wanpeng Li
cf73d8ad76 workqueue: fix rebind bound workers warning
commit f7c17d26f43d5cc1b7a6b896cd2fa24a079739b9 upstream.

------------[ cut here ]------------
WARNING: CPU: 0 PID: 16 at kernel/workqueue.c:4559 rebind_workers+0x1c0/0x1d0
Modules linked in:
CPU: 0 PID: 16 Comm: cpuhp/0 Not tainted 4.6.0-rc4+ #31
Hardware name: IBM IBM System x3550 M4 Server -[7914IUW]-/00Y8603, BIOS -[D7E128FUS-1.40]- 07/23/2013
 0000000000000000 ffff881037babb58 ffffffff8139d885 0000000000000010
 0000000000000000 0000000000000000 0000000000000000 ffff881037babba8
 ffffffff8108505d ffff881037ba0000 000011cf3e7d6e60 0000000000000046
Call Trace:
 dump_stack+0x89/0xd4
 __warn+0xfd/0x120
 warn_slowpath_null+0x1d/0x20
 rebind_workers+0x1c0/0x1d0
 workqueue_cpu_up_callback+0xf5/0x1d0
 notifier_call_chain+0x64/0x90
 ? trace_hardirqs_on_caller+0xf2/0x220
 ? notify_prepare+0x80/0x80
 __raw_notifier_call_chain+0xe/0x10
 __cpu_notify+0x35/0x50
 notify_down_prepare+0x5e/0x80
 ? notify_prepare+0x80/0x80
 cpuhp_invoke_callback+0x73/0x330
 ? __schedule+0x33e/0x8a0
 cpuhp_down_callbacks+0x51/0xc0
 cpuhp_thread_fun+0xc1/0xf0
 smpboot_thread_fn+0x159/0x2a0
 ? smpboot_create_threads+0x80/0x80
 kthread+0xef/0x110
 ? wait_for_completion+0xf0/0x120
 ? schedule_tail+0x35/0xf0
 ret_from_fork+0x22/0x50
 ? __init_kthread_worker+0x70/0x70
---[ end trace eb12ae47d2382d8f ]---
notify_down_prepare: attempt to take down CPU 0 failed

This bug can be reproduced by below config w/ nohz_full= all cpus:

CONFIG_BOOTPARAM_HOTPLUG_CPU0=y
CONFIG_DEBUG_HOTPLUG_CPU0=y
CONFIG_NO_HZ_FULL=y

As Thomas pointed out:

| If a down prepare callback fails, then DOWN_FAILED is invoked for all
| callbacks which have successfully executed DOWN_PREPARE.
|
| But, workqueue has actually two notifiers. One which handles
| UP/DOWN_FAILED/ONLINE and one which handles DOWN_PREPARE.
|
| Now look at the priorities of those callbacks:
|
| CPU_PRI_WORKQUEUE_UP        = 5
| CPU_PRI_WORKQUEUE_DOWN      = -5
|
| So the call order on DOWN_PREPARE is:
|
| CB 1
| CB ...
| CB workqueue_up() -> Ignores DOWN_PREPARE
| CB ...
| CB X ---> Fails
|
| So we call up to CB X with DOWN_FAILED
|
| CB 1
| CB ...
| CB workqueue_up() -> Handles DOWN_FAILED
| CB ...
| CB X-1
|
| So the problem is that the workqueue stuff handles DOWN_FAILED in the up
| callback, while it should do it in the down callback. Which is not a good idea
| either because it wants to be called early on rollback...
|
| Brilliant stuff, isn't it? The hotplug rework will solve this problem because
| the callbacks become symetric, but for the existing mess, we need some
| workaround in the workqueue code.

The boot CPU handles housekeeping duty(unbound timers, workqueues,
timekeeping, ...) on behalf of full dynticks CPUs. It must remain
online when nohz full is enabled. There is a priority set to every
notifier_blocks:

workqueue_cpu_up > tick_nohz_cpu_down > workqueue_cpu_down

So tick_nohz_cpu_down callback failed when down prepare cpu 0, and
notifier_blocks behind tick_nohz_cpu_down will not be called any
more, which leads to workers are actually not unbound. Then hotplug
state machine will fallback to undo and online cpu 0 again. Workers
will be rebound unconditionally even if they are not unbound and
trigger the warning in this progress.

This patch fix it by catching !DISASSOCIATED to avoid rebind bound
workers.

Cc: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Frédéric Weisbecker <fweisbec@gmail.com>
Suggested-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-05-18 17:06:50 -07:00
Alexander Shishkin
e54c41be42 perf/core: Disable the event on a truncated AUX record
commit 9f448cd3cbcec8995935e60b27802ae56aac8cc0 upstream.

When the PMU driver reports a truncated AUX record, it effectively means
that there is no more usable room in the event's AUX buffer (even though
there may still be some room, so that perf_aux_output_begin() doesn't take
action). At this point the consumer still has to be woken up and the event
has to be disabled, otherwise the event will just keep spinning between
perf_aux_output_begin() and perf_aux_output_end() until its context gets
unscheduled.

Again, for cpu-wide events this means never, so once in this condition,
they will be forever losing data.

Fix this by disabling the event and waking up the consumer in case of a
truncated AUX record.

Reported-by: Markus Metzger <markus.t.metzger@intel.com>
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: vince@deater.net
Link: http://lkml.kernel.org/r/1462886313-13660-3-git-send-email-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-05-18 17:06:48 -07:00
Alexei Starovoitov
bb10156f57 bpf: fix check_map_func_compatibility logic
[ Upstream commit 6aff67c85c9e5a4bc99e5211c1bac547936626ca ]

The commit 35578d7984 ("bpf: Implement function bpf_perf_event_read() that get the selected hardware PMU conuter")
introduced clever way to check bpf_helper<->map_type compatibility.
Later on commit a43eec3042 ("bpf: introduce bpf_perf_event_output() helper") adjusted
the logic and inadvertently broke it.
Get rid of the clever bool compare and go back to two-way check
from map and from helper perspective.

Fixes: a43eec3042 ("bpf: introduce bpf_perf_event_output() helper")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-05-18 17:06:38 -07:00
Alexei Starovoitov
3899251bdb bpf: fix refcnt overflow
[ Upstream commit 92117d8443bc5afacc8d5ba82e541946310f106e ]

On a system with >32Gbyte of phyiscal memory and infinite RLIMIT_MEMLOCK,
the malicious application may overflow 32-bit bpf program refcnt.
It's also possible to overflow map refcnt on 1Tb system.
Impose 32k hard limit which means that the same bpf program or
map cannot be shared by more than 32k processes.

Fixes: 1be7f75d16 ("bpf: enable non-root eBPF programs")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-05-18 17:06:37 -07:00
Jann Horn
608d2c3c7a bpf: fix double-fdput in replace_map_fd_with_map_ptr()
[ Upstream commit 8358b02bf67d3a5d8a825070e1aa73f25fb2e4c7 ]

When bpf(BPF_PROG_LOAD, ...) was invoked with a BPF program whose bytecode
references a non-map file descriptor as a map file descriptor, the error
handling code called fdput() twice instead of once (in __bpf_map_get() and
in replace_map_fd_with_map_ptr()). If the file descriptor table of the
current task is shared, this causes f_count to be decremented too much,
allowing the struct file to be freed while it is still in use
(use-after-free). This can be exploited to gain root privileges by an
unprivileged user.

This bug was introduced in
commit 0246e64d9a ("bpf: handle pseudo BPF_LD_IMM64 insn"), but is only
exploitable since
commit 1be7f75d16 ("bpf: enable non-root eBPF programs") because
previously, CAP_SYS_ADMIN was required to reach the vulnerable code.

(posted publicly according to request by maintainer)

Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-05-18 17:06:37 -07:00
Alexei Starovoitov
8427d5547d bpf/verifier: reject invalid LD_ABS | BPF_DW instruction
[ Upstream commit d82bccc69041a51f7b7b9b4a36db0772f4cdba21 ]

verifier must check for reserved size bits in instruction opcode and
reject BPF_LD | BPF_ABS | BPF_DW and BPF_LD | BPF_IND | BPF_DW instructions,
otherwise interpreter will WARN_RATELIMIT on them during execution.

Fixes: ddd872bc30 ("bpf: verifier: add checks for BPF_ABS | BPF_IND instructions")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-05-18 17:06:35 -07:00
Chunyu Hu
8d2923930b tracing: Don't display trigger file for events that can't be enabled
commit 854145e0a8e9a05f7366d240e2f99d9c1ca6d6dd upstream.

Currently register functions for events will be called
through the 'reg' field of event class directly without
any check when seting up triggers.

Triggers for events that don't support register through
debug fs (events under events/ftrace are for trace-cmd to
read event format, and most of them don't have a register
function except events/ftrace/functionx) can't be enabled
at all, and an oops will be hit when setting up trigger
for those events, so just not creating them is an easy way
to avoid the oops.

Link: http://lkml.kernel.org/r/1462275274-3911-1-git-send-email-chuhu@redhat.com

Fixes: 85f2b08268 ("tracing: Add basic event trigger framework")
Signed-off-by: Chunyu Hu <chuhu@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-05-11 11:21:13 +02:00
Patrick Bellasi
fed95d9ef1 DEBUG: schedtune: add tracepoint for schedtune_tasks_update() values
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
2016-05-10 16:54:43 +08:00
Patrick Bellasi
a09a25c5df DEBUG: schedtune: add tracepoint for CPU boost signal
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
2016-05-10 16:54:42 +08:00
Patrick Bellasi
c8a65d2e9a DEBUG: schedtune: add tracepoint for SchedTune configuration update
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
2016-05-10 16:54:42 +08:00
Dietmar Eggemann
0a0f4aa12f DEBUG: sched: add energy procfs interface
This patch makes the energy data available via procfs. The related files
are placed as sub-directory named 'energy' inside the
/proc/sys/kernel/sched_domain/cpuX/domainY/groupZ directory for those
cpu/domain/group tuples which have energy information.

The following example depicts the contents of
/proc/sys/kernel/sched_domain/cpu0/domain0/group[01] for a system which
has energy information attached to domain level 0.

├── cpu0
│   ├── domain0
│   │   ├── busy_factor
│   │   ├── busy_idx
│   │   ├── cache_nice_tries
│   │   ├── flags
│   │   ├── forkexec_idx
│   │   ├── group0
│   │   │   └── energy
│   │   │       ├── cap_states
│   │   │       ├── idle_states
│   │   │       ├── nr_cap_states
│   │   │       └── nr_idle_states
│   │   ├── group1
│   │   │   └── energy
│   │   │       ├── cap_states
│   │   │       ├── idle_states
│   │   │       ├── nr_cap_states
│   │   │       └── nr_idle_states
│   │   ├── idle_idx
│   │   ├── imbalance_pct
│   │   ├── max_interval
│   │   ├── max_newidle_lb_cost
│   │   ├── min_interval
│   │   ├── name
│   │   ├── newidle_idx
│   │   └── wake_idx
│   └── domain1
│       ├── busy_factor
│       ├── busy_idx
│       ├── cache_nice_tries
│       ├── flags
│       ├── forkexec_idx
│       ├── idle_idx
│       ├── imbalance_pct
│       ├── max_interval
│       ├── max_newidle_lb_cost
│       ├── min_interval
│       ├── name
│       ├── newidle_idx
│       └── wake_idx

The files 'nr_idle_states' and 'nr_cap_states' contain a scalar value
whereas 'idle_states' and 'cap_states' contain a vector of power
consumption at this idle state respectively (compute capacity, power
consumption) at this capacity state.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
2016-05-10 16:54:42 +08:00
Juri Lelli
dd2460f387 DEBUG: sched,cpufreq: add cpu_capacity change tracepoint
This is useful when we want to compare cpu utilization and
cpu curr capacity side by side.

Signed-off-by: Juri Lelli <juri.lelli@arm.com>
2016-05-10 16:54:42 +08:00
Juri Lelli
e5a25996d3 DEBUG: sched: add tracepoint for CPU load/util signals
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
2016-05-10 16:53:25 +08:00
Juri Lelli
8017fd7418 DEBUG: sched: add tracepoint for task load/util signals
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
2016-05-10 16:53:25 +08:00
Juri Lelli
99ed4e57cb DEBUG: sched: add tracepoint for cpu/freq scale invariance
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
2016-05-10 16:53:25 +08:00
Patrick Bellasi
a2a6dc7508 sched/fair: filter energy_diff() based on energy_payoff value
Once the SchedTune support is enabled and the CPU bandwidth demand of a
task is boosted, we could expect increased energy consumptions which are
balanced by corresponding increases of tasks performance.
However, the current implementation of the energy_diff() function
accepts all and _only_ the schedule candidates which results into a
reduced expected system energy, which works against the boosting
strategy.

This patch links the energy_diff() function with the "energy payoff"
engine provided by SchedTune. The energy variation computed by the
energy_diff() function is now filtered using the SchedTune support to
evaluated the energy payoff for a boosted task.

With that patch, the energy_diff() function is going to reported as
"acceptable schedule candidate" only the schedule candidate which
corresponds to a positive energy_payoff.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
2016-05-10 16:53:25 +08:00
Patrick Bellasi
637ee3715a sched/tune: add support to compute normalized energy
The current EAS implementation considers only energy variations, while it
disregards completely the impact on performance for the selection of
a certain schedule candidate. Moreover, it also makes its decision based
on the "absolute" value of expected energy variations.

In order to properly define a trade-off strategy between increased energy
consumption and performances benefits it is required to compare energy
variations with performance variations.

Thus, both performance and energy metrics must be expressed in comparable
units. While the performance variations are expressed in terms of capacity
deltas, which are defined in the range [0..SCHED_LOAD_SCALE], the same
scale is not used for energy variations.

This patch introduces the function:
  schedtune_normalize_energy(energy_diff)
which returns a normalized value in the same range of capacity variations,
i.e. [0..SCHED_LOAD_SCALE].

A proper set of energy normalization constants are required to provide
a fast division by a constant during the normalziation of the energy_diff.
The value of these constants depends on the specific energy model and
topology of a target device.
Thus, this patch provides also the required support for the computation
at boot time of this set of variables.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
2016-05-10 16:53:25 +08:00
Patrick Bellasi
36967b2412 sched/fair: keep track of energy/capacity variations
The current EAS implementation does not allow "to boost" tasks
performances, for example by running them at an higher OPP (or a more
capable CPU), even if that could require a "reasonable" increase in
energy consumption.  To defined how much reasonable is an energy
increase with respect to a required boost value, it is required to
define and compute a trade-off between the expected energy and
performance variations.
However, the current EAS implementation considers only energy variations
while completely disregard the impact on performance for the selection
of a certain schedule candidate.

This patch extends the eenv energy environment to keep track of both
energy and performance deltas which are implied by the activation of a
schedule candidate.
The performance variation is estimated considering the different
capacities of the CPUs in which the task could be scheduled. The idea is
that while running on a CPU with higher capacity (e.g. higher operating
point) the task could (potentially) complete faster and thus get better
performance.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
2016-05-10 16:53:25 +08:00
Patrick Bellasi
a8f65584fc sched/fair: add boosted task utilization
The task utilization signal, which is derived from PELT signals and
properly scaled to be architecture and frequency invariant, is used by
EAS as an estimation of the task requirements in terms of CPU bandwidth.

When the energy aware scheduler is in use, this signal affects the CPU
selection. Thus, a convenient way to bias that decision, which is also
little intrusive, is to boost the task utilization signal each time it
is required to support them.

This patch introduces the new function:
  boosted_task_util(task)
which returns a boosted value for the utilization of the specified task.
The margin added to the original utilization is:
  1. computed based on the "boosting strategy" in use
  2. proportional to boost value defined either by the sysctl interface,
     when global boosting is in use, or the "taskgroup" value, when
     per-task boosting is enabled.

The boosted signal is used by EAS
  a. transparently, via its integration into the task_fits() function
  b. explicitly, in the energy-aware wakeup path

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
2016-05-10 16:53:25 +08:00
Patrick Bellasi
a515b887ec sched/{fair,tune}: track RUNNABLE tasks impact on per CPU boost value
When per-task boosting is enabled, every time a task enters/exits a CPU
its boost value could impact the currently selected OPP for that CPU.
Thus, the "aggregated" boost value for that CPU potentially needs to
be updated to match the current maximum boost value among all the tasks
currently RUNNABLE on that CPU.

This patch introduces the required support to keep track of which boost
groups are impacting a CPU. Each time a task is enqueued/dequeued to/from
a CPU its boost group is used to increment a per-cpu counter of RUNNABLE
tasks on that CPU.
Only when the number of runnable tasks for a specific boost group
becomes 1 or 0 the corresponding boost group changes its effects on
that CPU, specifically:
  a) boost_group::tasks == 1: this boost group starts to impact the CPU
  b) boost_group::tasks == 0: this boost group stops to impact the CPU
In each of these two conditions the aggregation function:
  sched_cpu_update(cpu)
could be required to run in order to identify the new maximum boost
value required for the CPU.

The proposed patch minimizes the number of times the aggregation
function is executed while still providing the required support to
always boost a CPU to the maximum boost value required by all its
currently RUNNABLE tasks.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
2016-05-10 16:53:24 +08:00
Patrick Bellasi
9cd53fbeb2 sched/tune: compute and keep track of per CPU boost value
When per task boosting is enabled, we could have multiple RUNNABLE tasks
which are concurrently scheduled on the same CPU but each one with a
different boost value.
For example, we could have a scenarios like this:

  Task   SchedTune CGroup   Boost Value
    T1               root            0
    T2       low-priority           10
    T3        interactive           90

In these conditions we expect a CPU to be configured according to a
proper "aggregation" of the required boost values for all the tasks
currently scheduled on this CPU.

A suitable aggregation function is the one which tracks the MAX boost
value for all the tasks RUNNABLE on a CPU. This approach allows to
always satisfy the most boost demanding task while at the same time:
 a) boosting all the concurrently scheduled tasks thus reducing
    potential co-scheduling side-effects on demanding tasks
 b) reduce the number of frequency switch requested towards SchedDVFS,
    thus being more friendly to architectures with slow frequency
    switching times

Every time a task enters/exits the RQ of a CPU the max boost value
should be updated considering all the boost groups currently "affecting"
that CPU, i.e. which have at least one RUNNABLE task currently allocated
on that CPU.

This patch introduces the required support to keep track of the boost
groups currently affecting CPUs. Thanks to the limited number of boost
groups, a small and memory efficient per-cpu array of boost groups
values (cpu_boost_groups) is used which is updated for each CPU entry by
schedtune_boostgroup_update() but only when a schedtune CGroup boost
value is updated. However, this is expected to be a rare operation,
perhaps done just one time at system boot time.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
2016-05-10 16:53:24 +08:00
Patrick Bellasi
13001f47c9 sched/tune: add initial support for CGroups based boosting
To support task performance boosting, the usage of a single knob has the
advantage to be a simple solution, both from the implementation and the
usability standpoint.  However, on a real system it can be difficult to
identify a single value for the knob which fits the needs of multiple
different tasks. For example, some kernel threads and/or user-space
background services should be better managed the "standard" way while we
still want to be able to boost the performance of specific workloads.

In order to improve the flexibility of the task boosting mechanism this
patch is the first of a small series which extends the previous
implementation to introduce a "per task group" support.
This first patch introduces just the basic CGroups support, a new
"schedtune" CGroups controller is added which allows to configure
different boost value for different groups of tasks.
To keep the implementation simple but still effective for a boosting
strategy, the new controller:
  1. allows only a two layer hierarchy
  2. supports only a limited number of boost groups

A two layer hierarchy allows to place each task either:
  a) in the root control group
     thus being subject to a system-wide boosting value
  b) in a child of the root group
     thus being subject to the specific boost value defined by that
     "boost group"

The limited number of "boost groups" supported is mainly motivated by
the observation that in a real system it could be useful to have only
few classes of tasks which deserve different treatment.
For example, background vs foreground or interactive vs low-priority.
As an additional benefit, a limited number of boost groups allows also
to have a simpler implementation especially for the code required to
compute the boost value for CPUs which have runnable tasks belonging to
different boost groups.

cc: Tejun Heo <tj@kernel.org>
cc: Li Zefan <lizefan@huawei.com>
cc: Johannes Weiner <hannes@cmpxchg.org>
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
2016-05-10 16:53:24 +08:00
Patrick Bellasi
07e22949de sched/fair: add boosted CPU usage
The CPU usage signal is used by the scheduler as an estimation of the
overall bandwidth currently allocated on a CPU. When SchedDVFS is in
use, this signal affects the selection of the operating points (OPP)
required to accommodate all the workload allocated in a CPU.
A convenient way to boost the performance of tasks running on a CPU,
which is also little intrusive, is to boost the CPU usage signal each
time it is used to select an OPP.

This patch introduces a new function:
  get_boosted_cpu_usage(cpu)
to return a boosted value for the usage of a specified CPU.
The margin added to the original usage is:
  1. computed based on the "boosting strategy" in use
  2. proportional to the system-wide boost value defined by provided
     user-space interface

The boosted signal is used by SchedDVFS (transparently) each time it
requires to get an estimation of the capacity required for a CPU.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
2016-05-10 16:53:24 +08:00
Patrick Bellasi
6ed2714501 sched/fair: add function to convert boost value into "margin"
The basic idea of the boost knob is to "artificially inflate" a signal
to make a task or logical CPU appears more demanding than it actually
is. Independently from the specific signal, a consistent and possibly
simple semantic for the concept of "signal boosting" must define:
1. how we translate the boost percentage into a "margin" value to be added
   to the original signal to inflate
2. what is the meaning of a boost value from a user-space perspective

This patch provides the implementation of a possible boost semantic,
named "Signal Proportional Compensation" (SPC), where the boost
percentage (BP) is used to compute a margin (M) which is proportional to
the complement of the original signal (OS):
  M = BP * (SCHED_LOAD_SCALE - OS)
The computed margin then added to the OS to obtain the Boosted Signal (BS)
  BS = OS + M

The proposed boost semantic has these main features:
- each signal gets a boost which is proportional to its delta with respect
  to the maximum available capacity in the system (i.e. SCHED_LOAD_SCALE)
- a 100% boosting has a clear understanding from a user-space perspective,
  since it means simply to run (possibly) "all" tasks at the max OPP
- each boosting value means to improve the task performance by a quantity
  which is proportional to the maximum achievable performance on that
  system
Thus this semantics is somehow forcing a behaviour which is:

  50% boosting means to run at half-way between the current and the
    maximum performance which a task could achieve on that system

This patch provides the code to implement a fast integer division to
convert a boost percentage (BP) value into a margin (M).

NOTE: this code is suitable for all signals operating in range
      [0..SCHED_LOAD_SCALE]

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
2016-05-10 16:53:24 +08:00
Patrick Bellasi
344f4ec429 sched/tune: add sysctl interface to define a boost value
The current (CFS) scheduler implementation does not allow "to boost"
tasks performance by running them at a higher OPP compared to the
minimum required to meet their workload demands.

To support tasks performance boosting the scheduler should provide a
"knob" which allows to tune how much the system is going to be optimised
for energy efficiency vs performance.

This patch is the first of a series which provides a simple interface to
define a tuning knob. One system-wide "boost" tunable is exposed via:
  /proc/sys/kernel/sched_cfs_boost
which can be configured in the range [0..100], to define a percentage
where:
  - 0%   boost requires to operate in "standard" mode by scheduling
         tasks at the minimum capacities required by the workload demand
  - 100% boost requires to push at maximum the task performances,
         "regardless" of the incurred energy consumption

A boost value in between these two boundaries is used to bias the
power/performance trade-off, the higher the boost value the more the
scheduler is biased toward performance boosting instead of energy
efficiency.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
2016-05-10 16:53:24 +08:00
Juri Lelli
08d1cfd7c9 fixup! sched/fair: jump to max OPP when crossing UP threshold
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
2016-05-10 16:53:23 +08:00
Juri Lelli
3eb29105cf fixup! sched: scheduler-driven cpu frequency selection
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
2016-05-10 16:53:23 +08:00
Vincent Guittot
6e1e1ed872 sched: rt scheduler sets capacity requirement
RT tasks don't provide any running constraints like deadline ones
except their running priority. The only current usable input to
estimate the capacity needed by RT tasks is the rt_avg metric. We use
it to estimate the CPU capacity needed for the RT scheduler class.

In order to monitor the evolution for RT task load, we must
peridiocally check it during the tick.

Then, we use the estimated capacity of the last activity to estimate
the next one which can not be that accurate but is a good starting
point without any impact on the wake up path of RT tasks.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Steve Muckle <smuckle@linaro.org>
2016-05-10 16:53:23 +08:00
Vincent Guittot
fab5cc59bf sched: deadline: use deadline bandwidth in scale_rt_capacity
Instead of monitoring the exec time of deadline tasks to evaluate the
CPU capacity consumed by deadline scheduler class, we can directly
calculate it thanks to the sum of utilization of deadline tasks on the
CPU.  We can remove deadline tasks from rt_avg metric and directly use
the average bandwidth of deadline scheduler in scale_rt_capacity.

Based in part on a similar patch from Luca Abeni <luca.abeni@unitn.it>.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Steve Muckle <smuckle@linaro.org>
2016-05-10 16:53:23 +08:00
Vincent Guittot
cd248fa758 sched: remove call of sched_avg_update from sched_rt_avg_update
rt_avg is only used to scale the available CPU's capacity for CFS
tasks.  As the update of this scaling is done during periodic load
balance, we only have to ensure that sched_avg_update has been called
before any periodic load balancing. This requirement is already
fulfilled by __update_cpu_load so the call in sched_rt_avg_update,
which is part of the hotpath, is useless.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Steve Muckle <smuckle@linaro.org>
2016-05-10 16:53:23 +08:00
Steve Muckle
9d44dc7fe2 sched/cpufreq_sched: add trace events
Trace events will aid in debugging, profiling and tuning.

Signed-off-by: Steve Muckle <smuckle@linaro.org>
2016-05-10 16:53:23 +08:00
Steve Muckle
6b6c192453 sched/fair: jump to max OPP when crossing UP threshold
Since the true utilization of a long running task is not detectable
while it is running and might be bigger than the current cpu capacity,
create the maximum cpu capacity head room by requesting the maximum
cpu capacity once the cpu usage plus the capacity margin exceeds the
current capacity. This is also done to try to harm the performance of
a task the least.

Original fair-class only version authored by Juri Lelli
<juri.lelli@arm.com>.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Steve Muckle <smuckle@linaro.org>
2016-05-10 16:53:23 +08:00
Juri Lelli
f99e3fe6eb sched/fair: cpufreq_sched triggers for load balancing
As we don't trigger freq changes from {en,de}queue_task_fair() during load
balancing, we need to do explicitly so on load balancing paths.

[smuckle@linaro.org: move update_capacity_of calls so rq lock is held]

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Steve Muckle <smuckle@linaro.org>
2016-05-10 16:53:23 +08:00
Juri Lelli
7ff814dd71 sched/{core,fair}: trigger OPP change request on fork()
Patch "sched/fair: add triggers for OPP change requests" introduced OPP
change triggers for enqueue_task_fair(), but the trigger was operating only
for wakeups. Fact is that it makes sense to consider wakeup_new also (i.e.,
fork()), as we don't know anything about a newly created task and thus we
most certainly want to jump to max OPP to not harm performance too much.

However, it is not currently possible (or at least it wasn't evident to me
how to do so :/) to tell new wakeups from other (non wakeup) operations.

This patch introduces an additional flag in sched.h that is only set at
fork() time and it is then consumed in enqueue_task_fair() for our purpose.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Steve Muckle <smuckle@linaro.org>
2016-05-10 16:53:22 +08:00
Juri Lelli
ea429ccef1 sched/fair: add triggers for OPP change requests
Each time a task is {en,de}queued we might need to adapt the current
frequency to the new usage. Add triggers on {en,de}queue_task_fair() for
this purpose.  Only trigger a freq request if we are effectively waking up
or going to sleep.  Filter out load balancing related calls to reduce the
number of triggers.

[smuckle@linaro.org: resolve merge conflicts, define task_new,
 use renamed static key sched_freq]

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Steve Muckle <smuckle@linaro.org>
2016-05-10 16:53:22 +08:00
Michael Turquette
a967a45c71 sched: scheduler-driven cpu frequency selection
Scheduler-driven CPU frequency selection hopes to exploit both
per-task and global information in the scheduler to improve frequency
selection policy, achieving lower power consumption, improved
responsiveness/performance, and less reliance on heuristics and
tunables. For further discussion on the motivation of this integration
see [0].

This patch implements a shim layer between the Linux scheduler and the
cpufreq subsystem. The interface accepts capacity requests from the
CFS, RT and deadline sched classes. The requests from each sched class
are summed on each CPU with a margin applied to the CFS and RT
capacity requests to provide some headroom. Deadline requests are
expected to be precise enough given their nature to not require
headroom. The maximum total capacity request for a CPU in a frequency
domain drives the requested frequency for that domain.

Policy is determined by both the sched classes and this shim layer.

Note that this algorithm is event-driven. There is no polling loop to
check cpu idle time nor any other method which is unsynchronized with
the scheduler, aside from a throttling mechanism to ensure frequency
changes are not attempted faster than the hardware can accommodate them.

Thanks to Juri Lelli <juri.lelli@arm.com> for contributing design ideas,
code and test results, and to Ricky Liang <jcliang@chromium.org>
for initialization and static key inc/dec fixes.

[0] http://article.gmane.org/gmane.linux.kernel/1499836

[smuckle@linaro.org: various additions and fixes, revised commit text]

CC: Ricky Liang <jcliang@chromium.org>
Signed-off-by: Michael Turquette <mturquette@baylibre.com>
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Steve Muckle <smuckle@linaro.org>
2016-05-10 16:53:22 +08:00
Morten Rasmussen
0d2b1cdfa1 sched: Consider misfit tasks when load-balancing
With the new group_misfit_task load-balancing scenario additional policy
conditions are needed when load-balancing. Misfit task balancing only
makes sense between source group with lower capacity than the target
group. If capacities are the same, fallback to normal group_other
balancing. The aim is to balance tasks such that no task has its
throughput hindered by compute capacity if a cpu with more capacity is
available. Load-balancing is generally based on average load in the
sched_groups, but for misfitting tasks it is necessary to introduce
exceptions to migrate tasks against usual metrics and optimize
throughput.

This patch ensures the following load-balance for mixed capacity systems
(e.g. ARM big.LITTLE) for always-running tasks:

1. Place a task on each cpu starting in order from cpus with highest
capacity to lowest until all cpus are in use (i.e. one task on each
cpu).

2. Once all cpus are in use balance according to compute capacity such
that load per capacity is approximately the same regardless of the
compute capacity (i.e. big cpus get more tasks than little cpus).

Necessary changes are introduced in find_busiest_group(),
calculate_imbalance(), and find_busiest_queue(). This includes passing
the group_type on to find_busiest_queue() through struct lb_env, which
is currently only considers imbalance and not the imbalance situation
(group_type).

To avoid taking remote rq locks to examine source sched_groups for
misfit tasks, each cpu is responsible for tracking misfit tasks
themselves and update the rq->misfit_task flag. This means checking task
utilization when tasks are scheduled and on sched_tick.

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
2016-05-10 16:49:55 +08:00
Morten Rasmussen
f2a8923298 sched: Add group_misfit_task load-balance type
To maximize throughput in systems with reduced capacity cpus (e.g.
high RT/IRQ load and/or ARM big.LITTLE) load-balancing has to consider
task and cpu utilization as well as per-cpu compute capacity when
load-balancing in addition to the current average load based
load-balancing policy. Tasks that are scheduled on a reduced capacity
cpu need to be identified and migrated to a higher capacity cpu if
possible.

To implement this additional policy an additional group_type
(load-balance scenario) is added: group_misfit_task. This represents
scenarios where a sched_group has tasks that are not suitable for its
per-cpu capacity. group_misfit_task is only considered if the system is
not overloaded in any other way (group_imbalanced or group_overloaded).

Identifying misfit tasks requires the rq lock to be held. To avoid
taking remote rq locks to examine source sched_groups for misfit tasks,
each cpu is responsible for tracking misfit tasks themselves and update
the rq->misfit_task flag. This means checking task utilization when
tasks are scheduled and on sched_tick.

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
2016-05-10 16:49:54 +08:00
Morten Rasmussen
563ddb604e sched: Add per-cpu max capacity to sched_group_capacity
struct sched_group_capacity currently represents the compute capacity
sum of all cpus in the sched_group. Unless it is divided by the
group_weight to get the average capacity per cpu it hides differences in
cpu capacity for mixed capacity systems (e.g. high RT/IRQ utilization or
ARM big.LITTLE). But even the average may not be sufficient if the group
covers cpus of different capacities. Instead, by extending struct
sched_group_capacity to indicate max per-cpu capacity in the group a
suitable group for a given task utilization can easily be found such
that cpus with reduced capacity can be avoided for tasks with high
utilization (not implemented by this patch).

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
2016-05-10 16:49:54 +08:00
Dietmar Eggemann
e14f151823 sched: Do eas idle balance regardless of the rq avg idle value
EAS relies on idle balance to migrate a misfit task towards a cpu with
higher capacity.

When such a cpu becomes idle, idle balance should happen even if the rq
avg idle is smaller than the sched migration cost (default 500us).

The rq avg idle is updated during the wakeup of a task in case the rq has
a non-null idle_stamp. This value stays unchanged and valid until the next
task wakes up on this cpu after an idle period.

So rq avg idle could be smaller than sched migration cost preventing the
idle balance from happening. In this case we would be at the mercy of
wakeup, periodic or nohz-idle load balancing to put another task on this
cpu.

To break this dependency towards rq avg idle make EAS idle balance
independent from this rq avg idle has to be larger than sched migration
cost.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
2016-05-10 16:49:54 +08:00
Dietmar Eggemann
05a773bfc9 sched: Update max cpu capacity in case of max frequency constraints
Wakeup balancing uses cpu capacity awareness and needs to know the
system-wide maximum cpu capacity.

Patch "sched: Store system-wide maximum cpu capacity in root domain"
finds the system-wide maximum cpu capacity during scheduler domain
hierarchy setup. This is sufficient as long as maximum frequency
invariance is not enabled.

If it is enabled, the system-wide maximum cpu capacity can change
between scheduler domain hierarchy setups due to frequency capping.

The cpu capacity is changed in update_cpu_capacity() which is called in
load balance on the lowest scheduler domain hierarchy level. To be able
to know if a change in cpu capacity for a certain cpu also has an effect
on the system-wide maximum cpu capacity it is normally necessary to
iterate over all cpus. This would be way too costly. That's why this
patch follows a different approach.

The unsigned long max_cpu_capacity value in struct root_domain is
replaced with a struct max_cpu_capacity, containing value (the
max_cpu_capacity) and cpu (the cpu index of the cpu providing the
maximum cpu_capacity).

Changes to the system-wide maximum cpu capacity and the cpu index are
made if:

 1 System-wide maximum cpu capacity < cpu capacity
 2 System-wide maximum cpu capacity > cpu capacity and cpu index == cpu

There are no changes to the system-wide maximum cpu capacity in all
other cases.

Atomic read and write access to the pair (max_cpu_capacity.val,
max_cpu_capacity.cpu) is enforced by max_cpu_capacity.lock.

The access to max_cpu_capacity.val in task_fits_max() is still performed
without taking the max_cpu_capacity.lock.

The code to set max cpu capacity in build_sched_domains() has been
removed because the whole functionality is now provided by
update_cpu_capacity() instead.

This approach can introduce errors temporarily, e.g. in case the cpu
currently providing the max cpu capacity has its cpu capacity lowered
due to frequency capping and calls update_cpu_capacity() before any cpu
which might provide the max cpu now.

There is also an outstanding question:

Should the cpu capacity of a cpu going idle be set to a very small
value?

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
2016-05-10 16:49:54 +08:00
Robin Randhawa
d2b3db032e sched: Support for extracting EAS energy costs from DT
This patch implements support for extracting energy cost data from DT.
The data should conform to the DT bindings for energy cost data needed
by EAS (energy aware scheduling).

Signed-off-by: Robin Randhawa <robin.randhawa@arm.com>
2016-05-10 16:49:53 +08:00
Morten Rasmussen
b71188bd2c sched: Disable energy-unfriendly nohz kicks
With energy-aware scheduling enabled nohz_kick_needed() generates many
nohz idle-balance kicks which lead to nothing when multiple tasks get
packed on a single cpu to save energy. This causes unnecessary wake-ups
and hence wastes energy. Make these conditions depend on !energy_aware()
for now until the energy-aware nohz story gets sorted out.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
2016-05-10 16:49:52 +08:00
Dietmar Eggemann
52b7b8a37c sched: Consider a not over-utilized energy-aware system as balanced
In case the system operates below the tipping point indicator,
introduced in ("sched: Add over-utilization/tipping point
indicator"), bail out in find_busiest_group after the dst and src
group statistics have been checked.

There is simply no need to move usage around because all involved
cpus still have spare cycles available.

For an energy-aware system below its tipping point,  we rely on the
task placement of the wakeup path. This works well for short running
tasks.

The existence of long running tasks on one of the involved cpus lets
the system operate over its tipping point. To be able to move such
a task (whose load can't be used to average the load among the cpus)
from a src cpu with lower capacity than the dst_cpu, an additional
rule has to be implemented in need_active_balance.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
2016-05-10 16:49:52 +08:00
Morten Rasmussen
e38982ba32 sched: Energy-aware wake-up task placement
Let available compute capacity and estimated energy impact select
wake-up target cpu when energy-aware scheduling is enabled and the
system in not over-utilized (above the tipping point).

energy_aware_wake_cpu() attempts to find group of cpus with sufficient
compute capacity to accommodate the task and find a cpu with enough spare
capacity to handle the task within that group. Preference is given to
cpus with enough spare capacity at the current OPP. Finally, the energy
impact of the new target and the previous task cpu is compared to select
the wake-up target cpu.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
2016-05-10 16:49:52 +08:00
Dietmar Eggemann
ec055b978e sched: Determine the current sched_group idle-state
To estimate the energy consumption of a sched_group in
sched_group_energy() it is necessary to know which idle-state the group
is in when it is idle. For now, it is assumed that this is the current
idle-state (though it might be wrong). Based on the individual cpu
idle-states group_idle_state() finds the group idle-state.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
2016-05-10 16:49:52 +08:00
Morten Rasmussen
19a5ebe13d sched, cpuidle: Track cpuidle state index in the scheduler
The idle-state of each cpu is currently pointed to by rq->idle_state but
there isn't any information in the struct cpuidle_state that can used to
look up the idle-state energy model data stored in struct
sched_group_energy. For this purpose is necessary to store the idle
state index as well. Ideally, the idle-state data should be unified.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
2016-05-10 16:49:51 +08:00
Morten Rasmussen
1b5ec5d8ab sched: Add over-utilization/tipping point indicator
Energy-aware scheduling is only meant to be active while the system is
_not_ over-utilized. That is, there are spare cycles available to shift
tasks around based on their actual utilization to get a more
energy-efficient task distribution without depriving any tasks. When
above the tipping point task placement is done the traditional way based
on load_avg, spreading the tasks across as many cpus as possible based
on priority scaled load to preserve smp_nice. Below the tipping point we
want to use util_avg instead. We need to define a criteria for when we
make the switch.

The util_avg for each cpu converges towards 100% (1024) regardless of
how many task additional task we may put on it. If we define
over-utilized as:

sum_{cpus}(rq.cfs.avg.util_avg) + margin > sum_{cpus}(rq.capacity)

some individual cpus may be over-utilized running multiple tasks even
when the above condition is false. That should be okay as long as we try
to spread the tasks out to avoid per-cpu over-utilization as much as
possible and if all tasks have the _same_ priority. If the latter isn't
true, we have to consider priority to preserve smp_nice.

For example, we could have n_cpus nice=-10 util_avg=55% tasks and
n_cpus/2 nice=0 util_avg=60% tasks. Balancing based on util_avg we are
likely to end up with nice=-10 tasks sharing cpus and nice=0 tasks
getting their own as we 1.5*n_cpus tasks in total and 55%+55% is less
over-utilized than 55%+60% for those cpus that have to be shared. The
system utilization is only 85% of the system capacity, but we are
breaking smp_nice.

To be sure not to break smp_nice, we have defined over-utilization
conservatively as when any cpu in the system is fully utilized at it's
highest frequency instead:

cpu_rq(any).cfs.avg.util_avg + margin > cpu_rq(any).capacity

IOW, as soon as one cpu is (nearly) 100% utilized, we switch to load_avg
to factor in priority to preserve smp_nice.

With this definition, we can skip periodic load-balance as no cpu has an
always-running task when the system is not over-utilized. All tasks will
be periodic and we can balance them at wake-up. This conservative
condition does however mean that some scenarios that could benefit from
energy-aware decisions even if one cpu is fully utilized would not get
those benefits.

For system where some cpus might have reduced capacity on some cpus
(RT-pressure and/or big.LITTLE), we want periodic load-balance checks as
soon a just a single cpu is fully utilized as it might one of those with
reduced capacity and in that case we want to migrate it.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
2016-05-10 16:49:51 +08:00