The usage of conditionally compiled code is discouraged in fair.c.
This patch cleans up fair.c a bit by moving the schedtune_{cpu,task}_boost
definitions into tune.h.
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
The definition of the acceptance regions, as well as the translation of
these regions into a payoff value, were both wrong, which resulted in:
a) a wrong definition of the payoff for the performance boost region
b) a definition of the payoff for the performance constraint region which
was correct only "by chance" (i.e. two sign errors cancelling each other out)
This patch provides a better description of the cut regions as well as
a fixed version of the payoff computations, which are now reduced to a
single formula usable for both cases.
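As an illustration, the unified payoff can be expressed with a single
expression; the following sketch assumes hypothetical per-region gains
nrg_gain and cap_gain and is not the exact kernel code:

static long schedtune_payoff(long nrg_delta, long cap_delta,
                             int nrg_gain, int cap_gain)
{
        /*
         * payoff > 0: the performance gain outweighs the energy increase
         *             (or the energy saving outweighs the performance loss)
         * payoff < 0: the candidate should be rejected
         */
        return cap_delta * nrg_gain - nrg_delta * cap_gain;
}

With this form the sign of the deltas already encodes which region
applies, so the same expression can serve both cases.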
Reported-by: Leo Yan <leo.yan@linaro.org>
Reviewed-by: Leo Yan <leo.yan@linaro.org>
Signed-off-by: Leo Yan <leo.yan@linaro.org>
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Change-Id: I164ee04ba98c3a776605f18cb65ee61b3e917939
Contains also:
eas/stune: schedtune cpu boost_max must be non-negative.
This is to avoid under-accounting cpu capacity which may
cause task stacking and frequency spikes.
Change-Id: Ie1c1cbd52a6edb77b4c15a830030aa748dff6f29
The energy normalization function is required to get the proper values
for the P-E space filtering function to work.
That normalization is part of the hot wakeup path and is currently
implemented with a function call.
Moving the normalization function into fair.c allows the compiler to
further optimize that code by reducing overheads in the wakeup hot path.
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
[jstultz: fwdported to 4.4]
Signed-off-by: John Stultz <john.stultz@linaro.org>
A boosted task needs to be scheduled on a CPU which can grant a minimum
capacity which is higher than its utilization.
However, a task can be allocated on a CPU which already provides a
utilization higher than the boosted utilization of the task itself.
Moreover, with the previous approach a 100% boosted task does not fit on
any CPU.
This patch uses the boosted task utilization just as a threshold
which defines the minimum capacity that should be available on a CPU to
host that task.
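A minimal sketch of the threshold check, assuming a ~20% capacity margin
and illustrative names (not the exact kernel helpers):

#define CAPACITY_MARGIN 1280 /* ~20% headroom, scaled by 1024 (assumption) */

static int task_fits_cpu(unsigned long boosted_task_util,
                         unsigned long cpu_capacity)
{
        /* the CPU only needs to offer at least this much capacity */
        return (cpu_capacity * 1024) > (boosted_task_util * CAPACITY_MARGIN);
}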
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
The choice of initial task load upon fork has a large influence
on CPU and OPP selection when scheduler-driven DVFS is in use.
Make this tunable by adding a new sysctl "sched_initial_task_util".
If the sched governor is not used, the default remains SCHED_LOAD_SCALE.
Otherwise, the value from the sysctl is used, which defaults to 0.
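A minimal sketch of the intended selection (names are illustrative;
sched_freq_enabled stands in for "the sched governor is in use"):

#define SCHED_LOAD_SCALE 1024UL

unsigned int sysctl_sched_initial_task_util; /* defaults to 0 */

static unsigned long initial_task_util(int sched_freq_enabled)
{
        if (!sched_freq_enabled)
                return SCHED_LOAD_SCALE;
        return sysctl_sched_initial_task_util;
}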
Signed-off-by: "Todd Kjos <tkjos@google.com>"
EAS assumes that clusters with smaller capacity cores are more
energy-efficient. This may not be true on non-big-little devices,
so EAS can make incorrect cluster selections when finding a CPU
to wake. The "sched_is_big_little" hint can be used to cause a
cpu-based selection instead of cluster-based selection.
This change also incorporates the sync hint enable patch:
EAS did not honour synchronous wakeup hints, so a new sysctl is
created to ask EAS to use this information when selecting a CPU.
The control is called "sched_sync_hint_enable".
Also contains:
EAS: sched/fair: for SMP bias toward idle core with capacity
For SMP devices, on wakeup bias towards idle cores that have spare
capacity vs busy CPUs that need a higher OPP
eas: favor idle cpus for boosted tasks
BUG: 29533997
BUG: 29512132
Change-Id: I0cc9a1b1b88fb52916f18bf2d25715bdc3634f9c
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Srinath Sridharan <srinathsr@google.com>
eas/sched/fair: Favoring busy cpus with low OPPs
BUG: 29533997
BUG: 29512132
Change-Id: I9305b3239698d64278db715a2e277ea0bb4ece79
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Introduce a new sysctl for this option, 'sched_cstate_aware'.
When this is enabled, select_idle_sibling in CFS is modified to
choose the idle CPU in the sibling group which has the lowest
idle state index - idle state indexes are assumed to increase
as sleep depth and hence wakeup latency increase. In this way,
we attempt to minimise wakeup latency when an idle CPU is
required.
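The selection policy can be sketched as follows (a self-contained
illustration, not the select_idle_sibling() code itself):

#include <limits.h>

/* Pick the idle CPU with the lowest idle-state index (shallowest sleep). */
static int pick_shallowest_idle_cpu(const int *idle_state_idx,
                                    const int *is_idle, int nr_cpus)
{
        int cpu, best_cpu = -1, best_idx = INT_MAX;

        for (cpu = 0; cpu < nr_cpus; cpu++) {
                if (!is_idle[cpu])
                        continue;
                if (idle_state_idx[cpu] < best_idx) {
                        best_idx = idle_state_idx[cpu];
                        best_cpu = cpu;
                }
        }
        return best_cpu;
}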
Signed-off-by: Srinath Sridharan <srinathsr@google.com>
Includes:
sched: EAS: fix select_idle_sibling
when sysctl_sched_cstate_aware is enabled, the best_idle cpu will not be
chosen in the original flow because it goes directly to the 'done' label
Bug: 30107557
Change-Id: Ie09c2e3960cafbb976f8d472747faefab3b4d6ac
Signed-off-by: martin_liu <martin_liu@htc.com>
Contains:
sched/cpufreq_sched: use shorter throttle for raising OPP
Avoid cases where a brief drop in load causes a change to a low OPP
for the full throttle period. Use a shorter throttle period for
raising OPP than for lowering OPP.
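A minimal sketch of the asymmetric throttle (the field names and the
specific values are assumptions, not the actual tunables):

#include <stdint.h>

struct gov_throttle {
        uint64_t up_throttle_nsec;   /* short period when raising the OPP */
        uint64_t down_throttle_nsec; /* longer period when lowering the OPP */
};

static int freq_change_allowed(const struct gov_throttle *t, uint64_t now,
                               uint64_t last_change, int raising)
{
        uint64_t throttle = raising ? t->up_throttle_nsec
                                    : t->down_throttle_nsec;

        return (now - last_change) >= throttle;
}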
sched-freq: Fix handling of max/min frequency
This reverts commit 9726142608f5b3bf5df4280243c9d324e692a510.
Change-Id: Ia78095354f7ad9492f00deb509a2b45112361eda
sched/cpufreq: Increasing throttle_down_nsec to 50ms
Change-Id: I2d8969cf2a64fa719b9dd86f43f9dd14b1ff84fe
sched-freq: make throttle times tunable
Change-Id: I127879645367425b273441d7f0306bb15d5633cb
Signed-off-by: Srinath Sridharan <srinathsr@google.com>
Signed-off-by: Todd Kjos <tkjos@google.com>
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
[jstultz: Fwdported to 4.4]
Signed-off-by: John Stultz <john.stultz@linaro.org>
Currently the build for a single-core (e.g. user-mode) Linux is broken
and this configuration is required (at least) to run some network tests.
The main issues with the current code on single-core systems are:
1. {se,rq}::sched_avg is neither available nor maintained for !SMP systems.
This means that load and utilisation signals are NOT available on
single-core systems. All the EAS code depends on these signals.
2. sched_group_energy is also SMP dependent. Again, this means that all the
EAS setup and preparation code (energy model initialization) has to be
properly guarded/disabled for !SMP systems.
3. SchedFreq depends on the utilization signal, which is not available on
!SMP systems.
4. SchedTune is useless on unicore systems if SchedFreq is not available.
5. WALT machinery is not required on single-core systems.
This patch addresses all these issues by enforcing some constraints for
single-core systems:
a) WALT, SchedTune and SchedFreq are now dependent on SMP
b) The default governor for !SMP systems is INTERACTIVE
c) The energy model initialisation/build functions are compiled only for
SMP systems
d) Other minor code re-arrangements and CONFIG_SMP guarding to enable
single core builds.
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Doing an exponential moving average per nr_running++/-- does not
guarantee a fixed sample rate which induces errors if there are lots of
threads being enqueued/dequeued from the rq (Linpack mt). Instead of
keeping track of the avg, the scheduler now keeps track of the integral
of nr_running and allows the readers to perform filtering on top.
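A minimal sketch of the bookkeeping (field names are illustrative, not
the actual runqueue members):

#include <stdint.h>

struct rq_stats {
        uint64_t nr_running_integral; /* sum of nr_running * dt */
        uint64_t last_update_ns;
        unsigned int nr_running;
};

/* Called on every nr_running++/-- with the current timestamp. */
static void update_nr_running_integral(struct rq_stats *s, uint64_t now_ns)
{
        s->nr_running_integral +=
                (uint64_t)s->nr_running * (now_ns - s->last_update_ns);
        s->last_update_ns = now_ns;
}

A reader then derives the average over any window [t0, t1] as
(integral(t1) - integral(t0)) / (t1 - t0), applying whatever filtering it
wants on top.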
Original-author: Sai Charan Gurrappadi <sgurrappadi@nvidia.com>
Change-Id: Id946654f32fa8be0eaf9d8fa7c9a8039b5ef9fab
Signed-off-by: Joseph Lo <josephl@nvidia.com>
Signed-off-by: Andrew Bresticker <abrestic@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/174694
Reviewed-on: https://chromium-review.googlesource.com/272853
[jstultz: fwdported to 4.4]
Signed-off-by: John Stultz <john.stultz@linaro.org>
This reverts commit 4e09c51018.
I checked for the usage of this debug helper in AOSP common kernels as
well as vendor kernels (e.g. exynos, msm, mediatek, omap, tegra, x86,
x86_64) hosted at https://android.googlesource.com/kernel/ and I found
out that, other than a few fairly obsolete Omap trees (for tuna & Glass)
and the Exynos tree (for Manta), there is no active user of this debug
helper. So we can safely remove this helper code.
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
(cherry picked from commit 29d6455178a09e1dc340380c582b13356227e8df)
Until now, hitting this BUG_ON caused a recursive oops (because oops
handling involves do_exit(), which calls into the scheduler, which in
turn raises an oops), which caused stuff below the stack to be
overwritten until a panic happened (e.g. via an oops in interrupt
context, caused by the overwritten CPU index in the thread_info).
Just panic directly.
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change-Id: Ia3acb3f747f7a58ec2d071644433b0591925969f
Bug: 29444228
This patch provides an allow_attach hook for cpusets,
which resolves a lot of the following logcat noise.
W SchedPolicy: add_tid_to_cgroup failed to write '2816' (Permission denied); fd=29
W ActivityManager: Failed setting process group of 2816 to 0
W System.err: java.lang.IllegalArgumentException
W System.err: at android.os.Process.setProcessGroup(Native Method)
W System.err: at com.android.server.am.ActivityManagerService.applyOomAdjLocked(ActivityManagerService.java:18763)
W System.err: at com.android.server.am.ActivityManagerService.updateOomAdjLocked(ActivityManagerService.java:19028)
W System.err: at com.android.server.am.ActivityManagerService.updateOomAdjLocked(ActivityManagerService.java:19106)
W System.err: at com.android.server.am.ActiveServices.serviceDoneExecutingLocked(ActiveServices.java:2015)
W System.err: at com.android.server.am.ActiveServices.publishServiceLocked(ActiveServices.java:905)
W System.err: at com.android.server.am.ActivityManagerService.publishService(ActivityManagerService.java:16065)
W System.err: at android.app.ActivityManagerNative.onTransact(ActivityManagerNative.java:1007)
W System.err: at com.android.server.am.ActivityManagerService.onTransact(ActivityManagerService.java:2493)
W System.err: at android.os.Binder.execTransact(Binder.java:453)
Change-Id: Ic1b61b2bbb7ce74c9e9422b5e22ee9078251de21
[Ported to 4.4, added commit message]
Signed-off-by: John Stultz <john.stultz@linaro.org>
This patch backports 969624b (which backports caaee6234d0 upstream),
from the v4.4-stable branch to the common/android-4.4 branch.
This patch is needed to provide the PTRACE_MODE_ATTACH_FSCREDS definition
which was used by the backported version of proc/<tid>/timerslack_ns
in change-id: Ie5799b9a3402a31f88cd46437dcda4a0e46415a7
commit caaee6234d05a58c5b4d05e7bf766131b810a657 upstream.
By checking the effective credentials instead of the real UID / permitted
capabilities, ensure that the calling process actually intended to use its
credentials.
To ensure that all ptrace checks use the correct caller credentials (e.g.
in case out-of-tree code or newly added code omits the PTRACE_MODE_*CREDS
flag), use two new flags and require one of them to be set.
The problem was that when a privileged task had temporarily dropped its
privileges, e.g. by calling setreuid(0, user_uid), with the intent to
perform following syscalls with the credentials of a user, it still passed
ptrace access checks that the user would not be able to pass.
While an attacker should not be able to convince the privileged task to
perform a ptrace() syscall, this is a problem because the ptrace access
check is reused for things in procfs.
In particular, the following somewhat interesting procfs entries only rely
on ptrace access checks:
/proc/$pid/stat - uses the check for determining whether pointers
should be visible, useful for bypassing ASLR
/proc/$pid/maps - also useful for bypassing ASLR
/proc/$pid/cwd - useful for gaining access to restricted
directories that contain files with lax permissions, e.g. in
this scenario:
lrwxrwxrwx root root /proc/13020/cwd -> /root/foobar
drwx------ root root /root
drwxr-xr-x root root /root/foobar
-rw-r--r-- root root /root/foobar/secret
Therefore, on a system where a root-owned mode 6755 binary changes its
effective credentials as described and then dumps a user-specified file,
this could be used by an attacker to reveal the memory layout of root's
processes or reveal the contents of files he is not allowed to access
(through /proc/$pid/cwd).
[akpm@linux-foundation.org: fix warning]
Signed-off-by: Jann Horn <jann@thejh.net>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Casey Schaufler <casey@schaufler-ca.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Morris <james.l.morris@oracle.com>
Cc: "Serge E. Hallyn" <serge.hallyn@ubuntu.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Willy Tarreau <w@1wt.eu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[jstultz: Cherry-picked for common/android-4.4]
Signed-off-by: John Stultz <john.stultz@linaro.org>
This backports da8b44d5a9f8bf26da637b7336508ca534d6b319 from upstream.
This patchset introduces a /proc/<pid>/timerslack_ns interface which
would allow controlling processes to be able to set the timerslack value
on other processes in order to save power by avoiding wakeups (Something
Android currently does via out-of-tree patches).
The first patch tries to fix the internal timer_slack_ns usage which was
defined as a long, which limits the slack range to ~4 seconds on 32bit
systems. It converts it to a u64, which provides the same basically
unlimited slack (500 years) on both 32bit and 64bit machines.
The second patch introduces the /proc/<pid>/timerslack_ns interface
which allows the full 64bit slack range for a task to be read or set on
both 32bit and 64bit machines.
With these two patches, on a 32bit machine, after setting the slack on
bash to 10 seconds:
$ time sleep 1
real 0m10.747s
user 0m0.001s
sys 0m0.005s
The first patch is a little ugly, since I had to chase the slack delta
arguments through a number of functions converting them to u64s. Let me
know if it makes sense to break that up more or not.
Other than that things are fairly straightforward.
This patch (of 2):
The timer_slack_ns value in the task struct is currently an unsigned
long. This means that for 32bit applications, the maximum slack is just
over 4 seconds. However, on 64bit machines, it's much larger (~500
years).
This disparity could make application development a little confusing.
This patch converts timer_slack_ns (as well as the default_slack) to a
u64. This means both 32bit and 64bit systems have the same effective
internal slack range.
Now the existing ABI via PR_GET_TIMERSLACK and PR_SET_TIMERSLACK specifies
the interface as an unsigned long, so we preserve that limitation on 32bit
systems, where SET_TIMERSLACK can only set the slack to an unsigned long
value, and GET_TIMERSLACK will return ULONG_MAX if the slack is actually
larger than what can be stored by an unsigned long.
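A minimal sketch of preserving the 32bit ABI on top of the 64bit internal
value (not the exact kernel code):

#include <stdint.h>
#include <limits.h>

static unsigned long get_timerslack_abi(uint64_t slack_ns)
{
        /* the internal 64bit slack may exceed what the old ABI can report */
        if (slack_ns > (uint64_t)ULONG_MAX)
                return ULONG_MAX;
        return (unsigned long)slack_ns;
}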
This patch also modifies the hrtimer functions which specified the slack
delta as an unsigned long.
Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Oren Laadan <orenl@cellrox.com>
Cc: Ruchi Kandoi <kandoiruchi@google.com>
Cc: Rom Lemarchand <romlem@android.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Android Kernel Team <kernel-team@android.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In case some sysfs nodes need to be labeled with a different label than
sysfs, the user needs to be notified when a core is brought back online.
Signed-off-by: Thierry Strudel <tstrudel@google.com>
Bug: 29359497
Change-Id: I0395c86e01cd49c348fda8f93087d26f88557c91
When kernel.perf_event_open is set to 3 (or greater), disallow all
access to performance events by users without CAP_SYS_ADMIN.
Add a Kconfig symbol CONFIG_SECURITY_PERF_EVENTS_RESTRICT that
makes this value the default.
This is based on a similar feature in grsecurity
(CONFIG_GRKERNSEC_PERF_HARDEN). This version doesn't include making
the variable read-only. It also allows enabling further restriction
at run-time regardless of whether the default is changed.
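The resulting check can be sketched as follows (an illustration of the
semantics, not the kernel's exact code):

#include <stdbool.h>

static bool perf_event_open_allowed(int paranoid_level,
                                    bool has_cap_sys_admin)
{
        /* level 3 (or greater): all access requires CAP_SYS_ADMIN */
        if (paranoid_level > 2 && !has_cap_sys_admin)
                return false;
        return true;
}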
https://lkml.org/lkml/2016/1/11/587
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Bug: 29054680
Change-Id: Iff5bff4fc1042e85866df9faa01bce8d04335ab8
This patch makes the energy data available via procfs. The related files
are placed in a sub-directory named 'energy' inside the
/proc/sys/kernel/sched_domain/cpuX/domainY/groupZ directory for those
cpu/domain/group tuples which have energy information.
The following example depicts the contents of
/proc/sys/kernel/sched_domain/cpu0/domain0/group[01] for a system which
has energy information attached to domain level 0.
├── cpu0
│ ├── domain0
│ │ ├── busy_factor
│ │ ├── busy_idx
│ │ ├── cache_nice_tries
│ │ ├── flags
│ │ ├── forkexec_idx
│ │ ├── group0
│ │ │ └── energy
│ │ │ ├── cap_states
│ │ │ ├── idle_states
│ │ │ ├── nr_cap_states
│ │ │ └── nr_idle_states
│ │ ├── group1
│ │ │ └── energy
│ │ │ ├── cap_states
│ │ │ ├── idle_states
│ │ │ ├── nr_cap_states
│ │ │ └── nr_idle_states
│ │ ├── idle_idx
│ │ ├── imbalance_pct
│ │ ├── max_interval
│ │ ├── max_newidle_lb_cost
│ │ ├── min_interval
│ │ ├── name
│ │ ├── newidle_idx
│ │ └── wake_idx
│ └── domain1
│ ├── busy_factor
│ ├── busy_idx
│ ├── cache_nice_tries
│ ├── flags
│ ├── forkexec_idx
│ ├── idle_idx
│ ├── imbalance_pct
│ ├── max_interval
│ ├── max_newidle_lb_cost
│ ├── min_interval
│ ├── name
│ ├── newidle_idx
│ └── wake_idx
The files 'nr_idle_states' and 'nr_cap_states' contain a scalar value,
whereas 'idle_states' contains a vector of per-idle-state power
consumption values and 'cap_states' contains a vector of
(compute capacity, power consumption) pairs, one per capacity state.
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Once the SchedTune support is enabled and the CPU bandwidth demand of a
task is boosted, we can expect increased energy consumption which is
balanced by a corresponding increase in task performance.
However, the current implementation of the energy_diff() function
accepts all and _only_ the schedule candidates which result in a
reduced expected system energy, which works against the boosting
strategy.
This patch links the energy_diff() function with the "energy payoff"
engine provided by SchedTune. The energy variation computed by the
energy_diff() function is now filtered using the SchedTune support to
evaluate the energy payoff for a boosted task.
With this patch, the energy_diff() function reports as an
"acceptable schedule candidate" only a schedule candidate which
corresponds to a positive energy_payoff.
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
The current EAS implementation considers only energy variations, while it
completely disregards the impact on performance when selecting
a certain schedule candidate. Moreover, it also makes its decision based
on the "absolute" value of expected energy variations.
In order to properly define a trade-off strategy between increased energy
consumption and performance benefits, it is required to compare energy
variations with performance variations.
Thus, both performance and energy metrics must be expressed in comparable
units. While the performance variations are expressed in terms of capacity
deltas, which are defined in the range [0..SCHED_LOAD_SCALE], the same
scale is not used for energy variations.
This patch introduces the function:
schedtune_normalize_energy(energy_diff)
which returns a normalized value in the same range of capacity variations,
i.e. [0..SCHED_LOAD_SCALE].
A proper set of energy normalization constants is required to provide
a fast division by a constant during the normalization of the energy_diff.
The values of these constants depend on the specific energy model and
topology of a target device.
Thus, this patch provides also the required support for the computation
at boot time of this set of variables.
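The normalization itself can be sketched as follows (names are
illustrative; in the actual implementation the division is replaced by
the fast, constant-based division mentioned above):

#define SCHED_LOAD_SCALE 1024L

/* energy_range: maximum possible |energy delta| for this platform,
 * computed at boot time from the energy model (an assumption). */
static long schedtune_normalize_energy_sketch(long energy_diff,
                                              long energy_range)
{
        return (energy_diff * SCHED_LOAD_SCALE) / energy_range;
}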
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
The current EAS implementation does not allow task performance "to be
boosted", for example by running a task at a higher OPP (or on a more
capable CPU), even if that could require a "reasonable" increase in
energy consumption. To define how reasonable an energy increase is with
respect to a required boost value, it is required to
define and compute a trade-off between the expected energy and
performance variations.
However, the current EAS implementation considers only energy variations
while completely disregarding the impact on performance when selecting
a certain schedule candidate.
This patch extends the eenv energy environment to keep track of both
energy and performance deltas which are implied by the activation of a
schedule candidate.
The performance variation is estimated considering the different
capacities of the CPUs in which the task could be scheduled. The idea is
that while running on a CPU with higher capacity (e.g. higher operating
point) the task could (potentially) complete faster and thus get better
performance.
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
The task utilization signal, which is derived from PELT signals and
properly scaled to be architecture and frequency invariant, is used by
EAS as an estimation of the task requirements in terms of CPU bandwidth.
When the energy aware scheduler is in use, this signal affects the CPU
selection. Thus, a convenient and minimally intrusive way to bias that
decision is to boost the task utilization signal each time it is
required to support boosted tasks.
This patch introduces the new function:
boosted_task_util(task)
which returns a boosted value for the utilization of the specified task.
The margin added to the original utilization is:
1. computed based on the "boosting strategy" in use
2. proportional to the boost value defined either by the sysctl interface,
when global boosting is in use, or by the "taskgroup" value, when
per-task boosting is enabled.
The boosted signal is used by EAS:
a. transparently, via its integration into the task_fits() function
b. explicitly, in the energy-aware wakeup path
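A minimal sketch of the resulting signal (the margin computation follows
the boosting strategy in use; the names below are illustrative):

#define SCHED_LOAD_SCALE 1024UL

static unsigned long boosted_task_util_sketch(unsigned long task_util,
                                              unsigned int boost_pct)
{
        /* margin proportional to the boost value and the signal headroom */
        unsigned long margin =
                (boost_pct * (SCHED_LOAD_SCALE - task_util)) / 100;

        return task_util + margin;
}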
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
When per-task boosting is enabled, every time a task enters/exits a CPU
its boost value could impact the currently selected OPP for that CPU.
Thus, the "aggregated" boost value for that CPU potentially needs to
be updated to match the current maximum boost value among all the tasks
currently RUNNABLE on that CPU.
This patch introduces the required support to keep track of which boost
groups are impacting a CPU. Each time a task is enqueued/dequeued to/from
a CPU, its boost group is used to update a per-cpu counter of the
RUNNABLE tasks of that group on that CPU.
Only when the number of runnable tasks for a specific boost group
becomes 1 or 0 does the corresponding boost group change its effect on
that CPU, specifically:
a) boost_group::tasks == 1: this boost group starts to impact the CPU
b) boost_group::tasks == 0: this boost group stops impacting the CPU
In each of these two conditions the aggregation function:
sched_cpu_update(cpu)
could be required to run in order to identify the new maximum boost
value required for the CPU.
The proposed patch minimizes the number of times the aggregation
function is executed while still providing the required support to
always boost a CPU to the maximum boost value required by all its
currently RUNNABLE tasks.
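A minimal sketch of the bookkeeping (illustrative names, not the kernel
code): only the 0 <-> 1 transitions of a group's RUNNABLE counter trigger
the aggregation.

/* delta is +1 on enqueue, -1 on dequeue */
static void boost_group_task_update(unsigned int *group_tasks, int delta,
                                    void (*sched_cpu_update)(int cpu),
                                    int cpu)
{
        *group_tasks += delta;

        /* only a 0 <-> 1 transition changes this group's effect on the CPU */
        if ((delta > 0 && *group_tasks == 1) ||
            (delta < 0 && *group_tasks == 0))
                sched_cpu_update(cpu);
}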
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
When per-task boosting is enabled, we could have multiple RUNNABLE tasks
which are concurrently scheduled on the same CPU, each with a different
boost value.
For example, we could have a scenario like this:
  Task   SchedTune CGroup   Boost Value
  T1     root               0
  T2     low-priority       10
  T3     interactive        90
In these conditions we expect a CPU to be configured according to a
proper "aggregation" of the required boost values for all the tasks
currently scheduled on this CPU.
A suitable aggregation function is the one which tracks the MAX boost
value for all the tasks RUNNABLE on a CPU. This approach allows us to
always satisfy the most boost-demanding task while at the same time:
a) boosting all the concurrently scheduled tasks, thus reducing
potential co-scheduling side-effects on demanding tasks
b) reducing the number of frequency switches requested towards SchedDVFS,
thus being more friendly to architectures with slow frequency
switching times
Every time a task enters/exits the RQ of a CPU the max boost value
should be updated considering all the boost groups currently "affecting"
that CPU, i.e. which have at least one RUNNABLE task currently allocated
on that CPU.
This patch introduces the required support to keep track of the boost
groups currently affecting CPUs. Thanks to the limited number of boost
groups, a small and memory-efficient per-cpu array of boost group values
(cpu_boost_groups) is used; it is updated for each CPU entry by
schedtune_boostgroup_update(), but only when a schedtune CGroup boost
value is updated. However, this is expected to be a rare operation,
perhaps done just once at system boot time.
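The aggregation itself reduces to a MAX over the groups which currently
have RUNNABLE tasks on the CPU; a minimal sketch (structure and names are
assumptions):

#define BOOSTGROUPS_COUNT 4 /* assumption: small, fixed number of groups */

struct boost_group {
        int boost;          /* boost value of this group */
        unsigned int tasks; /* RUNNABLE tasks of this group on this CPU */
};

static int cpu_boost_max(const struct boost_group *bg)
{
        int idx, boost_max = 0;

        for (idx = 0; idx < BOOSTGROUPS_COUNT; idx++) {
                if (bg[idx].tasks == 0)
                        continue; /* group not affecting this CPU */
                if (bg[idx].boost > boost_max)
                        boost_max = bg[idx].boost;
        }
        return boost_max;
}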
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
To support task performance boosting, the usage of a single knob has the
advantage of being a simple solution, both from the implementation and the
usability standpoints. However, on a real system it can be difficult to
identify a single value for the knob which fits the needs of multiple
different tasks. For example, some kernel threads and/or user-space
background services should be better managed the "standard" way while we
still want to be able to boost the performance of specific workloads.
In order to improve the flexibility of the task boosting mechanism this
patch is the first of a small series which extends the previous
implementation to introduce a "per task group" support.
This first patch introduces just the basic CGroups support: a new
"schedtune" CGroups controller is added which allows configuring
different boost values for different groups of tasks.
To keep the implementation simple but still effective for a boosting
strategy, the new controller:
1. allows only a two layer hierarchy
2. supports only a limited number of boost groups
A two layer hierarchy allows each task to be placed either:
a) in the root control group
thus being subject to a system-wide boosting value
b) in a child of the root group
thus being subject to the specific boost value defined by that
"boost group"
The limited number of "boost groups" supported is mainly motivated by
the observation that in a real system it could be useful to have only
a few classes of tasks which deserve different treatment.
For example, background vs foreground or interactive vs low-priority.
As an additional benefit, a limited number of boost groups also allows
for a simpler implementation, especially for the code required to
compute the boost value for CPUs which have runnable tasks belonging to
different boost groups.
cc: Tejun Heo <tj@kernel.org>
cc: Li Zefan <lizefan@huawei.com>
cc: Johannes Weiner <hannes@cmpxchg.org>
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
The CPU usage signal is used by the scheduler as an estimation of the
overall bandwidth currently allocated on a CPU. When SchedDVFS is in
use, this signal affects the selection of the operating points (OPP)
required to accommodate all the workload allocated in a CPU.
A convenient and minimally intrusive way to boost the performance of
tasks running on a CPU is to boost the CPU usage signal each time it
is used to select an OPP.
This patch introduces a new function:
get_boosted_cpu_usage(cpu)
to return a boosted value for the usage of a specified CPU.
The margin added to the original usage is:
1. computed based on the "boosting strategy" in use
2. proportional to the system-wide boost value defined via the provided
user-space interface
The boosted signal is used by SchedDVFS (transparently) each time it
needs an estimation of the capacity required for a CPU.
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
The basic idea of the boost knob is to "artificially inflate" a signal
to make a task or logical CPU appear more demanding than it actually
is. Independently from the specific signal, a consistent and possibly
simple semantic for the concept of "signal boosting" must define:
1. how we translate the boost percentage into a "margin" value to be added
to the original signal to inflate
2. what is the meaning of a boost value from a user-space perspective
This patch provides the implementation of a possible boost semantic,
named "Signal Proportional Compensation" (SPC), where the boost
percentage (BP) is used to compute a margin (M) which is proportional to
the complement of the original signal (OS):
M = BP * (SCHED_LOAD_SCALE - OS)
The computed margin is then added to the OS to obtain the Boosted Signal (BS):
BS = OS + M
The proposed boost semantic has these main features:
- each signal gets a boost which is proportional to its delta with respect
to the maximum available capacity in the system (i.e. SCHED_LOAD_SCALE)
- a 100% boost has a clear meaning from a user-space perspective,
since it simply means running (possibly) "all" tasks at the max OPP
- each boost value improves the task performance by a quantity
which is proportional to the maximum achievable performance on that
system
Thus this semantic somewhat forces a behaviour whereby, for example:
50% boosting means running half-way between the current and the
maximum performance which a task could achieve on that system
This patch provides the code to implement a fast integer division to
convert a boost percentage (BP) value into a margin (M).
NOTE: this code is suitable for all signals operating in range
[0..SCHED_LOAD_SCALE]
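A minimal sketch of the margin computation with a worked example
(SCHED_LOAD_SCALE assumed to be 1024; the patch additionally provides a
fast integer division for the /100 step):

#define SCHED_LOAD_SCALE 1024UL

static unsigned long spc_margin(unsigned long orig_signal,
                                unsigned int boost_pct)
{
        return (boost_pct * (SCHED_LOAD_SCALE - orig_signal)) / 100;
}

/* Example: OS = 512, BP = 50  ->  M = 256, BS = 768, i.e. half-way
 * between the original signal and the maximum capacity. */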
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
The current (CFS) scheduler implementation does not allow task
performance "to be boosted" by running tasks at a higher OPP compared to
the minimum required to meet their workload demands.
To support task performance boosting, the scheduler should provide a
"knob" which allows tuning how much the system is going to be optimised
for energy efficiency vs performance.
This patch is the first of a series which provides a simple interface to
define a tuning knob. One system-wide "boost" tunable is exposed via:
/proc/sys/kernel/sched_cfs_boost
which can be configured in the range [0..100], to define a percentage
where:
- 0% boost means operating in "standard" mode, scheduling
tasks at the minimum capacities required by the workload demand
- 100% boost means pushing task performance to the maximum,
"regardless" of the incurred energy consumption
A boost value in between these two boundaries is used to bias the
power/performance trade-off: the higher the boost value, the more the
scheduler is biased toward performance boosting instead of energy
efficiency.
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
RT tasks don't provide any running constraints like deadline ones
except their running priority. The only current usable input to
estimate the capacity needed by RT tasks is the rt_avg metric. We use
it to estimate the CPU capacity needed for the RT scheduler class.
In order to monitor the evolution of the RT task load, we must
periodically check it during the tick.
Then, we use the estimated capacity of the last activity to estimate
the next one, which cannot be that accurate but is a good starting
point without any impact on the wake-up path of RT tasks.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Steve Muckle <smuckle@linaro.org>
Instead of monitoring the exec time of deadline tasks to evaluate the
CPU capacity consumed by deadline scheduler class, we can directly
calculate it thanks to the sum of utilization of deadline tasks on the
CPU. We can remove deadline tasks from rt_avg metric and directly use
the average bandwidth of deadline scheduler in scale_rt_capacity.
Based in part on a similar patch from Luca Abeni <luca.abeni@unitn.it>.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Steve Muckle <smuckle@linaro.org>
rt_avg is only used to scale the available CPU's capacity for CFS
tasks. As the update of this scaling is done during periodic load
balance, we only have to ensure that sched_avg_update has been called
before any periodic load balancing. This requirement is already
fulfilled by __update_cpu_load so the call in sched_rt_avg_update,
which is part of the hotpath, is useless.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Steve Muckle <smuckle@linaro.org>
Since the true utilization of a long-running task is not detectable
while it is running and might be bigger than the current cpu capacity,
create maximum cpu capacity headroom by requesting the maximum
cpu capacity once the cpu usage plus the capacity margin exceeds the
current capacity. This is also done to try to harm the performance of
a task the least.
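The headroom rule can be sketched as follows (capacity_margin and the
helper name are illustrative assumptions):

/* Ask for the maximum capacity once usage + margin no longer fits. */
static int need_max_capacity(unsigned long cpu_usage,
                             unsigned long capacity_margin,
                             unsigned long current_capacity)
{
        return cpu_usage + capacity_margin > current_capacity;
}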
Original fair-class only version authored by Juri Lelli
<juri.lelli@arm.com>.
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Steve Muckle <smuckle@linaro.org>
As we don't trigger freq changes from {en,de}queue_task_fair() during load
balancing, we need to do so explicitly on load balancing paths.
[smuckle@linaro.org: move update_capacity_of calls so rq lock is held]
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Steve Muckle <smuckle@linaro.org>
Patch "sched/fair: add triggers for OPP change requests" introduced OPP
change triggers for enqueue_task_fair(), but the trigger was operating only
for wakeups. Fact is that it makes sense to consider wakeup_new also (i.e.,
fork()), as we don't know anything about a newly created task and thus we
most certainly want to jump to max OPP to not harm performance too much.
However, it is not currently possible (or at least it wasn't evident to me
how to do so :/) to tell new wakeups from other (non wakeup) operations.
This patch introduces an additional flag in sched.h that is only set at
fork() time and it is then consumed in enqueue_task_fair() for our purpose.
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Steve Muckle <smuckle@linaro.org>
Each time a task is {en,de}queued we might need to adapt the current
frequency to the new usage. Add triggers on {en,de}queue_task_fair() for
this purpose. Only trigger a freq request if we are effectively waking up
or going to sleep. Filter out load balancing related calls to reduce the
number of triggers.
[smuckle@linaro.org: resolve merge conflicts, define task_new,
use renamed static key sched_freq]
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Steve Muckle <smuckle@linaro.org>
Scheduler-driven CPU frequency selection hopes to exploit both
per-task and global information in the scheduler to improve frequency
selection policy, achieving lower power consumption, improved
responsiveness/performance, and less reliance on heuristics and
tunables. For further discussion on the motivation of this integration
see [0].
This patch implements a shim layer between the Linux scheduler and the
cpufreq subsystem. The interface accepts capacity requests from the
CFS, RT and deadline sched classes. The requests from each sched class
are summed on each CPU with a margin applied to the CFS and RT
capacity requests to provide some headroom. Deadline requests are
expected to be precise enough given their nature to not require
headroom. The maximum total capacity request for a CPU in a frequency
domain drives the requested frequency for that domain.
Policy is determined by both the sched classes and this shim layer.
Note that this algorithm is event-driven. There is no polling loop to
check cpu idle time nor any other method which is unsynchronized with
the scheduler, aside from a throttling mechanism to ensure frequency
changes are not attempted faster than the hardware can accommodate them.
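A minimal sketch of how the per-class requests might be combined for one
CPU (the 25% headroom and the names are assumptions, not the exact code):

static unsigned long cpu_capacity_request(unsigned long cfs_req,
                                          unsigned long rt_req,
                                          unsigned long dl_req,
                                          unsigned long max_capacity)
{
        unsigned long total;

        /* CFS and RT get headroom; deadline requests are taken as precise */
        total = cfs_req + (cfs_req >> 2);
        total += rt_req + (rt_req >> 2);
        total += dl_req;

        return total < max_capacity ? total : max_capacity;
}

The maximum such request across the CPUs of a frequency domain then
drives the frequency requested for that domain.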
Thanks to Juri Lelli <juri.lelli@arm.com> for contributing design ideas,
code and test results, and to Ricky Liang <jcliang@chromium.org>
for initialization and static key inc/dec fixes.
[0] http://article.gmane.org/gmane.linux.kernel/1499836
[smuckle@linaro.org: various additions and fixes, revised commit text]
CC: Ricky Liang <jcliang@chromium.org>
Signed-off-by: Michael Turquette <mturquette@baylibre.com>
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Steve Muckle <smuckle@linaro.org>