The current EAS implementation considers only energy variations, while it
disregards completely the impact on performance for the selection of
a certain schedule candidate. Moreover, it also makes its decision based
on the "absolute" value of expected energy variations.
In order to properly define a trade-off strategy between increased energy
consumption and performances benefits it is required to compare energy
variations with performance variations.
Thus, both performance and energy metrics must be expressed in comparable
units. While the performance variations are expressed in terms of capacity
deltas, which are defined in the range [0..SCHED_LOAD_SCALE], the same
scale is not used for energy variations.
This patch introduces the function:
schedtune_normalize_energy(energy_diff)
which returns a normalized value in the same range of capacity variations,
i.e. [0..SCHED_LOAD_SCALE].
A proper set of energy normalization constants are required to provide
a fast division by a constant during the normalziation of the energy_diff.
The value of these constants depends on the specific energy model and
topology of a target device.
Thus, this patch provides also the required support for the computation
at boot time of this set of variables.
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
The task utilization signal, which is derived from PELT signals and
properly scaled to be architecture and frequency invariant, is used by
EAS as an estimation of the task requirements in terms of CPU bandwidth.
When the energy aware scheduler is in use, this signal affects the CPU
selection. Thus, a convenient way to bias that decision, which is also
little intrusive, is to boost the task utilization signal each time it
is required to support them.
This patch introduces the new function:
boosted_task_util(task)
which returns a boosted value for the utilization of the specified task.
The margin added to the original utilization is:
1. computed based on the "boosting strategy" in use
2. proportional to boost value defined either by the sysctl interface,
when global boosting is in use, or the "taskgroup" value, when
per-task boosting is enabled.
The boosted signal is used by EAS
a. transparently, via its integration into the task_fits() function
b. explicitly, in the energy-aware wakeup path
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
When per-task boosting is enabled, every time a task enters/exits a CPU
its boost value could impact the currently selected OPP for that CPU.
Thus, the "aggregated" boost value for that CPU potentially needs to
be updated to match the current maximum boost value among all the tasks
currently RUNNABLE on that CPU.
This patch introduces the required support to keep track of which boost
groups are impacting a CPU. Each time a task is enqueued/dequeued to/from
a CPU its boost group is used to increment a per-cpu counter of RUNNABLE
tasks on that CPU.
Only when the number of runnable tasks for a specific boost group
becomes 1 or 0 the corresponding boost group changes its effects on
that CPU, specifically:
a) boost_group::tasks == 1: this boost group starts to impact the CPU
b) boost_group::tasks == 0: this boost group stops to impact the CPU
In each of these two conditions the aggregation function:
sched_cpu_update(cpu)
could be required to run in order to identify the new maximum boost
value required for the CPU.
The proposed patch minimizes the number of times the aggregation
function is executed while still providing the required support to
always boost a CPU to the maximum boost value required by all its
currently RUNNABLE tasks.
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
When per task boosting is enabled, we could have multiple RUNNABLE tasks
which are concurrently scheduled on the same CPU but each one with a
different boost value.
For example, we could have a scenarios like this:
Task SchedTune CGroup Boost Value
T1 root 0
T2 low-priority 10
T3 interactive 90
In these conditions we expect a CPU to be configured according to a
proper "aggregation" of the required boost values for all the tasks
currently scheduled on this CPU.
A suitable aggregation function is the one which tracks the MAX boost
value for all the tasks RUNNABLE on a CPU. This approach allows to
always satisfy the most boost demanding task while at the same time:
a) boosting all the concurrently scheduled tasks thus reducing
potential co-scheduling side-effects on demanding tasks
b) reduce the number of frequency switch requested towards SchedDVFS,
thus being more friendly to architectures with slow frequency
switching times
Every time a task enters/exits the RQ of a CPU the max boost value
should be updated considering all the boost groups currently "affecting"
that CPU, i.e. which have at least one RUNNABLE task currently allocated
on that CPU.
This patch introduces the required support to keep track of the boost
groups currently affecting CPUs. Thanks to the limited number of boost
groups, a small and memory efficient per-cpu array of boost groups
values (cpu_boost_groups) is used which is updated for each CPU entry by
schedtune_boostgroup_update() but only when a schedtune CGroup boost
value is updated. However, this is expected to be a rare operation,
perhaps done just one time at system boot time.
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
To support task performance boosting, the usage of a single knob has the
advantage to be a simple solution, both from the implementation and the
usability standpoint. However, on a real system it can be difficult to
identify a single value for the knob which fits the needs of multiple
different tasks. For example, some kernel threads and/or user-space
background services should be better managed the "standard" way while we
still want to be able to boost the performance of specific workloads.
In order to improve the flexibility of the task boosting mechanism this
patch is the first of a small series which extends the previous
implementation to introduce a "per task group" support.
This first patch introduces just the basic CGroups support, a new
"schedtune" CGroups controller is added which allows to configure
different boost value for different groups of tasks.
To keep the implementation simple but still effective for a boosting
strategy, the new controller:
1. allows only a two layer hierarchy
2. supports only a limited number of boost groups
A two layer hierarchy allows to place each task either:
a) in the root control group
thus being subject to a system-wide boosting value
b) in a child of the root group
thus being subject to the specific boost value defined by that
"boost group"
The limited number of "boost groups" supported is mainly motivated by
the observation that in a real system it could be useful to have only
few classes of tasks which deserve different treatment.
For example, background vs foreground or interactive vs low-priority.
As an additional benefit, a limited number of boost groups allows also
to have a simpler implementation especially for the code required to
compute the boost value for CPUs which have runnable tasks belonging to
different boost groups.
cc: Tejun Heo <tj@kernel.org>
cc: Li Zefan <lizefan@huawei.com>
cc: Johannes Weiner <hannes@cmpxchg.org>
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
The current (CFS) scheduler implementation does not allow "to boost"
tasks performance by running them at a higher OPP compared to the
minimum required to meet their workload demands.
To support tasks performance boosting the scheduler should provide a
"knob" which allows to tune how much the system is going to be optimised
for energy efficiency vs performance.
This patch is the first of a series which provides a simple interface to
define a tuning knob. One system-wide "boost" tunable is exposed via:
/proc/sys/kernel/sched_cfs_boost
which can be configured in the range [0..100], to define a percentage
where:
- 0% boost requires to operate in "standard" mode by scheduling
tasks at the minimum capacities required by the workload demand
- 100% boost requires to push at maximum the task performances,
"regardless" of the incurred energy consumption
A boost value in between these two boundaries is used to bias the
power/performance trade-off, the higher the boost value the more the
scheduler is biased toward performance boosting instead of energy
efficiency.
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>