Commit graph

563294 commits

Author SHA1 Message Date
Sami Tolvanen
249d2baf9b ANDROID: dm verity fec: limit error correction recursion
If the verity tree itself is sufficiently corrupted in addition to data
blocks, it's possible for error correction to end up in a deep recursive
error correction loop that eventually causes a kernel panic as follows:

[   14.728962] [<ffffffc0008c1a14>] verity_fec_decode+0xa8/0x138
[   14.734691] [<ffffffc0008c3ee0>] verity_verify_level+0x11c/0x180
[   14.740681] [<ffffffc0008c482c>] verity_hash_for_block+0x88/0xe0
[   14.746671] [<ffffffc0008c1508>] fec_decode_rsb+0x318/0x75c
[   14.752226] [<ffffffc0008c1a14>] verity_fec_decode+0xa8/0x138
[   14.757956] [<ffffffc0008c3ee0>] verity_verify_level+0x11c/0x180
[   14.763944] [<ffffffc0008c482c>] verity_hash_for_block+0x88/0xe0

This change limits the recursion to a reasonable level during a single
I/O operation.

Bug: 28943429
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Change-Id: I0a7ebff331d259c59a5e03c81918cc1613c3a766
(cherry picked from commit f4b9e40597e73942d2286a73463c55f26f61bfa7)
2016-06-06 14:16:11 -07:00
Jeff Vander Stoep
0ac94b48c0 ANDROID: restrict access to perf events
Add:
CONFIG_SECURITY_PERF_EVENTS_RESTRICT=y

to android-base.cfg

The kernel.perf_event_paranoid sysctl is set to 3 by default.
No unprivileged use of the perf_event_open syscall will be
permitted unless it is changed.

Bug: 29054680
Change-Id: Ie7512259150e146d8e382dc64d40e8faaa438917
2016-06-01 14:21:04 -07:00
Jeff Vander Stoep
9d5f5d9346 FROMLIST: security,perf: Allow further restriction of perf_event_open
When kernel.perf_event_paranoid is set to 3 (or greater), disallow all
access to performance events by users without CAP_SYS_ADMIN.
Add a Kconfig symbol CONFIG_SECURITY_PERF_EVENTS_RESTRICT that
makes this value the default.

This is based on a similar feature in grsecurity
(CONFIG_GRKERNSEC_PERF_HARDEN).  This version doesn't include making
the variable read-only.  It also allows enabling further restriction
at run-time regardless of whether the default is changed.
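
In sketch form, the added gate amounts to the following (the helper name
mirrors the existing perf_paranoid_* predicates; treat naming and
placement as assumptions):

    /* sketch: level-3 check consulted in the perf_event_open() path */
    static inline bool perf_paranoid_any(void)
    {
        return sysctl_perf_event_paranoid > 2;
    }

    /* in the syscall entry path: */
    if (perf_paranoid_any() && !capable(CAP_SYS_ADMIN))
        return -EACCES;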

https://lkml.org/lkml/2016/1/11/587

Signed-off-by: Ben Hutchings <ben@decadent.org.uk>

Bug: 29054680
Change-Id: Iff5bff4fc1042e85866df9faa01bce8d04335ab8
2016-05-31 22:22:16 -07:00
Ben Hutchings
ebac2a3dcd BACKPORT: perf tools: Document the perf sysctls
perf_event_paranoid was only documented in source code and a perf error
message.  Copy the documentation from the error message to
Documentation/sysctl/kernel.txt.

perf_cpu_time_max_percent was already documented but missing from the
list at the top, so add it there.

Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-doc@vger.kernel.org
Link: http://lkml.kernel.org/r/20160119213515.GG2637@decadent.org.uk
[ Remove reference to external Documentation file, provide info inline, as before ]
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>

Bug: 29054680
Change-Id: I13e73cfb2ad761c94762d0c8196df7725abdf5c5
2016-05-31 22:22:01 -07:00
Ard Biesheuvel
d8853fd2e9 UPSTREAM: arm64: module: avoid undefined shift behavior in reloc_data()
Compilers may engage the improbability drive when encountering shifts
by a distance that is a multiple of the size of the operand type. Since
the required bounds check is very simple here, we can get rid of all the
fuzzy masking, shifting and comparing, and use the documented bounds
directly.
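
The resulting check looks roughly like this sketch, following
reloc_data()'s locals ('sval' is the relocated value, 'len' the width
of the field being patched; treat the details as assumptions):

    switch (len) {
    case 16:
        *(s16 *)place = sval;
        /* both signed and unsigned 16-bit values must be representable */
        if (sval < S16_MIN || sval > U16_MAX)
            return -ERANGE;
        break;
    case 32:
        *(s32 *)place = sval;
        if (sval < S32_MIN || sval > U32_MAX)
            return -ERANGE;
        break;
    case 64:
        *(s64 *)place = sval;
        break;
    }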

Change-Id: Ibc1b73f4a630bc182deb6edfa7458b5e29ba9577
Reported-by: David Binderman <dcb314@hotmail.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
2016-05-31 13:26:37 -07:00
Ard Biesheuvel
ef55b4532b UPSTREAM: arm64: module: fix relocation of movz instruction with negative immediate
The test whether a movz instruction with a signed immediate should be
turned into a movn instruction (i.e., when the immediate is negative)
is flawed, since the value of imm is always positive. Also, the
subsequent bounds check is incorrect since the limit update never
executes, due to the fact that the imm_type comparison will always be
false for negative signed immediates.

Let's fix this by performing the sign test on sval directly, and
replacing the bounds check with a simple comparison against U16_MAX.
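
Sketched against the module loader's reloc_insn_movw() locals (names
assumed), the fix looks like:

    sval = do_reloc(op, place, val);
    imm = sval >> lsb;

    if (imm_type == AARCH64_INSN_IMM_MOVNZ) {
        insn &= ~(3 << 29);        /* clear the opcode field */
        if (sval >= 0)
            insn |= 2 << 29;       /* non-negative: MOVZ (opcode 10b) */
        else
            imm = ~imm;            /* negative: MOVN, invert the immediate */
    }

    insn = aarch64_insn_encode_immediate(AARCH64_INSN_IMM_16, insn, imm);

    if (imm > U16_MAX)             /* simple bounds check against U16_MAX */
        return -ERANGE;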

Change-Id: I9ad3d8bfd91e5fdc6434b1be6c3062dfec193176
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
[will: tidied up use of sval, renamed MOVK enum value to MOVKZ]
Signed-off-by: Will Deacon <will.deacon@arm.com>
2016-05-31 13:25:56 -07:00
Amit Pundir
fd2e341b47 Revert "armv6 dcc tty driver"
This reverts commit 97312429c2.

Drop AOSP's "armv6 dcc tty driver" in favor of upstream DCC driver for
ARMv6/v7 16c63f8ea4 (drivers: char: hvc: add arm JTAG DCC console
support) and for ARMv8 4cad4c57e0 (ARM64: TTY: hvc_dcc: Add support
for ARM64 dcc).

Change-Id: I0ca651ef2d854fff03cee070524fe1e3971b6d8f
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
2016-05-26 23:53:27 +05:30
Amit Pundir
dbe3b633a7 Revert "arm: dcc_tty: fix armv6 dcc tty build failure"
This reverts commit dfc1d4be88.

Drop AOSP's "armv6 dcc tty driver" in favor of upstream DCC driver for
ARMv6/v7 16c63f8ea4 (drivers: char: hvc: add arm JTAG DCC console
support) and for ARMv8 4cad4c57e0 (ARM64: TTY: hvc_dcc: Add support
for ARM64 dcc).

Change-Id: I8110a4fd649b8ac1ec9bfac00255c1214135e4b2
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
2016-05-26 23:52:35 +05:30
Dmitry Shmidt
9f6f7a2724 ARM64: Ignore Image-dtb from git point of view
Change-Id: I5bbf1db90f28ea956383b4a5d91ad508eea656dc
Signed-off-by: Dmitry Shmidt <dimitrysh@google.com>
2016-05-24 14:42:55 -07:00
Haojian Zhuang
c11dd134ea arm64: add option to build Image-dtb
Some bootloaders couldn't decompress Image.gz-dtb.

Change-Id: I698cd0c4ee6894e8d0655d88f3ecf4826c28a645
Signed-off-by: Haojian Zhuang <haojian.zhuang@linaro.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Dmitry Shmidt <dimitrysh@google.com>
2016-05-24 14:21:58 -07:00
Winter Wang
cba8d1f729 ANDROID: usb: gadget: f_midi: set fi->f to NULL when free f_midi function
fi->f is set in f_midi's alloc_func and needs to be reset to NULL in
free_func; otherwise, on a ConfigFS function switch, the midi
usb_function itself is freed, fi->f becomes a wild pointer, and we run
into the kernel panic below:
---------------
[   58.950628] Unable to handle kernel paging request at virtual address 63697664
[   58.957869] pgd = c0004000
[   58.960583] [63697664] *pgd=00000000
[   58.964185] Internal error: Oops: 80000005 [#1] PREEMPT SMP ARM
[   58.970111] Modules linked in:
[   58.973191] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.1.15-03504-g34c857c-dirty #89
[   58.981024] Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree)
[   58.987557] task: c110bd70 ti: c1100000 task.ti: c1100000
[   58.992962] PC is at 0x63697664
[   58.996120] LR is at android_setup+0x78/0x138
<..snip..>
[   60.044980] 1fc0: ffffffff ffffffff c1000684 00000000 00000000 c108ecd0 c11f7294 c11039c0
[   60.053181] 1fe0: c108eccc c110d148 1000406a 412fc09a 00000000 1000807c 00000000 00000000
[   60.061420] [<c073b1fc>] (android_setup) from [<c0730490>] (udc_irq+0x758/0x1034)
[   60.068951] [<c0730490>] (udc_irq) from [<c017c650>] (handle_irq_event_percpu+0x50/0x254)
[   60.077165] [<c017c650>] (handle_irq_event_percpu) from [<c017c890>] (handle_irq_event+0x3c/0x5c)
[   60.086072] [<c017c890>] (handle_irq_event) from [<c017f3ec>] (handle_fasteoi_irq+0xe0/0x198)
[   60.094630] [<c017f3ec>] (handle_fasteoi_irq) from [<c017bcfc>] (generic_handle_irq+0x2c/0x3c)
[   60.103271] [<c017bcfc>] (generic_handle_irq) from [<c017bfb8>] (__handle_domain_irq+0x7c/0xec)
[   60.112000] [<c017bfb8>] (__handle_domain_irq) from [<c0101450>] (gic_handle_irq+0x24/0x5c)
--------------
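
A minimal sketch of the corresponding fix in the free path (the usual
f_midi naming is assumed; locking is omitted):

    static void f_midi_free(struct usb_function *f)
    {
        struct f_midi *midi = func_to_midi(f);
        struct f_midi_opts *opts;

        opts = container_of(f->fi, struct f_midi_opts, func_inst);
        kfree(midi);
        opts->func_inst.f = NULL;   /* clear the stale fi->f pointer */
    }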

Signed-off-by: Winter Wang <wente.wang@nxp.com>
2016-05-23 15:53:44 -07:00
Jeff Mahoney
8f2aea82ca UPSTREAM: mac80211: fix "warning: ‘target_metric’ may be used uninitialized"
(This cherry-picks b4201cc4fc6e1c57d6d306b1f787865043d60129 upstream)

This fixes:

net/mac80211/mesh_hwmp.c:603:26: warning: ‘target_metric’ may be used uninitialized in this function

target_metric is only consumed when reply = true so no bug exists here,
but not all versions of gcc realize it.  Initialize to 0 to remove the
warning.

Change-Id: I13923fda9d314f48196c29e4354133dfe01f5abd
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
[jstultz: Cherry-picked to android-4.4]
Signed-off-by: John Stultz <john.stultz@linaro.org>
2016-05-17 17:36:19 +00:00
Peter Hurley
1f95b0e9be UPSTREAM: tty: Fix unsafe ldisc reference via ioctl(TIOCGETD)
(cherry pick from commit 5c17c861a357e9458001f021a7afa7aab9937439)

ioctl(TIOCGETD) retrieves the line discipline id directly from the
ldisc because the line discipline id (c_line) in termios is untrustworthy;
userspace may have set termios via ioctl(TCSETS*) without actually
changing the line discipline via ioctl(TIOCSETD).

However, directly accessing the current ldisc via tty->ldisc is
unsafe; the ldisc ptr dereferenced may be stale if the line discipline
is changing via ioctl(TIOCSETD) or hangup.

Wait for the line discipline reference (just like read() or write())
to retrieve the "current" line discipline id.
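
The resulting helper looks roughly like this sketch (modeled on the
upstream change; treat the details as assumptions):

    static int tiocgetd(struct tty_struct *tty, int __user *p)
    {
        struct tty_ldisc *ld;
        int ret;

        ld = tty_ldisc_ref_wait(tty);   /* wait, as read()/write() do */
        ret = put_user(ld->ops->num, p);
        tty_ldisc_deref(ld);
        return ret;
    }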

Cc: <stable@vger.kernel.org>
Signed-off-by: Peter Hurley <peter@hurleysoftware.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Bug: 28409131
Change-Id: I6774bd883a2e48bbe020486c72c42fb410e3f98a
2016-05-17 09:44:29 -07:00
Amit Pundir
9a326f6c08 Revert "drivers: power: use 'current' instead of 'get_current()'"
This reverts commit e1b5d10389.

This patch fixed the AOSP commit ad86cc8ad6 (drivers: power:
Add watchdog timer to catch drivers which lockup during suspend.),
which we dropped in Change-Id Ic72a87432e27844155467817600adc6cf0c2209c,
so this fix is no longer needed. Part of this patch was already
reverted in the above-mentioned Change-Id.

Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
2016-05-17 16:33:47 +05:30
Amit Pundir
287d9d0efb cpufreq: interactive: drop cpufreq_{get,put}_global_kobject func calls
Upstream commit 8eec1020f0 (cpufreq: create cpu/cpufreq at boot time)
makes sure that the cpufreq sysfs entry gets created at boot time, so
there is no longer any need to create/destroy it on demand.

Drop the now-removed cpufreq_{get,put}_global_kobject function calls,
which otherwise result in the following compilation errors:

drivers/cpufreq/cpufreq_interactive.c: In function 'cpufreq_governor_interactive':
drivers/cpufreq/cpufreq_interactive.c:1187:4: error: implicit declaration of function 'cpufreq_get_global_kobject' [-Werror=implicit-function-declaration]
    WARN_ON(cpufreq_get_global_kobject());
    ^
drivers/cpufreq/cpufreq_interactive.c:1197:5: error: implicit declaration of function 'cpufreq_put_global_kobject'[-Werror=implicit-function-declaration]
     cpufreq_put_global_kobject();
     ^

Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
2016-05-16 13:47:55 +05:30
Amit Pundir
d495067ae1 Revert "cpufreq: interactive: build fixes for 4.4"
This reverts commit bc68f6c4ef.

This build fix broke the interactive governor at runtime with duplicate
sysfs entry warnings at boot time. We no longer need to create/destroy
the cpufreq sysfs entry at run time on demand, thanks to upstream commit
8eec1020f0 (cpufreq: create cpu/cpufreq at boot time), which creates it
at boot time. Hence drop this build fix.

Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
2016-05-16 13:43:28 +05:30
John Stultz
cc0063b8eb xt_qtaguid: Fix panic caused by processing non-full socket.
In an issue very similar to 4e461c777e (xt_qtaguid: Fix panic
caused by synack processing), we were seeing panics on occasion
in testing.

In this case, it was the same issue, but caused by a different
call path: the sk returned from qtaguid_find_sk() was not a full
socket, so the sk->sk_socket dereference failed.

This patch adds an extra check to ensure the sk being returned
is a full socket, and returns NULL if it is not.
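
The extra check reduces to a couple of lines, sketched here with the
sk_fullsock() helper that 4.4 provides:

    sk = qtaguid_find_sk(skb, par);
    /* only full sockets have a valid sk->sk_socket to dereference;
     * request/timewait minisockets must be ignored */
    if (sk && !sk_fullsock(sk))
        sk = NULL;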

Reported-by: Milosz Wasilewski <milosz.wasilewski@linaro.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2016-05-12 11:24:10 -07:00
Dmitry Shmidt
239dd54388 fiq_debugger: Add fiq_debugger.disable option
This change allows using the same kernel image with different
console options for the uart and fiq_debugger. If
fiq_debugger.disable is set to 1/y/Y, fiq_debugger will not be
initialized.
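
A sketch of how such a kernel parameter is typically wired in (the
parameter plumbing is real kernel idiom, but names and the probe-path
placement are assumptions):

    static bool fiq_debugger_disable;
    module_param_named(disable, fiq_debugger_disable, bool, 0644);

    static int fiq_debugger_probe(struct platform_device *pdev)
    {
        /* bail out early so no fiq_debugger console is registered */
        if (fiq_debugger_disable) {
            pr_err("serial_debugger: disabled\n");
            return -ENODEV;
        }
        return fiq_debugger_do_probe(pdev);   /* assumed normal path */
    }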

Change-Id: I71fda54f5f863d13b1437b1f909e52dd375d002d
Signed-off-by: Dmitry Shmidt <dimitrysh@google.com>
2016-05-12 10:45:33 -07:00
Janis Danisevskis
92c4fc6f09 UPSTREAM: procfs: fixes pthread cross-thread naming if !PR_DUMPABLE
The PR_DUMPABLE flag causes the pid related paths of the
proc file system to be owned by ROOT. The implementation
of pthread_set/getname_np however needs access to
/proc/<pid>/task/<tid>/comm.
If PR_DUMPABLE is false this implementation is locked out.

This patch installs a special permission function for
the file "comm" that grants read and write access to
all threads of the same group regardless of the ownership
of the inode. For all other threads the function falls back
to the generic inode permission check.
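
The installed permission function behaves roughly like this sketch
(close to the upstream change, but treat it as illustrative):

    static int proc_tid_comm_permission(struct inode *inode, int mask)
    {
        bool is_same_tgroup;
        struct task_struct *task;

        task = get_proc_task(inode);
        if (!task)
            return -ESRCH;
        is_same_tgroup = same_thread_group(current, task);
        put_task_struct(task);

        if (is_same_tgroup && !(mask & MAY_EXEC)) {
            /* /proc/<pid>/task/<tid>/comm is always readable and
             * writable by members of the same thread group */
            return 0;
        }

        return generic_permission(inode, mask);
    }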

Signed-off-by: Janis Danisevskis <jdanis@google.com>
2016-05-11 08:46:18 +00:00
Jimmy Perchet
e9253e8760 FROMLIST: wlcore: Disable filtering in AP role
When a STA interface is configured (set up), the driver installs
a multicast filter. This is normal behavior: whenever an
application subscribes to a multicast address, the filter is
updated. When an Access Point interface is configured, no filter
is installed and the "filter update" path is disabled in the
driver.

The problem happens when you switch an interface from STA type
to AP type: the filter is installed but there is no means to
update it.

Change-Id: Ied22323af831575303abd548574918baa9852dd0
Signed-off-by: Dmitry Shmidt <dimitrysh@google.com>
2016-05-10 15:13:21 -07:00
Lianwei Wang
e3b53389cf Revert "drivers: power: Add watchdog timer to catch drivers which lockup during suspend."
This reverts commit ad86cc8ad6.

Commit 70fea60d88 ("PM / Sleep: Detect device suspend/resume lockup...")
added a suspend/resume watchdog timer to catch the lockup. Let's revert the
duplicate one.

Change-Id: Ic72a87432e27844155467817600adc6cf0c2209c
Signed-off-by: Lianwei Wang <lianwei.wang@gmail.com>
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
2016-05-10 16:16:06 +05:30
Patrick Bellasi
fed95d9ef1 DEBUG: schedtune: add tracepoint for schedtune_tasks_update() values
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
2016-05-10 16:54:43 +08:00
Patrick Bellasi
a09a25c5df DEBUG: schedtune: add tracepoint for CPU boost signal
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
2016-05-10 16:54:42 +08:00
Patrick Bellasi
c8a65d2e9a DEBUG: schedtune: add tracepoint for SchedTune configuration update
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
2016-05-10 16:54:42 +08:00
Dietmar Eggemann
0a0f4aa12f DEBUG: sched: add energy procfs interface
This patch makes the energy data available via procfs. The related files
are placed in a sub-directory named 'energy' inside the
/proc/sys/kernel/sched_domain/cpuX/domainY/groupZ directory for those
cpu/domain/group tuples which have energy information.

The following example depicts the contents of
/proc/sys/kernel/sched_domain/cpu0/domain0/group[01] for a system which
has energy information attached to domain level 0.

├── cpu0
│   ├── domain0
│   │   ├── busy_factor
│   │   ├── busy_idx
│   │   ├── cache_nice_tries
│   │   ├── flags
│   │   ├── forkexec_idx
│   │   ├── group0
│   │   │   └── energy
│   │   │       ├── cap_states
│   │   │       ├── idle_states
│   │   │       ├── nr_cap_states
│   │   │       └── nr_idle_states
│   │   ├── group1
│   │   │   └── energy
│   │   │       ├── cap_states
│   │   │       ├── idle_states
│   │   │       ├── nr_cap_states
│   │   │       └── nr_idle_states
│   │   ├── idle_idx
│   │   ├── imbalance_pct
│   │   ├── max_interval
│   │   ├── max_newidle_lb_cost
│   │   ├── min_interval
│   │   ├── name
│   │   ├── newidle_idx
│   │   └── wake_idx
│   └── domain1
│       ├── busy_factor
│       ├── busy_idx
│       ├── cache_nice_tries
│       ├── flags
│       ├── forkexec_idx
│       ├── idle_idx
│       ├── imbalance_pct
│       ├── max_interval
│       ├── max_newidle_lb_cost
│       ├── min_interval
│       ├── name
│       ├── newidle_idx
│       └── wake_idx

The files 'nr_idle_states' and 'nr_cap_states' contain a scalar value,
whereas 'idle_states' and 'cap_states' contain a vector of power
consumption values per idle state and of (compute capacity, power
consumption) tuples per capacity state, respectively.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
2016-05-10 16:54:42 +08:00
Juri Lelli
dd2460f387 DEBUG: sched,cpufreq: add cpu_capacity change tracepoint
This is useful when we want to compare cpu utilization and
cpu curr capacity side by side.

Signed-off-by: Juri Lelli <juri.lelli@arm.com>
2016-05-10 16:54:42 +08:00
Juri Lelli
e5a25996d3 DEBUG: sched: add tracepoint for CPU load/util signals
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
2016-05-10 16:53:25 +08:00
Juri Lelli
8017fd7418 DEBUG: sched: add tracepoint for task load/util signals
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
2016-05-10 16:53:25 +08:00
Juri Lelli
99ed4e57cb DEBUG: sched: add tracepoint for cpu/freq scale invariance
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
2016-05-10 16:53:25 +08:00
Patrick Bellasi
a2a6dc7508 sched/fair: filter energy_diff() based on energy_payoff value
Once SchedTune support is enabled and the CPU bandwidth demand of a
task is boosted, we can expect increased energy consumption that is
balanced by a corresponding increase in task performance.
However, the current implementation of the energy_diff() function
accepts all and _only_ the schedule candidates which result in a
reduced expected system energy, which works against the boosting
strategy.

This patch links the energy_diff() function with the "energy payoff"
engine provided by SchedTune. The energy variation computed by the
energy_diff() function is now filtered using the SchedTune support to
evaluate the energy payoff for a boosted task.

With this patch, the energy_diff() function reports as an
"acceptable schedule candidate" only a schedule candidate which
corresponds to a positive energy_payoff.
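
Conceptually, the filtering step looks like this sketch
(schedtune_accept_deltas() stands in for the SchedTune hook; treat the
name as an assumption):

    /* inside energy_diff(), once nrg_delta and cap_delta are known */
    if (nrg_delta < 0 && cap_delta > 0)
        return nrg_delta;    /* less energy, more capacity: always accept */

    /* otherwise, accept the candidate only on a positive energy payoff */
    if (schedtune_accept_deltas(nrg_delta, cap_delta, task) < 0)
        return 0;            /* filtered out: no energy benefit reported */

    return nrg_delta;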

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
2016-05-10 16:53:25 +08:00
Patrick Bellasi
637ee3715a sched/tune: add support to compute normalized energy
The current EAS implementation considers only energy variations, while it
completely disregards the impact on performance when selecting
a schedule candidate. Moreover, it also bases its decision
on the "absolute" value of the expected energy variations.

In order to properly define a trade-off strategy between increased energy
consumption and performance benefits, it is necessary to compare energy
variations with performance variations.

Thus, both performance and energy metrics must be expressed in comparable
units. While the performance variations are expressed in terms of capacity
deltas, which are defined in the range [0..SCHED_LOAD_SCALE], the same
scale is not used for energy variations.

This patch introduces the function:
  schedtune_normalize_energy(energy_diff)
which returns a normalized value in the same range of capacity variations,
i.e. [0..SCHED_LOAD_SCALE].

A proper set of energy normalization constants is required to provide
a fast division by a constant during the normalization of the energy_diff.
The value of these constants depends on the specific energy model and
topology of a target device.
Thus, this patch also provides the required support for computing this
set of constants at boot time.
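
In sketch form, the normalization relies on a boot-time reciprocal
constant (names assumed; reciprocal_divide() is from
linux/reciprocal_div.h):

    #include <linux/reciprocal_div.h>

    /* computed at boot from the energy model: reciprocal_value(max_delta) */
    static struct reciprocal_value schedtune_max_nrg_rdiv;

    static int schedtune_normalize_energy(int energy_diff)
    {
        u32 normalized = (u32)abs(energy_diff) << SCHED_LOAD_SHIFT;

        /* scale into [0..SCHED_LOAD_SCALE] with a fast constant division */
        normalized = reciprocal_divide(normalized, schedtune_max_nrg_rdiv);
        return energy_diff < 0 ? -(int)normalized : (int)normalized;
    }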

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
2016-05-10 16:53:25 +08:00
Patrick Bellasi
36967b2412 sched/fair: keep track of energy/capacity variations
The current EAS implementation does not allow "boosting" task
performance, for example by running a task at a higher OPP (or on a more
capable CPU), even if that would require a "reasonable" increase in
energy consumption.  To define how reasonable an energy increase is
with respect to a required boost value, a trade-off between the
expected energy and performance variations must be defined and
computed.
However, the current EAS implementation considers only energy variations
and completely disregards the impact on performance when selecting
a schedule candidate.

This patch extends the eenv energy environment to keep track of both
energy and performance deltas which are implied by the activation of a
schedule candidate.
The performance variation is estimated considering the different
capacities of the CPUs in which the task could be scheduled. The idea is
that while running on a CPU with higher capacity (e.g. higher operating
point) the task could (potentially) complete faster and thus get better
performance.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
2016-05-10 16:53:25 +08:00
Patrick Bellasi
a8f65584fc sched/fair: add boosted task utilization
The task utilization signal, which is derived from PELT signals and
properly scaled to be architecture and frequency invariant, is used by
EAS as an estimation of the task requirements in terms of CPU bandwidth.

When the energy-aware scheduler is in use, this signal affects the CPU
selection. Thus, a convenient and minimally intrusive way to bias that
decision is to boost the task utilization signal each time it is used
to support boosted tasks.

This patch introduces the new function:
  boosted_task_util(task)
which returns a boosted value for the utilization of the specified task.
The margin added to the original utilization is:
  1. computed based on the "boosting strategy" in use
  2. proportional to the boost value defined either by the sysctl
     interface, when global boosting is in use, or by the "taskgroup"
     value, when per-task boosting is enabled.

The boosted signal is used by EAS
  a. transparently, via its integration into the task_fits() function
  b. explicitly, in the energy-aware wakeup path
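
In sketch form (task_util() and the schedtune_task_margin() helper are
assumed names):

    static unsigned long boosted_task_util(struct task_struct *task)
    {
        unsigned long util = task_util(task);          /* PELT-based, scaled */
        long margin = schedtune_task_margin(task);     /* strategy-dependent */

        return util + margin;                          /* boosted utilization */
    }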

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
2016-05-10 16:53:25 +08:00
Patrick Bellasi
a515b887ec sched/{fair,tune}: track RUNNABLE tasks impact on per CPU boost value
When per-task boosting is enabled, every time a task enters/exits a CPU
its boost value could impact the currently selected OPP for that CPU.
Thus, the "aggregated" boost value for that CPU potentially needs to
be updated to match the current maximum boost value among all the tasks
currently RUNNABLE on that CPU.

This patch introduces the required support to keep track of which boost
groups are impacting a CPU. Each time a task is enqueued/dequeued to/from
a CPU its boost group is used to increment a per-cpu counter of RUNNABLE
tasks on that CPU.
Only when the number of runnable tasks for a specific boost group
becomes 1 or 0 does the corresponding boost group change its effect on
that CPU, specifically:
  a) boost_group::tasks == 1: this boost group starts impacting the CPU
  b) boost_group::tasks == 0: this boost group stops impacting the CPU
In each of these two conditions the aggregation function:
  sched_cpu_update(cpu)
could be required to run in order to identify the new maximum boost
value required for the CPU.

The proposed patch minimizes the number of times the aggregation
function is executed while still providing the required support to
always boost a CPU to the maximum boost value required by all its
currently RUNNABLE tasks.
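
The bookkeeping reduces to per-CPU counters along these lines (a
sketch, with names assumed):

    static void schedtune_enqueue_task(struct task_struct *p, int cpu)
    {
        struct boost_groups *bg = &per_cpu(cpu_boost_groups, cpu);
        int idx = task_schedtune(p)->idx;        /* task's boost group */

        /* group goes from 0 to 1 runnable tasks: it now impacts the CPU */
        if (++bg->group[idx].tasks == 1)
            schedtune_cpu_update(cpu);
    }

    static void schedtune_dequeue_task(struct task_struct *p, int cpu)
    {
        struct boost_groups *bg = &per_cpu(cpu_boost_groups, cpu);
        int idx = task_schedtune(p)->idx;

        /* group goes from 1 to 0 runnable tasks: it stops impacting the CPU */
        if (--bg->group[idx].tasks == 0)
            schedtune_cpu_update(cpu);
    }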

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
2016-05-10 16:53:24 +08:00
Patrick Bellasi
9cd53fbeb2 sched/tune: compute and keep track of per CPU boost value
When per-task boosting is enabled, we could have multiple RUNNABLE tasks
which are concurrently scheduled on the same CPU but each one with a
different boost value.
For example, we could have a scenario like this:

  Task   SchedTune CGroup   Boost Value
    T1               root            0
    T2       low-priority           10
    T3        interactive           90

In these conditions we expect a CPU to be configured according to a
proper "aggregation" of the required boost values for all the tasks
currently scheduled on this CPU.

A suitable aggregation function is one which tracks the MAX boost
value of all the tasks RUNNABLE on a CPU. This approach allows us to
always satisfy the most boost-demanding task while at the same time:
 a) boosting all the concurrently scheduled tasks, thus reducing
    potential co-scheduling side-effects on demanding tasks
 b) reducing the number of frequency switches requested of SchedDVFS,
    thus being friendlier to architectures with slow frequency
    switching times

Every time a task enters/exits the RQ of a CPU the max boost value
should be updated considering all the boost groups currently "affecting"
that CPU, i.e. which have at least one RUNNABLE task currently allocated
on that CPU.

This patch introduces the required support to keep track of the boost
groups currently affecting CPUs. Thanks to the limited number of boost
groups, a small and memory-efficient per-cpu array of boost group
values (cpu_boost_groups) is used, which is updated for each CPU entry
by schedtune_boostgroup_update(), but only when a schedtune CGroup
boost value is updated. However, this is expected to be a rare
operation, perhaps done just once at system boot time.
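
The aggregation itself is a max scan over the boost groups that
currently have RUNNABLE tasks, roughly as in this sketch (names
assumed):

    static void schedtune_cpu_update(int cpu)
    {
        struct boost_groups *bg = &per_cpu(cpu_boost_groups, cpu);
        int boost_max = bg->group[0].boost;  /* root group always applies */
        int idx;

        for (idx = 1; idx < BOOSTGROUPS_COUNT; ++idx) {
            /* only boost groups with RUNNABLE tasks affect this CPU */
            if (bg->group[idx].tasks == 0)
                continue;
            boost_max = max(boost_max, bg->group[idx].boost);
        }
        bg->boost_max = boost_max;
    }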

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
2016-05-10 16:53:24 +08:00
Patrick Bellasi
13001f47c9 sched/tune: add initial support for CGroups based boosting
To support task performance boosting, the use of a single knob has the
advantage of being a simple solution, both from the implementation and
the usability standpoint.  However, on a real system it can be difficult
to identify a single value for the knob which fits the needs of multiple
different tasks. For example, some kernel threads and/or user-space
background services are better managed the "standard" way, while we
still want to be able to boost the performance of specific workloads.

In order to improve the flexibility of the task boosting mechanism, this
patch is the first of a small series which extends the previous
implementation to introduce "per task group" support.
This first patch introduces just the basic CGroups support: a new
"schedtune" CGroups controller is added which allows configuring
different boost values for different groups of tasks.
To keep the implementation simple but still effective for a boosting
strategy, the new controller:
  1. allows only a two layer hierarchy
  2. supports only a limited number of boost groups

A two-layer hierarchy allows each task to be placed either:
  a) in the root control group
     thus being subject to a system-wide boosting value
  b) in a child of the root group
     thus being subject to the specific boost value defined by that
     "boost group"

The limited number of supported "boost groups" is mainly motivated by
the observation that in a real system only a few classes of tasks
deserve different treatment.
For example, background vs foreground or interactive vs low-priority.
As an additional benefit, a limited number of boost groups also allows
a simpler implementation, especially for the code required to
compute the boost value for CPUs which have runnable tasks belonging to
different boost groups.
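
The per-group controller state can be pictured as a small struct like
this sketch (field names assumed):

    struct schedtune {
        struct cgroup_subsys_state css;   /* base CGroups controller state */
        int idx;                          /* index into the boost-group array */
        int boost;                        /* boost value for tasks in this group */
    };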

cc: Tejun Heo <tj@kernel.org>
cc: Li Zefan <lizefan@huawei.com>
cc: Johannes Weiner <hannes@cmpxchg.org>
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
2016-05-10 16:53:24 +08:00
Patrick Bellasi
07e22949de sched/fair: add boosted CPU usage
The CPU usage signal is used by the scheduler as an estimation of the
overall bandwidth currently allocated on a CPU. When SchedDVFS is in
use, this signal affects the selection of the operating points (OPP)
required to accommodate all the workload allocated in a CPU.
A convenient and minimally intrusive way to boost the performance of
tasks running on a CPU is to boost the CPU usage signal each time it
is used to select an OPP.

This patch introduces a new function:
  get_boosted_cpu_usage(cpu)
to return a boosted value for the usage of a specified CPU.
The margin added to the original usage is:
  1. computed based on the "boosting strategy" in use
  2. proportional to the system-wide boost value defined via the
     provided user-space interface

The boosted signal is used by SchedDVFS (transparently) each time it
needs an estimate of the capacity required by a CPU.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
2016-05-10 16:53:24 +08:00
Patrick Bellasi
6ed2714501 sched/fair: add function to convert boost value into "margin"
The basic idea of the boost knob is to "artificially inflate" a signal
to make a task or logical CPU appear more demanding than it actually
is. Independently of the specific signal, a consistent and reasonably
simple semantic for the concept of "signal boosting" must define:
1. how we translate the boost percentage into a "margin" value to be
   added to the original signal to inflate it
2. what a boost value means from a user-space perspective

This patch provides the implementation of a possible boost semantic,
named "Signal Proportional Compensation" (SPC), where the boost
percentage (BP) is used to compute a margin (M) which is proportional to
the complement of the original signal (OS):
  M = BP * (SCHED_LOAD_SCALE - OS)
The computed margin is then added to the OS to obtain the Boosted Signal (BS)
  BS = OS + M

The proposed boost semantic has these main features:
- each signal gets a boost which is proportional to its delta with respect
  to the maximum available capacity in the system (i.e. SCHED_LOAD_SCALE)
- a 100% boost has a clear meaning from a user-space perspective,
  since it simply means running (possibly) "all" tasks at the max OPP
- each boost value improves task performance by a quantity which is
  proportional to the maximum achievable performance on that system
Thus this semantic in effect forces a behaviour whereby:

  a 50% boost means running half-way between the current and the
    maximum performance which a task could achieve on that system

This patch provides the code to implement a fast integer division to
convert a boost percentage (BP) value into a margin (M).

NOTE: this code is suitable for all signals operating in range
      [0..SCHED_LOAD_SCALE]
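
A sketch of the margin computation, using do_div() for the fast
constant division (function name assumed):

    static unsigned long
    schedtune_margin(unsigned long signal, unsigned long boost_pct)
    {
        unsigned long long margin;

        /* M = BP * (SCHED_LOAD_SCALE - OS), with BP expressed in [0..100] */
        margin  = SCHED_LOAD_SCALE - signal;
        margin *= boost_pct;
        do_div(margin, 100);    /* fast integer division by the constant 100 */

        return margin;
    }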

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
2016-05-10 16:53:24 +08:00
Patrick Bellasi
344f4ec429 sched/tune: add sysctl interface to define a boost value
The current (CFS) scheduler implementation does not allow "boosting"
task performance by running tasks at a higher OPP than the minimum
required to meet their workload demands.

To support task performance boosting, the scheduler should provide a
"knob" to tune how much the system is optimised for energy efficiency
vs performance.

This patch is the first of a series which provides a simple interface to
define a tuning knob. One system-wide "boost" tunable is exposed via:
  /proc/sys/kernel/sched_cfs_boost
which can be configured in the range [0..100], to define a percentage
where:
  - 0%   boost means operating in "standard" mode, scheduling tasks at
         the minimum capacities required by the workload demand
  - 100% boost means pushing task performance to the maximum,
         "regardless" of the incurred energy consumption

A boost value between these two boundaries is used to bias the
power/performance trade-off; the higher the boost value, the more the
scheduler is biased toward performance boosting instead of energy
efficiency.
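
The tunable plugs into the standard sysctl table, roughly as in this
sketch (the handler name is assumed; zero/one_hundred are the usual
kernel/sysctl.c bounds):

    {
        .procname     = "sched_cfs_boost",
        .data         = &sysctl_sched_cfs_boost,
        .maxlen       = sizeof(sysctl_sched_cfs_boost),
        .mode         = 0644,
        .proc_handler = &sched_cfs_boost_handler,
        .extra1       = &zero,            /* clamp writes to [0..100] */
        .extra2       = &one_hundred,
    },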

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
2016-05-10 16:53:24 +08:00
Patrick Bellasi
8aaf022dd7 sched/tune: add detailed documentation
The topic of a single, simple power-performance tunable that is wholly
scheduler-centric and has well-defined and predictable properties has
come up on several occasions in the past. With techniques such as
scheduler-driven DVFS, we now have a good framework for implementing
such a tunable.

This patch provides a detailed description of the motivations and design
decisions behind the implementation of SchedTune.

cc: Jonathan Corbet <corbet@lwn.net>
cc: linux-doc@vger.kernel.org
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
2016-05-10 16:53:24 +08:00
Juri Lelli
08d1cfd7c9 fixup! sched/fair: jump to max OPP when crossing UP threshold
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
2016-05-10 16:53:23 +08:00
Juri Lelli
3eb29105cf fixup! sched: scheduler-driven cpu frequency selection
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
2016-05-10 16:53:23 +08:00
Vincent Guittot
6e1e1ed872 sched: rt scheduler sets capacity requirement
RT tasks don't provide any running constraints, unlike deadline ones,
except their running priority. The only currently usable input for
estimating the capacity needed by RT tasks is the rt_avg metric. We use
it to estimate the CPU capacity needed for the RT scheduler class.

In order to monitor the evolution of the RT task load, we must
periodically check it during the tick.

Then, we use the estimated capacity of the last activity to estimate
the next one; this cannot be very accurate, but it is a good starting
point without any impact on the wake-up path of RT tasks.
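
A sketch of the tick-time update (set_rt_cpu_capacity() and the exact
scaling are assumptions):

    static void sched_rt_update_capacity_req(struct rq *rq)
    {
        u64 total, used;

        sched_avg_update(rq);    /* age rt_avg before reading it */

        total = sched_avg_period() + (rq_clock(rq) - rq->age_stamp);
        used = div64_u64(rq->rt_avg << SCHED_CAPACITY_SHIFT, total);
        if (unlikely(used > SCHED_CAPACITY_SCALE))
            used = SCHED_CAPACITY_SCALE;

        /* request enough capacity to cover the observed RT activity */
        set_rt_cpu_capacity(rq->cpu, true, (unsigned long)used);
    }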

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Steve Muckle <smuckle@linaro.org>
2016-05-10 16:53:23 +08:00
Vincent Guittot
fab5cc59bf sched: deadline: use deadline bandwidth in scale_rt_capacity
Instead of monitoring the exec time of deadline tasks to evaluate the
CPU capacity consumed by deadline scheduler class, we can directly
calculate it thanks to the sum of utilization of deadline tasks on the
CPU.  We can remove deadline tasks from rt_avg metric and directly use
the average bandwidth of deadline scheduler in scale_rt_capacity.

Based in part on a similar patch from Luca Abeni <luca.abeni@unitn.it>.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Steve Muckle <smuckle@linaro.org>
2016-05-10 16:53:23 +08:00
Vincent Guittot
cd248fa758 sched: remove call of sched_avg_update from sched_rt_avg_update
rt_avg is only used to scale the available CPU's capacity for CFS
tasks.  As the update of this scaling is done during periodic load
balance, we only have to ensure that sched_avg_update has been called
before any periodic load balancing. This requirement is already
fulfilled by __update_cpu_load so the call in sched_rt_avg_update,
which is part of the hotpath, is useless.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Steve Muckle <smuckle@linaro.org>
2016-05-10 16:53:23 +08:00
Steve Muckle
9d44dc7fe2 sched/cpufreq_sched: add trace events
Trace events will aid in debugging, profiling and tuning.

Signed-off-by: Steve Muckle <smuckle@linaro.org>
2016-05-10 16:53:23 +08:00
Steve Muckle
6b6c192453 sched/fair: jump to max OPP when crossing UP threshold
Since the true utilization of a long-running task is not detectable
while it is running and might be bigger than the current cpu capacity,
create maximum cpu capacity headroom by requesting the maximum
cpu capacity once the cpu usage plus the capacity margin exceeds the
current capacity. This also tries to harm the performance of a task
as little as possible.
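
The threshold test itself is simple, per this sketch (helper names
assumed):

    /* in the shared enqueue/tick path, sketch only */
    if (cpu_util(cpu) + capacity_margin > capacity_curr_of(cpu)) {
        /* usage is pushing past the current OPP: jump straight to the
         * maximum capacity to create headroom, rather than stepping up */
        set_cfs_cpu_capacity(cpu, true, capacity_max);
    }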

Original fair-class only version authored by Juri Lelli
<juri.lelli@arm.com>.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Steve Muckle <smuckle@linaro.org>
2016-05-10 16:53:23 +08:00
Juri Lelli
f99e3fe6eb sched/fair: cpufreq_sched triggers for load balancing
As we don't trigger freq changes from {en,de}queue_task_fair() during load
balancing, we need to do so explicitly on the load-balancing paths.

[smuckle@linaro.org: move update_capacity_of calls so rq lock is held]

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Steve Muckle <smuckle@linaro.org>
2016-05-10 16:53:23 +08:00
Juri Lelli
7ff814dd71 sched/{core,fair}: trigger OPP change request on fork()
Patch "sched/fair: add triggers for OPP change requests" introduced OPP
change triggers for enqueue_task_fair(), but the trigger was operating only
for wakeups. Fact is that it makes sense to consider wakeup_new also (i.e.,
fork()), as we don't know anything about a newly created task and thus we
most certainly want to jump to max OPP to not harm performance too much.

However, it is not currently possible (or at least it wasn't evident to me
how to do so :/) to tell new wakeups from other (non wakeup) operations.

This patch introduces an additional flag in sched.h that is only set at
fork() time and it is then consumed in enqueue_task_fair() for our purpose.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Steve Muckle <smuckle@linaro.org>
2016-05-10 16:53:22 +08:00
Juri Lelli
ea429ccef1 sched/fair: add triggers for OPP change requests
Each time a task is {en,de}queued we might need to adapt the current
frequency to the new usage. Add triggers on {en,de}queue_task_fair() for
this purpose.  Only trigger a freq request if we are effectively waking up
or going to sleep.  Filter out load balancing related calls to reduce the
number of triggers.

[smuckle@linaro.org: resolve merge conflicts, define task_new,
 use renamed static key sched_freq]

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Steve Muckle <smuckle@linaro.org>
2016-05-10 16:53:22 +08:00