Commit graph

1449 commits

Author SHA1 Message Date
Joonwoo Park
07eb3f803b sched: select task's prev_cpu as the best CPU when it was chosen recently
Select a task's prev_cpu when the task has slept for only a short period,
to reduce the latency of task placement and migration.  A new tunable,
/proc/sys/kernel/sched_select_prev_cpu_us, is introduced to determine
whether tasks are eligible to go through this fast path.
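
A minimal sketch of the fast-path decision. It assumes the task records
the time it last went to sleep and that the threshold is taken from
sched_select_prev_cpu_us. The names struct task_hint, select_cpu_fast()
and the slow_path callback are hypothetical and are not the patch's code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-task bookkeeping (not struct task_struct). */
struct task_hint {
	int prev_cpu;            /* CPU the task last ran on */
	uint64_t sleep_start_ns; /* when the task last went to sleep */
};

/* Stand-in for the full, more expensive placement search. */
typedef int (*slow_path_fn)(const struct task_hint *t);

/*
 * Return prev_cpu if the task slept for less than select_prev_cpu_us
 * (modelled on sched_select_prev_cpu_us); otherwise run the slow path.
 */
static int select_cpu_fast(const struct task_hint *t, uint64_t now_ns,
			   uint64_t select_prev_cpu_us, slow_path_fn slow_path)
{
	assert(now_ns >= t->sleep_start_ns);
	uint64_t slept_ns = now_ns - t->sleep_start_ns;

	if (slept_ns < select_prev_cpu_us * 1000ULL)
		return t->prev_cpu;      /* fast path: skip the search */
	return slow_path(t);
}

/* Example slow path that always picks CPU 0 (illustration only). */
static int always_cpu0(const struct task_hint *t)
{
	(void)t;
	return 0;
}
```

Setting the threshold to 0 disables the fast path in this sketch, since
no sleep duration is less than zero.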

CRs-fixed: 947467
Change-Id: Ia507665b91f4e9f0e6ee1448d8df8994ead9739a
[joonwoop@codeaurora.org: fixed conflict in include/linux/sched.h,
 include/linux/sched/sysctl.h, kernel/sched/core.c and kernel/sysctl.c]
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2016-03-23 20:02:43 -07:00
Joonwoo Park
0498f793e8 sched: use ktime instead of sched_clock for load tracking
At present, the HMP scheduler uses sched_clock to set up the window
boundary so that it is aligned with the timer interrupt, ensuring that
the timer interrupt fires after window rollover.  However, this alignment
does not last long, since the timer interrupt re-arms the next timer based
on time measured by ktime, which is not coupled with sched_clock.

Convert sched_clock to ktime to avoid a wallclock discrepancy between the
scheduler and the timer, so that the scheduler's window boundary is always
aligned with the timer.

CRs-fixed: 933330
Change-Id: I4108819a4382f725b3ce6075eb46aab0cf670b7e
[joonwoop@codeaurora.org: fixed minor conflict in include/linux/tick.h
 and kernel/sched/core.c.  omitted fixes for kernel/sched/qhmp_core.c]
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2016-03-23 20:02:41 -07:00
Syed Rameez Mustafa
c00814c023 sched: Notify cpufreq governor early about potential big tasks
Tasks that are on the runqueue continuously for a certain amount of time
have the potential to be big tasks at the end of the window in which they
are runnable. In such scenarios, ramping up the CPU frequency early can
boost performance, rather than waiting until the end of the window for
the governor to query load. Notify the governor early at every tick when
a task has been observed to execute beyond some percentage of the tick
period.

The threshold beyond which a task is eligible for early detection can be
changed via the tunable sched_early_detection_duration. The feature itself
is enabled only when scheduler boost is in effect.
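
A minimal sketch of the per-tick check. It assumes the scheduler records
when the current task became continuously runnable and that the
threshold (the role played by sched_early_detection_duration) is in
nanoseconds; the real tunable's units may differ, and
early_detection_needed() is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Called once per tick for the running task. Returns true when the
 * governor should be notified early: boost must be active and the task
 * must have been runnable continuously for at least threshold_ns.
 */
static bool early_detection_needed(bool boost_active,
				   uint64_t runnable_since_ns,
				   uint64_t now_ns, uint64_t threshold_ns)
{
	if (!boost_active)
		return false;        /* feature is only on under boost */
	if (now_ns < runnable_since_ns)
		return false;        /* guard against clock skew */
	return now_ns - runnable_since_ns >= threshold_ns;
}
```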

Change-Id: I528b72bbc79a55b4593d1b8ab45450411c6d70f3
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
[joonwoop@codeaurora.org: fixed conflict in scheduler_tick() in
 kernel/sched/core.c.  fixed minor conflicts in include/linux/sched.h,
 include/linux/sched/sysctl.h and kernel/sysctl.c due to
 CONFIG_SCHED_QHMP.]
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2016-03-23 20:02:34 -07:00
Joonwoo Park
446beddcd4 sched: account new task load so that governor can apply different policy
Account the amount of load contributed by new tasks within CPU load so
that the governor can apply a different policy when the CPU is loaded by
new tasks.

To be able to distinguish new task load, a new tunable,
sched_new_task_windows, is also introduced.  The tunable defines tasks as
new when they have been active for fewer than the configured number of
windows.
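
A sketch of how new-task load could be reported separately from total
CPU load, assuming each task counts the windows in which it has been
active. struct hypo_task, struct cpu_load, is_new_task() and
account_window() are illustrative names only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct hypo_task {
	unsigned int active_windows; /* windows in which the task ran */
	uint64_t window_load;        /* load contributed this window */
};

struct cpu_load {
	uint64_t total;     /* load from all tasks */
	uint64_t new_tasks; /* subset contributed by "new" tasks */
};

/* A task is new until it has been active for new_task_windows windows. */
static int is_new_task(const struct hypo_task *t,
		       unsigned int new_task_windows)
{
	return t->active_windows < new_task_windows;
}

/* Sum one window's load, tracking the new-task share separately. */
static struct cpu_load account_window(const struct hypo_task *tasks,
				      size_t n, unsigned int new_task_windows)
{
	struct cpu_load l = { 0, 0 };

	for (size_t i = 0; i < n; i++) {
		l.total += tasks[i].window_load;
		if (is_new_task(&tasks[i], new_task_windows))
			l.new_tasks += tasks[i].window_load;
	}
	return l;
}
```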

Change-Id: I2e2e62e4103882f7362154b792ab978b181b9f59
Suggested-by: Saravana Kannan <skannan@codeaurora.org>
[joonwoop@codeaurora.org: omitted changes for
 drivers/cpufreq/cpufreq_interactive.c.  cpufreq changes need to be
 applied separately later.  fixed conflict in include/linux/sched.h and
 include/linux/sched/sysctl.h.  omitted changes for qhmp_core.c]
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2016-03-23 20:02:29 -07:00
Olav Haugan
03a683a55c sched: Add tunables for static cpu and cluster cost
Add a per-cpu tunable to set the extra cost of using a CPU that is idle.
Add the same for a cluster.

Change-Id: I4aa53f3c42c963df7abc7480980f747f0413d389
Signed-off-by: Olav Haugan <ohaugan@codeaurora.org>
[joonwoop@codeaurora.org: omitted changes for qhmp*.[c,h].  stripped out
 CONFIG_SCHED_QHMP in drivers/base/cpu.c and include/linux/sched.h]
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2016-03-23 20:02:27 -07:00
Olav Haugan
4996dafe68 sched/core: Add API to set cluster d-state
Add a new API to the scheduler to allow the low power mode driver to
inform the scheduler about the d-state of a cluster. The scheduler can
leverage this to make an informed decision about the cost of placing a
task on a cluster.

Change-Id: If0fe0fdba7acad1c2eb73654ebccfdb421225e62
Signed-off-by: Olav Haugan <ohaugan@codeaurora.org>
[joonwoop@codeaurora.org: omitted fixes for qhmp_core.c and qhmp_core.h]
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2016-03-23 20:02:26 -07:00
Joonwoo Park
b4627e0104 sched: take into account of governor's frequency max load
At present, the HMP scheduler packs tasks onto a busy CPU until the CPU's
load reaches 100%, to avoid waking up idle CPUs as much as possible.  Such
aggressive packing leads to unintended CPU frequency increases, as the
governor raises the busy CPU's frequency when its load exceeds the
configured frequency max load, which can be less than 100%.

Fix this by taking the governor's frequency max load into account and
packing tasks only when the CPU's projected load is less than the max
load, avoiding unnecessary frequency increases.
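
A sketch of the packing check, assuming loads are expressed as
percentages of the CPU's capacity and that the governor's frequency max
load is known (for example 90%). can_pack() is a hypothetical helper.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Pack a task onto a busy CPU only if the projected load stays below
 * the governor's frequency max load; at or above it, the governor
 * would raise the CPU's frequency.
 */
static bool can_pack(unsigned int cpu_load_pct, unsigned int task_load_pct,
		     unsigned int freq_max_load_pct)
{
	unsigned int projected = cpu_load_pct + task_load_pct;

	return projected < freq_max_load_pct;
}
```

With a max load of 100 this reduces to the old behaviour of packing
until the CPU is fully loaded.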

Change-Id: I4447e5e0c2fa5214ae7a9128f04fd7585ed0dcac
[joonwoop@codeaurora.org: fixed minor conflict in kernel/sched/sched.h]
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2016-03-23 20:02:25 -07:00
Syed Rameez Mustafa
87fe20de7e sched: Update the wakeup placement logic for fair and rt tasks
For the fair sched class, update the select_best_cpu() policy to do
power-based placement. The hope is to minimize the voltage at which
the CPU runs.

While RT tasks already do power-based placement, their placement
preference now has to take into account the power cost of all tasks
on a given CPU. Also remove the check for sched_boost, since
sched_boost no longer intends to elevate all tasks to the highest
capacity cluster.
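
A minimal sketch of power-based selection. It assumes each candidate CPU
already has a power cost for running the task (including the load
already on it) and a flag saying whether the task fits there.
struct cpu_candidate and select_lowest_power_cpu() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct cpu_candidate {
	int cpu;
	unsigned int power_cost; /* cost of running the task on this CPU */
	int fits;                /* non-zero if spare capacity suffices */
};

/* Return the fitting CPU with the lowest power cost, or -1 if none fits. */
static int select_lowest_power_cpu(const struct cpu_candidate *c, size_t n)
{
	int best = -1;
	unsigned int best_cost = 0;

	for (size_t i = 0; i < n; i++) {
		if (!c[i].fits)
			continue;
		if (best < 0 || c[i].power_cost < best_cost) {
			best = c[i].cpu;
			best_cost = c[i].power_cost;
		}
	}
	return best;
}
```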

Change-Id: Ic6a7625c97d567254d93b94cec3174a91727cb87
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
2016-03-23 20:02:20 -07:00
Syed Rameez Mustafa
f2ea07a155 sched: Rework energy aware scheduling
Energy aware core rotation is not compatible with the power
based task placement being introduced in subsequent patches.
Remove all existing EA based task placement/migration logic.
power_cost() is the only function remaining. This function has
been modified to return the total power cost associated with a
task on a given CPU taking existing load on that CPU into
account.

Change-Id: Ia00501e3cbfc6e11446a9a2e93e318c4c42bdab4
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
[joonwoop@codeaurora.org: fixed multiple conflicts in fair.c and minor
 conflict in features.h]
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2016-03-23 20:02:18 -07:00
Joonwoo Park
a4c475e43d sched: prevent task migration while governor queries CPUs' load
At present, the governor retrieves each CPU's load sequentially.  This
allows a race between the governor's CPU load query and task migration
that can result in reporting less CPU load than is actually present.

For example,
CPU0 load = 30%.  CPU1 load = 50%.
Governor                               Load balancer
- sched_get_busy(cpu 0) = 30%.
                                       - A task 'p' migrated from CPU 1 to
                                         CPU 0.  p->ravg->prev_window = 50.
                                         Now CPU 0's load = 80%,
                                         CPU 1's load = 0%.
- sched_get_busy(cpu 1) = 0%
  The 50% load migrated from CPU 1 to
  CPU 0 is never accounted.

Fix such issues by introducing a new API, sched_get_cpus_busy(), which
lets the governor retrieve the load of a set of CPUs at once.  The set of
loads is constructed internally while the load balancer is blocked, to
ensure that no migration can occur in the meantime.
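
A user-space sketch of the fix, using the CPU0/CPU1 numbers from the
example above. A single pthread mutex stands in for "blocking the load
balancer"; the real locking is not described in this entry.
migrate_load() and get_cpus_busy() are hypothetical analogues of task
migration and sched_get_cpus_busy().

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NR_CPUS 2

static unsigned int cpu_load[NR_CPUS] = { 30, 50 };
/* Shared by migration and queries, standing in for a blocked balancer. */
static pthread_mutex_t balance_lock = PTHREAD_MUTEX_INITIALIZER;

/* Move `load` from src to dst atomically with respect to queries. */
static void migrate_load(int src, int dst, unsigned int load)
{
	pthread_mutex_lock(&balance_lock);
	cpu_load[src] -= load;
	cpu_load[dst] += load;
	pthread_mutex_unlock(&balance_lock);
}

/* Snapshot all CPUs' loads in one go; no migration can interleave. */
static void get_cpus_busy(unsigned int *out, size_t n)
{
	pthread_mutex_lock(&balance_lock);
	for (size_t i = 0; i < n; i++)
		out[i] = cpu_load[i];
	pthread_mutex_unlock(&balance_lock);
}
```

Because the whole set is read under the same lock as the migration,
the snapshot always sums to the true total (80% here), whichever side
of the migration it lands on.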

Change-Id: I4fa4dd1195eff26aa603829aca2054871521495e
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2016-03-23 20:02:07 -07:00
Srivatsa Vaddagiri
c41a54cb8d sched: Keep track of average nr_big_tasks
Extend sched_get_nr_running_avg() API to return average nr_big_tasks,
in addition to average nr_running and average nr_io_wait tasks. Also
add a new trace point to record values returned by
sched_get_nr_running_avg() API.

Change-Id: Id3591e6d04da8db484b4d1cb9d95dba075f5ab9a
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
[rameezmustafa@codeaurora.org: Resolve trivial merge conflicts]
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
2016-03-23 20:01:42 -07:00
Srivatsa Vaddagiri
44d892787e sched: Fix bug in average nr_running and nr_iowait calculation
sched_get_nr_running_avg() returns average nr_running and nr_iowait
task count since it was last invoked. Fix several bugs in their
calculation.

* sched_update_nr_prod() needs to consider that the nr_running count can
  change by more than 1 when the CFS_BANDWIDTH feature is used

* sched_get_nr_running_avg() needs to sum up nr_iowait count across
  all cpus, rather than just one

* sched_get_nr_running_avg() could race with sched_update_nr_prod(),
  as a result of which it could use curr_time which is behind a cpu's
  'last_time' value. That would lead to erroneous calculation of
  average nr_running or nr_iowait.

While at it, also fix a bug in the BUG_ON() check in
sched_update_nr_prod() and remove the unnecessary nr_running
argument to sched_update_nr_prod().
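
A simplified sketch of time-weighted nr_running averaging with the three
fixes above applied: arbitrary deltas, summing across all CPUs, and
clamping a curr_time that is behind a CPU's last_time. The struct and
function bodies are illustrative, not the kernel's sched_avg code;
results are scaled by 100 to stay in integer arithmetic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct nr_stats {
	uint64_t last_time; /* time of the last update */
	uint64_t nr;        /* current nr_running */
	uint64_t nr_prod;   /* integral of nr_running over time */
};

/* Called whenever nr_running changes; delta may exceed +/-1. */
static void update_nr_prod(struct nr_stats *s, uint64_t now, int64_t delta)
{
	if (now > s->last_time) {
		s->nr_prod += s->nr * (now - s->last_time);
		s->last_time = now;
	}
	assert(delta >= 0 || (uint64_t)(-delta) <= s->nr);
	s->nr = (uint64_t)((int64_t)s->nr + delta);
}

/*
 * Average nr_running (x100), summed across all CPUs, since `since`.
 * A curr_time behind a CPU's last_time is clamped to last_time so the
 * elapsed delta never goes negative (wraps) for that CPU.
 */
static uint64_t get_nr_running_avg(struct nr_stats *cpus, size_t n,
				   uint64_t since, uint64_t curr_time)
{
	uint64_t total = 0;

	assert(curr_time > since);
	for (size_t i = 0; i < n; i++) {
		struct nr_stats *s = &cpus[i];
		uint64_t t = curr_time > s->last_time ? curr_time : s->last_time;

		total += s->nr_prod + s->nr * (t - s->last_time);
		s->nr_prod = 0;      /* restart accumulation */
		s->last_time = t;
	}
	return total * 100 / (curr_time - since);
}
```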

Change-Id: I46737614737292fae0d7204c4648fb9b862f65b2
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
[rameezmustafa@codeaurora.org: Port to msm-3.18]
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
2016-03-23 20:01:41 -07:00
Srivatsa Vaddagiri
207d78dd26 sched: Add userspace interface to set PF_WAKE_UP_IDLE
The sched_prefer_idle flag controls whether tasks can be woken to any
available idle cpu. It may be desirable to set sched_prefer_idle to 0
so that most tasks wake up to non-idle cpus under the mostly_idle
threshold, while specialized tasks override this behavior through
other means. The per-task PF_WAKE_UP_IDLE flag provides exactly that: it
lets tasks with PF_WAKE_UP_IDLE set be woken up to any available idle
cpu, independent of the sched_prefer_idle setting. Currently only a
kernel-space API exists to set the PF_WAKE_UP_IDLE flag for a task.
This patch adds a user-space API (in the /proc filesystem) to set the
PF_WAKE_UP_IDLE flag for a given task: the /proc/[pid]/sched_wake_up_idle
file can be written to set or clear PF_WAKE_UP_IDLE for that task.
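
A user-space sketch that sets or clears the flag by writing to the proc
file. It assumes the file accepts a decimal "1" or "0", which this entry
does not state explicitly. set_wake_up_idle() is a hypothetical helper;
it takes the path as a parameter so it can also be exercised against an
ordinary file.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Write "1" (set) or "0" (clear) to a sched_wake_up_idle file such as
 * "/proc/1234/sched_wake_up_idle". Returns 0 on success, -1 on error.
 */
static int set_wake_up_idle(const char *path, int enable)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	if (fprintf(f, "%d\n", enable ? 1 : 0) < 0) {
		fclose(f);
		return -1;
	}
	/* fclose() flushes; a write error may only surface here. */
	return fclose(f) == 0 ? 0 : -1;
}
```

For example, set_wake_up_idle("/proc/1234/sched_wake_up_idle", 1) would
request idle-cpu wakeups for pid 1234 on a kernel carrying this patch.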

Change-Id: I13a37e740195e503f457ebe291d54e83b230fbeb
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
[rameezmustafa@codeaurora.org: Port to msm-3.18]
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
[joonwoop@codeaurora.org: fixed minor conflict in kernel/sched/fair.c]
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2016-03-23 20:01:33 -07:00
Jeff Ohlstein
b0ccf5db31 sched_avg: add run queue averaging
Add code to calculate the run queue depth of a cpu and iowait
depth of the cpu.

The scheduler calls into sched_update_nr_prod whenever there
is a runqueue change. This function maintains the runqueue average
and the iowait of that cpu in that time interval.

Whoever wants to know the runqueue average is expected to call
sched_get_nr_running_avg periodically to get the accumulated
runqueue and iowait averages for all the cpus.

Change-Id: Id8cb2ecf0ed479f090a83ccb72dd59c53fa73e0c
Signed-off-by: Jeff Ohlstein <johlstei@codeaurora.org>
(cherry picked from commit 0299fcaaad80e2c0ac9aa583c95107f6edc27750)
[rameezmustafa@codeaurora.org: Port to msm-3.18]
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
2016-03-23 20:01:32 -07:00
Srivatsa Vaddagiri
5b45dc56e5 sched: Per-cpu prefer_idle flag
Remove the global sysctl_sched_prefer_idle flag and replace it with a
per-cpu prefer_idle flag. The per-cpu flag is expected to be the same for
all cpus in a cluster. It thus provides a convenient means to disable
packing in one cluster while allowing packing in another cluster.

Change-Id: Ie4cc73bb1a55b4eac5697be38e558546161faca1
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2016-03-23 20:01:26 -07:00
Srivatsa Vaddagiri
29a412dffa sched: Avoid frequent migration of running task
Power values for cpus can drop quite considerably when they go idle.
As a result, the best choice for running a single task in a cluster
can vary quite rapidly. As the task keeps hopping cpus, other cpus go
idle and start being seen as a more favorable target for running the
task, leading to the task migrating on almost every scheduler tick!

Prevent this by keeping track of when a task started running on a cpu
and allowing task migration in the tick path (migration_needed()) for
energy-efficiency reasons only if the task has run sufficiently long
(as determined by the sysctl_sched_min_runtime variable).

Note that currently the sysctl_sched_min_runtime setting is considered
only in the scheduler_tick()->migration_needed() path and not in the
idle_balance() path. In other words, a task could still be migrated to
another cpu that did an idle_balance(). This limitation should not
affect the high-frequency migrations typically seen (when a single
high-demand task runs on a high-performance cpu).
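
A sketch of the tick-path gate. It assumes the scheduler stamps the time
a task started running on its current cpu; energy_migration_allowed()
is a hypothetical name, and min_runtime_ns plays the role of
sysctl_sched_min_runtime. As noted above, the idle_balance() path would
not consult this check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Tick path only: permit an energy-motivated migration once the task
 * has run on its current cpu for at least min_runtime_ns.
 */
static bool energy_migration_allowed(uint64_t run_start_ns, uint64_t now_ns,
				     uint64_t min_runtime_ns)
{
	if (now_ns < run_start_ns)
		return false;
	return now_ns - run_start_ns >= min_runtime_ns;
}
```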

CRs-Fixed: 756570
Change-Id: I96413b7a81b623193c3bbcec6f3fa9dfec367d99
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
[joonwoop@codeaurora.org: fixed conflict in set_task_cpu() and
 __schedule().]
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2016-03-23 20:01:17 -07:00
Steve Muckle
6370716bc3 sched: do not set window until sched_clock is fully initialized
The system initially uses a jiffy-based sched clock. When the platform
registers a new timer for sched_clock, sched_clock can jump backwards.
Once sched_clock_postinit() runs it should be safe to rely on it.

Also, sched_clock_cpu() relies on completion of sched_clock_init(),
and until that happens sched_clock_cpu() returns zero. This is used
in the irq accounting path, which window-based stats rely upon.
So do not set window_start until sched_clock_cpu() is working.

Change-Id: Ided349de8f8554f80a027ace0f63ea52b1c38c68
Signed-off-by: Steve Muckle <smuckle@codeaurora.org>
2016-03-23 20:01:06 -07:00
Srivatsa Vaddagiri
8e3aa6790c sched: Packing support until a frequency threshold
Add another dimension for task packing based on frequency. This patch
adds a per-cpu tunable, rq->mostly_idle_freq, which, when set, results
in tasks being packed on a single cpu in a cluster as long as the
cluster frequency is below the set threshold.

Change-Id: I318e9af6c8788ddf5dfcda407d621449ea5343c0
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2016-03-23 20:01:03 -07:00
Srivatsa Vaddagiri
b2e57842c0 sched: per-cpu mostly_idle threshold
sched_mostly_idle_load and sched_mostly_idle_nr_run knobs help pack
tasks on cpus to some extent. In some cases, it may be desirable to
have different packing limits for different cpus. For example, pack to
a higher limit on high-performance cpus compared to power-efficient
cpus.

This patch removes the global mostly_idle tunables and makes them
per-cpu, letting task packing behavior be controlled in a
fine-grained manner.

Change-Id: Ifc254cda34b928eae9d6c342ce4c0f64e531e6c2
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2016-03-23 20:00:59 -07:00
Srivatsa Vaddagiri
33af11b6f4 sched: Add API to set task's initial task load
Add a per-task attribute, init_load_pct, that is used to initialize
newly created children's initial task load. This helps important
applications launch their child tasks on cpus with highest capacity.
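
A sketch of how a parent's init_load_pct could seed a child's initial
demand. It assumes demand is expressed as a fraction of the window size
and that 0 means "use the default"; initial_child_load() and
default_pct are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Initial load for a newly forked child: the parent's init_load_pct of
 * the window if set, otherwise a default percentage.
 */
static uint64_t initial_child_load(unsigned int parent_init_load_pct,
				   unsigned int default_pct,
				   uint64_t window_ns)
{
	unsigned int pct = parent_init_load_pct ? parent_init_load_pct
						: default_pct;

	assert(pct <= 100);
	return window_ns * pct / 100;
}
```

A high init_load_pct makes the child look "big" from its first
placement decision, which is how it can land on a highest-capacity cpu.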

Change-Id: Ie9665fd2aeb15203f95fd7f211c50bebbaa18727
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
[joonwoop@codeaurora.org: fixed conflict in init_new_task_load.
 se.avg.runnable_avg_sum has been deprecated.]
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2016-03-23 20:00:58 -07:00
Srivatsa Vaddagiri
3a67b4ce87 sched: window-stats: Enhance cpu busy time accounting
rq->curr/prev_runnable_sum counters represent cpu demand from various
tasks that have run on a cpu. Any task that runs on a cpu will have a
representation in rq->curr_runnable_sum. Their partial_demand value
will be included in rq->curr_runnable_sum. Since partial_demand is
derived from historical load samples for a task, rq->curr_runnable_sum
could represent "inflated/unrealistic" cpu usage. As an example, let's
say that a task with a partial_demand of 10ms runs for only 1ms on a cpu.
What is included in rq->curr_runnable_sum is 10ms (and not the actual
execution time of 1ms). This leads to cpu busy time being over-reported,
causing frequency to stay higher than necessary.

This patch fixes the cpu busy accounting scheme to strictly represent
actual usage. It also provides for conditional fixup of busy time upon
migration and upon heavy-task wakeup.
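
A sketch contrasting the two schemes on the 10ms/1ms example, with times
in microseconds. busy_old() and busy_new() are hypothetical names; they
only show which quantity gets added to rq->curr_runnable_sum.

```c
#include <assert.h>
#include <stdint.h>

/* Old scheme: charge the task's history-derived partial_demand. */
static uint64_t busy_old(uint64_t curr_runnable_sum, uint64_t partial_demand,
			 uint64_t exec_time)
{
	(void)exec_time;
	return curr_runnable_sum + partial_demand;
}

/* New scheme: charge only the time the task actually executed. */
static uint64_t busy_new(uint64_t curr_runnable_sum, uint64_t partial_demand,
			 uint64_t exec_time)
{
	(void)partial_demand;
	return curr_runnable_sum + exec_time;
}
```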

CRs-Fixed: 691443
Change-Id: Ic4092627668053934049af4dfef65d9b6b901e6b
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
[joonwoop@codeaurora.org: fixed conflict in init_task_load(),
 se.avg.decay_count has been deprecated.]
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2016-03-23 20:00:50 -07:00
Srivatsa Vaddagiri
3dcd52ded0 sched: Fix compile error
sched_get_busy(), sched_set_io_is_busy() and sched_set_window() need
to be defined only when CONFIG_SCHED_FREQ_INPUT is defined; otherwise
we get a compilation error due to duplicate definitions of those routines.

Change-Id: Ifd5c9b6675b78d04c2f7ef0e24efeae70f7ce19b
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
[joonwoop@codeaurora.org: fixed minor conflict in include/linux/sched.h]
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2016-03-23 20:00:37 -07:00
Junjie Wu
a37679f0c3 sched: Define dummy scheduler freq input functions
Define dummy scheduler freq input functions when
CONFIG_SCHED_FREQ_INPUT is not selected.

Change-Id: Id041cbf157cf9aba86601bf95e1068be206775f0
Signed-off-by: Junjie Wu <junjiew@codeaurora.org>
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
[joonwoop@codeaurora.org: fixed minor conflict in include/linux/sched.h]
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2016-03-23 20:00:36 -07:00
Srivatsa Vaddagiri
2a7d718b3d sched: window-stats: Fix exit race
Exiting tasks are removed from the tasklist and hence at some point
become invisible to the do_each_thread/for_each_thread task iterators.
This breaks the functionality of reset_all_windows_stats(), which *has*
to reset stats for *all* tasks.

This patch causes exiting tasks' stats to be reset *before* they are
removed from the tasklist. The DONT_ACCOUNT bit in an exiting task's
ravg.flags is also set so that its remaining execution time is not
accounted in the cpu busy time counters (rq->curr/prev_runnable_sum).
reset_all_windows_stats() is thus guaranteed to return with all tasks'
stats reset to 0.

Change-Id: I5f101156a4f958c1b3f31eb0db8cd06e621b75e9
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2016-03-23 20:00:27 -07:00
Srivatsa Vaddagiri
9425ce4309 sched: window-stats: Remove unused prev_window variable
Remove unused prev_window variable in 'struct ravg'

Change-Id: I22ec040bae6fa5810f9f8771aa1cb873a2183746
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2016-03-23 20:00:22 -07:00
Olav Haugan
8eede4a8d5 sched: Make RAVG_HIST_SIZE tunable
Make RAVG_HIST_SIZE available from /proc/sys/kernel/sched_ravg_hist_size
to allow tuning of the size of the history that is used in computation
of task demand.

CRs-fixed: 706138
Change-Id: Id54c1e4b6e974a62d787070a0af1b4e8ce3b4be6
Signed-off-by: Olav Haugan <ohaugan@codeaurora.org>
[joonwoop@codeaurora.org: fixed minor conflict in sysctl.h]
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2016-03-23 20:00:19 -07:00
Srivatsa Vaddagiri
c097c9b574 sched: window-stats: Account interrupt handling time as busy time
Account cycles spent by idle cpu handling interrupts (irq or softirq)
towards its busy time.

Change-Id: I84cc084ced67502e1cfa7037594f29ed2305b2b1
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
[joonwoop@codeaurora.org: fixed minor conflict in core.c]
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2016-03-23 20:00:14 -07:00
Srivatsa Vaddagiri
c20a41478d sched: window-stats: Account idle time as busy time
Provide a knob to consider idle time as busy time when a cpu becomes
idle as a result of an io_schedule() call. This lets the governor
parameter 'io_is_busy' be honored appropriately.

Change-Id: Id9fb4fe448e8e4909696aa8a3be5a165ad7529d3
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2016-03-23 20:00:13 -07:00
Srivatsa Vaddagiri
900b44b621 sched: window-stats: Account wait time
Extend window-based task load accounting mechanism to include
wait-time as part of task demand. A subsequent patch will make this
feature configurable at runtime.

Change-Id: I8e79337c30a19921d5c5527a79ac0133b385f8a9
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2016-03-23 20:00:12 -07:00
Srivatsa Vaddagiri
ad25ca2afb sched: support legacy mode better
It should be possible to bypass all HMP scheduler changes at runtime
by setting sysctl_sched_enable_hmp_task_placement and
sysctl_sched_enable_power_aware to 0.  Fix various code paths to honor
this requirement.

Change-Id: I74254e68582b3f9f1b84661baf7dae14f981c025
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
[joonwoop@codeaurora.org: fixed conflict in rt.c, p->nr_cpus_allowed ==
 1 is now moved in core.c]
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2016-03-23 19:59:54 -07:00
Syed Rameez Mustafa
754f666131 sched/fair: Introduce scheduler boost for low latency workloads
Certain low latency bursty workloads require immediate use of highest
capacity CPUs in HMP systems. Existing load tracking mechanisms may be
unable to respond to the sudden surge in the system load within the
latency requirements. Introduce the scheduler boost feature for such
workloads. While boost is in effect the scheduler bypasses regular load
based task placement and prefers highest capacity CPUs in the system
for all non-small fair sched class tasks. Provide both a kernel and a
userspace API for software that may have a priori knowledge about the
system workload.
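
A sketch of the boost override. It assumes each CPU has a capacity value
and that the caller already knows whether the task is "small" and what
the regular load-based choice would be. max_capacity_cpu() and
pick_cpu() are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Index of the CPU with the highest capacity (first one on ties). */
static int max_capacity_cpu(const unsigned int *capacity, size_t n)
{
	size_t best = 0;

	assert(n > 0);
	for (size_t i = 1; i < n; i++)
		if (capacity[i] > capacity[best])
			best = i;
	return (int)best;
}

/*
 * While boost is on, non-small fair tasks bypass load-based placement
 * and go to the highest capacity CPU; otherwise keep the regular choice.
 */
static int pick_cpu(bool boost, bool task_is_small,
		    const unsigned int *capacity, size_t n,
		    int load_based_choice)
{
	if (boost && !task_is_small)
		return max_capacity_cpu(capacity, n);
	return load_based_choice;
}
```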

Change-Id: I783f585d1f8c97219e629d9c54f712318821922f
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
[joonwoop@codeaurora.org: fixed minor conflict in
 include/linux/sched/sysctl.h.]
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2016-03-23 19:59:45 -07:00
Srivatsa Vaddagiri
3f2beb24f2 sched: make sched_set_window() return failure when PELT is in use
Window-based load tracking is a prerequisite for the scheduler to
feed cpu load information to the governor. When PELT is in use, return
failure when governor attempts to set window-size. This will let
governor fall back to other APIs for retrieving cpu load statistics.

Change-Id: I0e11188594c1a54b3b7ff55447d30bfed1a01115
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
[joonwoop@codeaurora.org: fixed trivial merge conflict
 in include/linux/sched.h.]
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2016-03-23 19:59:40 -07:00
Srivatsa Vaddagiri
7f78facb9a sched: Add new trace events
Add trace events for update_task_ravg(), update_history(), and
set_task_cpu(). These tracepoints are useful for monitoring the
per-task and per-runqueue demand statistics.

Change-Id: Ibec9f945074ff31d1fc1a76ae37c40c8fea8cda9
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2016-03-23 19:59:36 -07:00
Steve Muckle
53a4978f80 sched: do not balance on exec if SCHED_HMP
Rebalancing at exec time will currently undo any beneficial placement
that has been done during fork time, since select_best_cpu() will not
discount the currently running task.

For now just skip re-evaluating task placement at exec.

Change-Id: I1e5e0fcc329b7b53c338c8c73795ebd5e85a118b
Signed-off-by: Steve Muckle <smuckle@codeaurora.org>
2016-03-23 19:59:35 -07:00
Srivatsa Vaddagiri
2735664021 sched: Use historical load for freq governor input
Historical load maintained per task can be used to influence cpu
frequency better. For example, when a heavy demand task wakes up after
prolonged sleep, we could use the historical load information to alert
cpufreq governor about the need to raise cpu frequency. This patch
changes CPU busy statistics to be an aggregation of historical task
demand. Also, a task's historical load (as defined by
sysctl_sched_window_stats_policy) is added to the cpu's busy statistics
(rq->curr_runnable_sum) whenever it executes on a cpu.

Change-Id: I2b66136f138b147ba19083b9b044c4feb20d9b57
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
[rameezmustafa@codeaurora.org: Port to msm-3.18]
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
2016-03-23 19:59:34 -07:00
Steve Muckle
f469bce8e2 sched: add migration load change notifier for frequency guidance
When a task moves between CPUs in two different frequency domains
the cpufreq governor may wish to immediately modify the frequency
of both the source and destination CPUs of the migrating task.

A tunable is provided to establish what size task is considered
"significant" enough to warrant notifying cpufreq.

Also fix a bug that would cause load to not be accounted properly
during wakeup migrations.
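
A sketch of the "significant task" test. It assumes task load is a
percentage and that the tunable is a percentage threshold; limiting the
notification to cross-domain moves follows the first paragraph above
and is an assumption about how the check is wired.
should_notify_migration() is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Notify cpufreq only when the task crosses frequency domains and its
 * load is at least significant_pct.
 */
static bool should_notify_migration(int src_freq_domain, int dst_freq_domain,
				    unsigned int task_load_pct,
				    unsigned int significant_pct)
{
	if (src_freq_domain == dst_freq_domain)
		return false;
	return task_load_pct >= significant_pct;
}
```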

Change-Id: Ie8f6b1cc4d43a602840dac18590b42a81327c95a
Signed-off-by: Steve Muckle <smuckle@codeaurora.org>
[rameezmustafa@codeaurora.org: Add double rq locking for set_task_cpu()]
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
2016-03-23 19:59:29 -07:00
Srivatsa Vaddagiri
188a6bc174 sched: add sched_get_busy, sched_set_window APIs
sched_get_busy() returns the busy time of a cpu during the most
recent completed window.
sched_set_window() sets the window size and aligns windows across
all CPUs.

Change-Id: Ic53e27f43fd4600109b7b6db979e1c52c7aca103
Signed-off-by: Steve Muckle <smuckle@codeaurora.org>
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
[joonwoop@codeaurora.org: fixed minor conflict in include/linux/sched.h]
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2016-03-23 19:59:19 -07:00
Srivatsa Vaddagiri
9427c55650 sched: window-stats: add prev_window counter per-task
Currently, windows where tasks had no execution time are ignored.
However, accurate accounting of cpu busy time that factors in migration
needs to know the actual utilization of a task in the window previous
to the latest one. This helps the scheduler guide the cpufreq governor
with per-cpu busy time that is not subject to migration-induced errors.

Change-Id: I5841b1732c83e83d69002139de3bdb93333ce347
Signed-off-by: Steve Muckle <smuckle@codeaurora.org>
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2016-03-23 19:59:17 -07:00
Srivatsa Vaddagiri
33e7100103 sched: window-stats: synchronize windows across cpus
Synchronizing windows across cpus for task load measurements
simplifies cpu busy time accounting during migrations. When a task
migrates, its usage in the current window can be carried over to its
new cpu. This lets the cpufreq governor see a correct picture of cpu
busy time that is not affected by migrations.

This patch lines up windows across cpus. One of the cpus, sync_cpu,
serves as a reference for all others. During bootup, sync_cpu
initializes its window_start (from its sched_clock()). Other cpus
synchronize their window_start in reference to sync_cpu. This patch
assumes that sched_clock() is synchronized across cpus and may need some
changes to address architectures that do not provide such a synchronized
sched_clock().
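
A sketch of the alignment arithmetic, assuming the synchronized clock
described above: a cpu joins the window grid anchored at sync_cpu's
window_start by rounding its current time down to the most recent
window boundary. align_window_start() is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Most recent window boundary at or before `now`, on the grid defined
 * by sync_cpu's window_start and the window size.
 */
static uint64_t align_window_start(uint64_t sync_window_start, uint64_t now,
				   uint64_t window_size)
{
	assert(window_size > 0 && now >= sync_window_start);
	uint64_t nr_windows = (now - sync_window_start) / window_size;

	return sync_window_start + nr_windows * window_size;
}
```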

Change-Id: I13381389a72f5f9f85cc2446401d493a55c78ab7
Signed-off-by: Steve Muckle <smuckle@codeaurora.org>
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2016-03-23 19:59:16 -07:00
Srivatsa Vaddagiri
551f83f5d6 sched: Add CONFIG_SCHED_HMP Kconfig option
Add a compile-time flag to enable or disable scheduler features for
HMP (heterogeneous multi-processor) systems. The main feature deals with
optimizing task placement for the best power/performance tradeoff.

Also extend features currently dependent on CONFIG_SCHED_FREQ_INPUT to
be enabled for CONFIG_HMP as well.

Change-Id: I03b3942709a80cc19f7b934a8089e1d84c14d72d
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
[rameezmustafa@codeaurora.org: Port to msm-3.18]
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
[joonwoop@codeaurora.org: fixed minor ifdefery conflict.]
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2016-03-23 19:59:00 -07:00
Srivatsa Vaddagiri
025dedac36 sched: Add scaled task load statistics
Scheduler guided frequency selection as well as task placement on
heterogeneous systems require scaled task load statistics. This patch
adds a 'runnable_avg_sum_scaled' metric per task that is a scaled
derivative of 'runnable_avg_sum'. Load is scaled in reference to the
"best" cpu, i.e. the one with the best possible max_freq.

Change-Id: Ie8ae450d0b02753e9927fb769aee734c6d33190f
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
[rameezmustafa@codeaurora.org: Port to msm-3.18]
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
[joonwoop@codeaurora.org: incorporated with change 9d89c257df
 ("sched/fair: Rewrite runnable load and utilization average
 tracking").  Used container_of() to get sched_entity.]
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2016-03-23 19:58:59 -07:00
Srivatsa Vaddagiri
77fe8dd14d sched: Introduce CONFIG_SCHED_FREQ_INPUT
Introduce a compile time flag to enable scheduler guidance of
frequency selection. This flag is also used to turn on or off
window-based load stats feature.

Having a compile time flag will let some platforms avoid any
overhead that may be present with this scheduler feature.

Change-Id: Id8dec9839f90dcac82f58ef7e2bd0ccd0b6bd16c
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
[rameezmustafa@codeaurora.org: Port to msm-3.18]
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
[joonwoop@codeaurora.org: fixed minor conflict around
 sysctl_timer_migration.]
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2016-03-23 19:58:59 -07:00
Srivatsa Vaddagiri
3967da2dd1 sched: Window-based load stat improvements
Some tasks can have a sporadic load pattern such that they suddenly
start running for longer intervals of time after running for shorter
durations. To recognize such sharp increases in task demand, the maximum
of the average of 5 window load samples and the most recent sample
is chosen as the task demand.
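
A sketch of this demand policy, assuming samples are stored newest-first
in an array whose length is the history size (5 here; RAVG_HIST_SIZE
in the code, later made tunable). task_demand() is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Demand = max(average of the history, most recent sample).
 * hist[0] holds the most recent window's sample.
 */
static uint64_t task_demand(const uint64_t *hist, size_t hist_size)
{
	uint64_t sum = 0;

	assert(hist_size > 0);
	for (size_t i = 0; i < hist_size; i++)
		sum += hist[i];

	uint64_t avg = sum / hist_size;
	return hist[0] > avg ? hist[0] : avg;
}
```

A task that jumps from 1 to 9 in its latest window gets demand 9
immediately, instead of waiting for the average (2 here) to catch up.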

Make the window size (sched_ravg_window) configurable at boot up
time. To prevent users from setting inappropriate values for window
size, min and max limits are defined. As the 'ravg' struct tracks load
for both real-time and non-real-time tasks, it is moved out of the
sched_entity struct.

In order to avoid changing the function signatures of move_tasks() and
move_one_task(), per-cpu variables are defined to track the total load
moved. In case multiple tasks are selected to migrate in one load
balance operation, loads > 100 could be sent through migration notifiers.
Prevent this scenario by setting mnd.load to 100 in such cases.

Define wrapper functions to compute cpu demands for tasks and to change
rq->cumulative_runnable_avg.

Change-Id: I9abfbf3b5fe23ae615a6acd3db9580cfdeb515b4
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
Signed-off-by: Rohit Gupta <rohgup@codeaurora.org>
[rameezmustafa@codeaurora.org: Port to msm-3.18 and squash "dcf7256 sched:
 window-stats: Fix overflow bug" into this patch.]
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
[joonwoop@codeaurora.org: fixed conflict in __migrate_task().]
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2016-03-23 19:58:53 -07:00
Rohit Gupta
48056d2399 cpufreq: cpu-boost: Introduce scheduler assisted load based syncs
Previously, on getting a migration notification, cpu-boost changed
the scaling min of the destination frequency to match that of the
source frequency or sync_threshold, whichever was lower.

If the scheduler migration notification is extended with task load
(cpu demand) information, the cpu-boost driver can use this load to
compute a suitable frequency for the migrating task. The required
frequency for the task is calculated as the task's load percentage
of the max frequency, and no sync is performed if the load is less
than a particular value (migration_load_threshold). This change
benefits both performance and power, as the demand of a task is taken
into consideration when making cpufreq decisions and unnecessary syncs
for lightweight tasks are avoided.
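A sketch of the frequency computation described above, assuming load_pct is the task's load as a percentage of the maximum frequency and frequencies are in kHz. boost_freq_for_task() is a hypothetical name, and returning 0 to mean "no sync" is a convention chosen only for this example.

```c
/* Returns the minimum frequency (kHz) to request on the destination CPU,
 * or 0 when the task is too light to justify a sync. */
static unsigned int boost_freq_for_task(unsigned int load_pct,
					unsigned int max_freq_khz,
					unsigned int migration_load_threshold)
{
	if (load_pct < migration_load_threshold)
		return 0; /* lightweight task: skip the sync */

	/* Widen before multiplying to avoid overflow with large kHz values. */
	return (unsigned int)((unsigned long long)load_pct * max_freq_khz / 100);
}
```

For example, a task at 50% load on a CPU with a 2 GHz maximum would request 1 GHz, while a 10% task with a threshold of 15 would trigger no sync.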

The task load information provided by scheduler comes from a
window-based load collection mechanism which also normalizes the
load collected by the scheduler to the max possible frequency
across all CPUs.

Change-Id: Id2ba91cc4139c90602557f9b3801fb06b3c38992
Signed-off-by: Rohit Gupta <rohgup@codeaurora.org>
[rameezmustafa@codeaurora.org: Port to msm-3.18]
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
[joonwoop@codeaurora.org: fixed conflict in __migrate_task().]
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2016-03-23 19:58:46 -07:00
Srivatsa Vaddagiri
74463329e4 sched: window-based load stats for tasks
Provide a per-task metric that specifies how cpu bound a task is. Task
execution is monitored over several time windows, and the fraction of
the window for which the task was found to be executing or wanting to
run is recorded as the task's demand. Windows over which the task was
sleeping are ignored. We track the 5 most recent windows for every task,
and the maximum demand seen in any of those 5 windows (where the task
had some activity) drives the task's frequency demand.
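A sketch of the per-task history, assuming a fixed array of 5 window samples with the newest first. struct ravg_sketch and update_history() are hypothetical names. Windows with no activity leave the history untouched, and the demand is the maximum remaining sample.

```c
#include <stdint.h>
#include <string.h>

#define RAVG_HIST_SIZE 5

/* Hypothetical per-task window history. */
struct ravg_sketch {
	uint32_t sum_history[RAVG_HIST_SIZE]; /* [0] = most recent window */
	uint32_t demand;                      /* max over the history */
};

/* Record the runnable time observed in the window that just ended.
 * Windows in which the task was asleep throughout (runtime == 0) are
 * ignored so they do not dilute the history. */
static void update_history(struct ravg_sketch *ra, uint32_t runtime)
{
	if (runtime == 0)
		return;

	/* Shift older samples back by one; the oldest sample is dropped. */
	memmove(&ra->sum_history[1], &ra->sum_history[0],
		(RAVG_HIST_SIZE - 1) * sizeof(ra->sum_history[0]));
	ra->sum_history[0] = runtime;

	uint32_t max = 0;
	for (int i = 0; i < RAVG_HIST_SIZE; i++)
		if (ra->sum_history[i] > max)
			max = ra->sum_history[i];
	ra->demand = max;
}
```

memmove() keeps the sketch short; a ring buffer with a head index would avoid the copy.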

A per-cpu metric (rq->cumulative_runnable_avg) is also provided, which
aggregates the cpu demand of all tasks currently enqueued on the cpu.
rq->cumulative_runnable_avg will be useful to know whether the cpu
frequency needs to be changed to match task demand.
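A sketch of how such a per-cpu aggregate can be maintained, using simplified, hypothetical stand-ins for the runqueue and task structures; enqueue_demand() and dequeue_demand() are illustrative names, not the patch's functions.

```c
#include <stdint.h>

struct rq_sketch {
	uint64_t cumulative_runnable_avg; /* sum of demand of enqueued tasks */
};

struct task_sketch {
	uint32_t demand; /* per-task demand, e.g. from the window history */
};

/* Called when a task is enqueued on this runqueue. */
static void enqueue_demand(struct rq_sketch *rq, const struct task_sketch *p)
{
	rq->cumulative_runnable_avg += p->demand;
}

/* Called when a task is dequeued; must subtract exactly what was added. */
static void dequeue_demand(struct rq_sketch *rq, const struct task_sketch *p)
{
	rq->cumulative_runnable_avg -= p->demand;
}
```

This sketch assumes a task's demand does not change while it is enqueued. If it can, the runqueue total has to be adjusted by the difference at the time of the change; otherwise the subtraction on dequeue no longer matches the addition.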

Change-Id: Ib83207b9ba8683cd3304ee8a2290695c34f08fe2
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
[rameezmustafa@codeaurora.org: Port to msm-3.18]
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
[joonwoop@codeaurora.org: fixed conflict in ttwu_do_wakeup() to
 incorporate with changed trace_sched_wakeup() location.]
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2016-03-23 19:58:39 -07:00
Steve Muckle
63249df6b2 sched: provide per cpu-cgroup option to notify on migrations
On systems where CPUs may run asynchronously, task migrations
between CPUs running at grossly different speeds can cause
problems.

This change provides a mechanism to notify a subsystem
in the kernel if a task in a particular cgroup migrates to a
different CPU. Other subsystems (such as cpufreq) may then
register for this notifier to take appropriate action when
such a task is migrated.

The cgroup attribute to set for this behavior is
"notify_on_migrate".

Change-Id: Ie1868249e53ef901b89c837fdc33b0ad0c0a4590
Signed-off-by: Steve Muckle <smuckle@codeaurora.org>
[rameezmustafa@codeaurora.org: Use new cgroup APIs, fix 64-bit
			compilation issues and resolve some merge
			conflicts. Also squash "2bd8075 sched:
			remove migration notification from RT class"
			into this patch.]
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
[joonwoop@codeaurora.org: Incorporated with new __migrate_task().]
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2016-03-23 19:58:33 -07:00
Steve Muckle
74b3b06c52 sched: add PF_WAKE_UP_IDLE
Certain workloads may benefit from the SD_SHARE_PKG_RESOURCES behavior
of waking their tasks up on idle CPUs. However, the feature has too much
of a negative impact on other workloads to apply globally. The
PF_WAKE_UP_IDLE flag tells the scheduler to wake up tasks that have this
flag set, or tasks woken by tasks with this flag set, on an idle CPU
if one is available.
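A sketch of the wakeup rule above. The flag value, struct task_sketch, wants_idle_cpu() and select_wakeup_cpu() are hypothetical; the real flag bit and placement logic are not shown in the text, and "first idle CPU" stands in for whatever search the scheduler actually performs.

```c
#include <stdbool.h>

/* Hypothetical flag bit used only in this sketch. */
#define PF_WAKE_UP_IDLE_SKETCH 0x1u

struct task_sketch {
	unsigned int flags;
};

/* True if either the task being woken or the waker carries the flag. */
static bool wants_idle_cpu(const struct task_sketch *wakee,
			   const struct task_sketch *waker)
{
	return ((wakee->flags | waker->flags) & PF_WAKE_UP_IDLE_SKETCH) != 0;
}

/* Pick the first idle CPU when requested; otherwise, or when no CPU
 * is idle, keep the previous CPU. */
static int select_wakeup_cpu(const bool *cpu_idle, int nr_cpus,
			     int prev_cpu, bool want_idle)
{
	if (want_idle) {
		for (int cpu = 0; cpu < nr_cpus; cpu++)
			if (cpu_idle[cpu])
				return cpu;
	}
	return prev_cpu;
}
```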

Change-Id: I20b28faf35029f9395e9d9f5ddd57ce2de795039
Signed-off-by: Steve Muckle <smuckle@codeaurora.org>
[joonwoop@codeaurora.org: fixed conflict around set_wake_up_idle() in
 include/linux/sched.h]
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2016-03-23 19:58:27 -07:00
Srivatsa Vaddagiri
8da8122d5f sched: Make the scheduler aware of C-state for cpus
A C-state represents a power state of a cpu. A cpu could have one or
more C-states associated with it. C-state transitions are based on
various factors (expected sleep time, for example). "Deeper" C-states
imply longer wakeup latencies.

The scheduler needs to know the wakeup latency associated with various
C-states. Having this information allows the scheduler to make better
decisions during task placement. For example:

- Prefer an idle cpu that is in the shallowest C-state
- Avoid waking up small tasks on an idle cpu unless it is in the
  shallowest C-state
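The two placement rules can be sketched as follows, assuming each CPU's current C-state is known as an index where 0 is the shallowest (lowest wakeup latency) and -1 marks a busy CPU. Both helpers are hypothetical and do not reflect the APIs the patch introduces.

```c
#include <stdbool.h>

/* Return the idle CPU in the shallowest C-state, or -1 if none is idle.
 * Ties go to the lowest-numbered CPU. */
static int pick_shallowest_idle_cpu(const int *cstate, int nr_cpus)
{
	int best_cpu = -1;
	int best_cstate = 0;

	for (int cpu = 0; cpu < nr_cpus; cpu++) {
		if (cstate[cpu] < 0)
			continue; /* busy CPU */
		if (best_cpu < 0 || cstate[cpu] < best_cstate) {
			best_cpu = cpu;
			best_cstate = cstate[cpu];
		}
	}
	return best_cpu;
}

/* Small tasks may only wake an idle CPU sitting in the shallowest state. */
static bool small_task_may_wake(int cstate)
{
	return cstate == 0;
}
```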

This patch introduces APIs in the scheduler that can be used by the
architecture-specific power-management driver to inform the scheduler
about C-states for cpus.

Change-Id: I39c5ae6dbace4f8bd96e88f75cd2d72620436dd1
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
2016-03-23 19:58:27 -07:00
Andrey Ryabinin
788349f659 UBSAN: run-time undefined behavior sanity checker
UBSAN uses compile-time instrumentation to catch undefined behavior
(UB).  The compiler inserts code that performs certain kinds of checks
before operations that could cause UB.  If a check fails (i.e.  UB is
detected), a __ubsan_handle_* function is called to print an error
message.

So most of the work is done by the compiler.  This patch just implements
the ubsan handlers that print the errors.
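A conceptual sketch of the kind of check the compiler inserts before a shift, written out by hand. report_shift_out_of_bounds() and checked_shl() are hypothetical; the real __ubsan_handle_* functions receive compiler-generated data describing the source location and operand types. Returning 0 instead of performing the shift is done purely to keep this example well-defined.

```c
#include <limits.h>
#include <stdio.h>

/* Count of reports, so the behavior can be observed. */
static int ub_reports;

/* Hypothetical stand-in for a runtime handler: print and count. */
static void report_shift_out_of_bounds(unsigned int lhs, int rhs)
{
	ub_reports++;
	fprintf(stderr, "shift exponent %d is negative or too large for %u\n",
		rhs, lhs);
}

/* Hand-written equivalent of an instrumented 'lhs << rhs'. */
static unsigned int checked_shl(unsigned int lhs, int rhs)
{
	const int width = (int)(sizeof(lhs) * CHAR_BIT);

	if (rhs < 0 || rhs >= width) {
		report_shift_out_of_bounds(lhs, rhs);
		return 0; /* never execute the undefined shift in this sketch */
	}
	return lhs << rhs;
}
```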

GCC has had this capability since 4.9.x [1] (see the -fsanitize=undefined
option and its suboptions).
However, GCC 5.x implements more checkers [2].
Article [3] has a few more details about UBSAN in GCC.

[1] - https://gcc.gnu.org/onlinedocs/gcc-4.9.0/gcc/Debugging-Options.html
[2] - https://gcc.gnu.org/onlinedocs/gcc/Debugging-Options.html
[3] - http://developerblog.redhat.com/2014/10/16/gcc-undefined-behavior-sanitizer-ubsan/

Issues which UBSAN has found thus far are:

Found bugs:

 * out-of-bounds access - 97840cb67f ("netfilter: nfnetlink: fix
   insufficient validation in nfnetlink_bind")

undefined shifts:

 * d48458d4a7 ("jbd2: use a better hash function for the revoke
   table")

 * 10632008b9 ("clockevents: Prevent shift out of bounds")

 * 'x << -1' shift in ext4 -
   http://lkml.kernel.org/r/<5444EF21.8020501@samsung.com>

 * undefined rol32(0) -
   http://lkml.kernel.org/r/<1449198241-20654-1-git-send-email-sasha.levin@oracle.com>

 * undefined dirty_ratelimit calculation -
   http://lkml.kernel.org/r/<566594E2.3050306@odin.com>

 * undefined rounddown_pow_of_two(0) -
   http://lkml.kernel.org/r/<1449156616-11474-1-git-send-email-sasha.levin@oracle.com>

 * [WONTFIX] undefined shift in __bpf_prog_run -
   http://lkml.kernel.org/r/<CACT4Y+ZxoR3UjLgcNdUm4fECLMx2VdtfrENMtRRCdgHB2n0bJA@mail.gmail.com>

   WONTFIX because it should be fixed in the bpf program, not in the kernel.

signed overflows:

 * 32a8df4e0b ("sched: Fix odd values in effective_load()
   calculations")

 * mul overflow in ntp -
   http://lkml.kernel.org/r/<1449175608-1146-1-git-send-email-sasha.levin@oracle.com>

 * incorrect conversion into rtc_time in rtc_time64_to_tm() -
   http://lkml.kernel.org/r/<1449187944-11730-1-git-send-email-sasha.levin@oracle.com>

 * unvalidated timespec in io_getevents() -
   http://lkml.kernel.org/r/<CACT4Y+bBxVYLQ6LtOKrKtnLthqLHcw-BMp3aqP3mjdAvr9FULQ@mail.gmail.com>

 * [NOTABUG] signed overflow in ktime_add_safe() -
   http://lkml.kernel.org/r/<CACT4Y+aJ4muRnWxsUe1CMnA6P8nooO33kwG-c8YZg=0Xc8rJqw@mail.gmail.com>

[akpm@linux-foundation.org: fix unused local warning]
[akpm@linux-foundation.org: fix __int128 build woes]
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Marek <mmarek@suse.cz>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Yury Gribov <y.gribov@samsung.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Kostya Serebryany <kcc@google.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Git-repo: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/
Git-commit: c6d308534aef6c99904bf5862066360ae067abc4
[tsoni@codeaurora.org: trivial merge conflict resolution]
CRs-Fixed: 969533
Change-Id: I048b9936b1120e0d375b7932c59de78d8ef8f411
Signed-off-by: Trilok Soni <tsoni@codeaurora.org>
[satyap@codeaurora.org: trivial merge conflict resolution]
Signed-off-by: Satya Durga Srinivasu Prabhala <satyap@codeaurora.org>
2016-03-22 11:09:57 -07:00
Peter Zijlstra
be958bdc96 sched/core: Fix unserialized r-m-w scribbling stuff
Some of the sched bitfields (notably sched_reset_on_fork) can be set
on tasks other than current; this can cause the r-m-w
(read-modify-write) to race with other updates.

Since all the sched bits are serialized by scheduler locks, pull them
into a separate word.
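A sketch of the layout idea, assuming a C compiler that honors an unnamed zero-width bitfield by starting the next bitfield in a new allocation unit (standard C behavior; the unit size is implementation-defined, and the sizes checked here assume GCC or Clang on a common ABI). Field names other than sched_reset_on_fork are hypothetical.

```c
/* Before: all bits share one word, so updating one bit rewrites them all. */
struct task_bits_shared {
	unsigned sched_reset_on_fork:1;
	unsigned other_flag:1;
};

/* After: bits serialized by scheduler locks live in their own word. */
struct task_bits_split {
	unsigned sched_reset_on_fork:1;
	unsigned sched_other:1;
	unsigned :0;           /* force the next bitfield into a new unit */
	unsigned other_flag:1; /* only ever written by 'current' */
};
```

The split only helps if every writer of the first word holds the scheduler lock; the layout by itself provides no synchronization.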

Reported-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: akpm@linux-foundation.org
Cc: hannes@cmpxchg.org
Cc: mhocko@kernel.org
Cc: vdavydov@parallels.com
Link: http://lkml.kernel.org/r/20151125150207.GM11639@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-06 11:01:07 +01:00