Commit graph

61 commits

Author SHA1 Message Date
Joonwoo Park
07eb3f803b sched: select task's prev_cpu as the best CPU when it was chosen recently
Select a given task's prev_cpu when the task slept for a short period, to
reduce the latency of task placement and migration.  A new tunable
/proc/sys/kernel/sched_select_prev_cpu_us is introduced to determine
whether tasks are eligible to go through the fast path.

CRs-fixed: 947467
Change-Id: Ia507665b91f4e9f0e6ee1448d8df8994ead9739a
[joonwoop@codeaurora.org: fixed conflict in include/linux/sched.h,
 include/linux/sched/sysctl.h, kernel/sched/core.c and kernel/sysctl.c]
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2016-03-23 20:02:43 -07:00
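
To illustrate the fast path this commit describes, here is a minimal
userspace C sketch; the struct fields and function names are hypothetical,
and only the sched_select_prev_cpu_us tunable comes from the commit message:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Stands in for /proc/sys/kernel/sched_select_prev_cpu_us (value made up). */
    static uint64_t sysctl_sched_select_prev_cpu_us = 2000;

    struct task {
        uint64_t sleep_start_ns;  /* when the task last went to sleep */
        int      prev_cpu;        /* cpu the task last ran on */
    };

    /* Fast path: if the task slept only briefly, place it back on prev_cpu
     * without scanning for a better cpu. */
    static bool select_prev_cpu(const struct task *p, uint64_t now_ns, int *cpu)
    {
        uint64_t slept_us = (now_ns - p->sleep_start_ns) / 1000;

        if (slept_us < sysctl_sched_select_prev_cpu_us) {
            *cpu = p->prev_cpu;
            return true;
        }
        return false;  /* fall back to the full placement algorithm */
    }

    int main(void)
    {
        struct task p = { .sleep_start_ns = 0, .prev_cpu = 2 };
        int cpu = -1;

        /* slept 1500us, under the 2000us threshold: fast path fires */
        printf("fast path: %d, cpu = %d\n",
               select_prev_cpu(&p, 1500000, &cpu), cpu);
        return 0;
    }
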
Syed Rameez Mustafa
c00814c023 sched: Notify cpufreq governor early about potential big tasks
Tasks that are on the runqueue continuously for a certain amount of time
have the potential to be big tasks at the end of the window in which they
are runnable. In such scenarios, ramping the CPU frequency early can
boost performance rather than waiting until the end of a window for the
governor to query load. Notify the governor early at every tick when a
task has been observed to execute beyond some percentage of the tick
period.

The threshold beyond which a task is eligible for early detection can be
changed via the tunable sched_early_detection_duration. The feature itself
is enabled only when scheduler boost is in effect.

Change-Id: I528b72bbc79a55b4593d1b8ab45450411c6d70f3
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
[joonwoop@codeaurora.org: fixed conflict in scheduler_tick() in
 kernel/sched/core.c.  fixed minor conflicts in include/linux/sched.h,
 include/linux/sched/sysctl.h and kernel/sysctl.c due to
 CONFIG_SCHED_QHMP.]
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2016-03-23 20:02:34 -07:00
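
A rough C sketch of the tick-time check described above; everything here is
illustrative except the sched_early_detection_duration tunable and the boost
gating named in the commit:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Stands in for the sched_early_detection_duration tunable (value made up). */
    static uint64_t early_detection_duration_ns = 9000000;  /* 9ms of a 10ms tick */

    static bool sched_boost_active = true;  /* feature is gated on scheduler boost */

    struct task {
        uint64_t runnable_since_ns;  /* when the task last became runnable */
    };

    /* Called from the tick: notify the governor early when a task has been
     * runnable for most of the tick period and may turn out to be big. */
    static bool early_detection_notify(const struct task *p, uint64_t now_ns)
    {
        if (!sched_boost_active)
            return false;
        return (now_ns - p->runnable_since_ns) >= early_detection_duration_ns;
    }

    int main(void)
    {
        struct task p = { .runnable_since_ns = 0 };
        printf("notify: %d\n", early_detection_notify(&p, 9500000));  /* 1 */
        return 0;
    }
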
Joonwoo Park
446beddcd4 sched: account new task load so that governor can apply different policy
Account the amount of load contributed by new tasks within the CPU load
so that the governor can apply a different policy when a CPU is loaded
by new tasks.

To make new task load distinguishable, a new tunable
sched_new_task_windows is also introduced.  The tunable defines tasks as
new when they have been active for fewer than the configured number of
windows.

Change-Id: I2e2e62e4103882f7362154b792ab978b181b9f59
Suggested-by: Saravana Kannan <skannan@codeaurora.org>
[joonwoop@codeaurora.org: omitted changes for
 drivers/cpufreq/cpufreq_interactive.c.  cpufreq changes need to be
 applied separately later.  fixed conflict in include/linux/sched.h and
 include/linux/sched/sysctl.h.  omitted changes for qhmp_core.c]
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2016-03-23 20:02:29 -07:00
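
A minimal sketch of the "new task" classification; the field and helper
names are hypothetical, only the sched_new_task_windows tunable is from the
commit:

    #include <stdbool.h>
    #include <stdio.h>

    /* Stands in for the sched_new_task_windows tunable (value made up). */
    static int sysctl_sched_new_task_windows = 5;

    struct task {
        int active_windows;  /* windows in which the task has been active */
    };

    /* A task counts as new while it has been active for fewer windows than
     * the tunable allows; its load can then be reported separately so the
     * governor can apply a different policy. */
    static bool is_new_task(const struct task *p)
    {
        return p->active_windows < sysctl_sched_new_task_windows;
    }

    int main(void)
    {
        struct task p = { .active_windows = 2 };
        printf("new task: %d\n", is_new_task(&p));  /* 1 */
        return 0;
    }
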
Syed Rameez Mustafa
ca42a1bec8 sched: add frequency zone awareness to the load balancer
Add zone awareness to the load balancer. Remove all earlier restrictions
that the load balancer had for inter-cluster kicks and migrations.

Change-Id: I12ad3d0c2d2e9bb498f49a231810f2ad418b061f
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
[joonwoop@codeaurora.org: fixed minor conflict in nohz_kick_needed() due
 to its return type change.]
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2016-03-23 20:02:21 -07:00
Syed Rameez Mustafa
d590f25153 sched: remove the notion of small tasks and small task packing
Task packing will now be determined solely on the basis of the
power cost of task placement. All tasks are eligible for packing.
Remove the notion of "small" tasks from the scheduler.

Change-Id: I72d52d04b2677c6a8d0bc6aa7d50ff0f1a4f5ebb
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
2016-03-23 20:02:19 -07:00
Syed Rameez Mustafa
f2ea07a155 sched: Rework energy aware scheduling
Energy-aware core rotation is not compatible with the power-based task
placement being introduced in subsequent patches.
Remove all existing EA-based task placement/migration logic.
power_cost() is the only function remaining. This function has
been modified to return the total power cost associated with a
task on a given CPU, taking the existing load on that CPU into
account.

Change-Id: Ia00501e3cbfc6e11446a9a2e93e318c4c42bdab4
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
[joonwoop@codeaurora.org: fixed multiple conflicts in fair.c and minor
 conflict in features.h]
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2016-03-23 20:02:18 -07:00
Joonwoo Park
b40bf941f6 sched: add scheduling latency tracking procfs node
Add a new procfs node /proc/sys/kernel/sched_max_latency_us to track the
worst scheduling latency.  It provides an easier way to identify the
maximum scheduling latency seen across the CPUs.

Change-Id: I6e435bbf825c0a4dff2eded4a1256fb93f108d0e
[joonwoop@codeaurora.org: fixed conflict in update_stats_wait_end().]
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2016-03-23 20:01:50 -07:00
Joonwoo Park
8f90803a45 sched: warn/panic upon excessive scheduling latency
Add new tunables /proc/sys/kernel/sched_latency_warn_threshold_us and
/proc/sys/kernel/sched_latency_panic_threshold_us to warn or panic when
tasks are runnable but not scheduled for more than the configured time.

This helps to find unacceptably high scheduling latency more easily.

Change-Id: If077aba6211062cf26ee289970c5abcd1c218c82
[joonwoop@codeaurora.org: fixed conflict in update_stats_wait_end().]
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2016-03-23 20:01:49 -07:00
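
A userspace C sketch of the threshold checks; the function name is made up,
the two tunables are from the commit, and abort() stands in for the kernel's
panic():

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Stand in for the two procfs tunables (values made up; 0 = disabled). */
    static uint64_t latency_warn_threshold_us  = 100000;
    static uint64_t latency_panic_threshold_us = 1000000;

    /* Called where wait time is accounted (the commit touches
     * update_stats_wait_end()): act on runnable-but-not-scheduled time. */
    static void check_sched_latency(uint64_t wait_us)
    {
        if (latency_panic_threshold_us && wait_us > latency_panic_threshold_us) {
            fprintf(stderr, "task runnable for %llu us\n",
                    (unsigned long long)wait_us);
            abort();  /* the kernel would panic() here */
        }
        if (latency_warn_threshold_us && wait_us > latency_warn_threshold_us)
            fprintf(stderr, "warning: task runnable for %llu us\n",
                    (unsigned long long)wait_us);
    }

    int main(void)
    {
        check_sched_latency(150000);  /* above warn, below panic: warns only */
        return 0;
    }
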
Srivatsa Vaddagiri
5b45dc56e5 sched: Per-cpu prefer_idle flag
Remove the global sysctl_sched_prefer_idle flag and replace it with a
per-cpu prefer_idle flag. The per-cpu flag is expected to be the same
for all cpus in a cluster. It thus provides a convenient means to
disable packing in one cluster while allowing packing in another.

Change-Id: Ie4cc73bb1a55b4eac5697be38e558546161faca1
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2016-03-23 20:01:26 -07:00
Olav Haugan
3f947e7ba7 sched: Add sysctl to enable power aware scheduling
Add a sysctl to enable energy awareness at runtime. This is useful for
performance/power tuning, measurement and debugging. In addition, this
matches up with the Documentation/scheduler/sched-hmp.txt documentation.

Change-Id: I0a9185498640d66917b38bf5d55f6c59fc60ad5c
Signed-off-by: Olav Haugan <ohaugan@codeaurora.org>
[rameezmustafa@codeaurora.org: Port to msm-3.18]
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
2016-03-23 20:01:24 -07:00
Srivatsa Vaddagiri
29a412dffa sched: Avoid frequent migration of running task
Power values for cpus can drop quite considerably when they go idle.
As a result, the best choice for running a single task in a cluster
can vary quite rapidly. As the task keeps hopping cpus, other cpus go
idle and start being seen as more favorable targets for running a task,
leading to the task migrating almost every scheduler tick!

Prevent this by keeping track of when a task started running on a cpu
and allowing task migration in the tick path (migration_needed()) on
account of energy efficiency only if the task has run sufficiently
long (as determined by the sysctl_sched_min_runtime variable).

Note that currently the sysctl_sched_min_runtime setting is considered
only in the scheduler_tick()->migration_needed() path and not in the
idle_balance() path. In other words, a task could be migrated to
another cpu which did an idle_balance(). This limitation should not
affect the high-frequency migrations typically seen (when a single
high-demand task runs on a high-performance cpu).

CRs-Fixed: 756570
Change-Id: I96413b7a81b623193c3bbcec6f3fa9dfec367d99
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
[joonwoop@codeaurora.org: fixed conflict in set_task_cpu() and
 __schedule().]
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2016-03-23 20:01:17 -07:00
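
The tick-path gate described above, as a hedged C sketch (only
sysctl_sched_min_runtime and migration_needed() are named in the commit; the
rest is illustrative):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    static uint64_t sysctl_sched_min_runtime_ns = 8000000;  /* made-up 8ms */

    struct task {
        uint64_t run_start_ns;  /* recorded when the task starts on a cpu */
    };

    /* cf. migration_needed(): permit an energy-motivated migration only
     * after the task has run long enough on its current cpu. */
    static bool ea_migration_allowed(const struct task *p, uint64_t now_ns)
    {
        return (now_ns - p->run_start_ns) >= sysctl_sched_min_runtime_ns;
    }

    int main(void)
    {
        struct task p = { .run_start_ns = 0 };
        printf("%d %d\n",
               ea_migration_allowed(&p, 1000000),   /* 0: ran only 1ms */
               ea_migration_allowed(&p, 9000000));  /* 1: ran 9ms      */
        return 0;
    }
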
Srivatsa Vaddagiri
72b7c5d36c sched: Provide knob to prefer mostly_idle over idle cpus
sysctl_sched_prefer_idle lets the scheduler bias the selection of idle
cpus over mostly-idle cpus for tasks. This knob could be useful for
controlling the balance between power and performance.

Change-Id: Ide6eef684ef94ac8b9927f53c220ccf94976fe67
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2016-03-23 20:01:12 -07:00
Steve Muckle
588055e8c7 sched: make sched_cpu_high_irqload a runtime tunable
It may be desirable to be able to alter the sched_cpu_high_irqload
setting easily, so make it a runtime tunable value.

Change-Id: I832030eec2aafa101f0f435a4fd2d401d447880d
Signed-off-by: Steve Muckle <smuckle@codeaurora.org>
2016-03-23 20:01:11 -07:00
Srivatsa Vaddagiri
b2e57842c0 sched: per-cpu mostly_idle threshold
The sched_mostly_idle_load and sched_mostly_idle_nr_run knobs help pack
tasks on cpus to some extent. In some cases, it may be desirable to
have different packing limits for different cpus. For example, pack to
a higher limit on high-performance cpus compared to power-efficient
cpus.

This patch removes the global mostly_idle tunables and makes them
per-cpu, thus letting task packing behavior be controlled in a
fine-grained manner.

Change-Id: Ifc254cda34b928eae9d6c342ce4c0f64e531e6c2
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2016-03-23 20:00:59 -07:00
Srivatsa Vaddagiri
98f89f00dc sched: update governor notification logic
Make the criteria for notifying the governor per-cpu. The governor is
notified of any large change in a cpu's busy time statistics
(rq->prev_runnable_sum) since the last reported value.

Change-Id: I727354d994d909b166d093b94d3dade7c7dddc0d
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2016-03-23 20:00:54 -07:00
Srivatsa Vaddagiri
c12a2b5ab9 sched: Use absolute scale for notifying governor
Make the tunables used for deciding the need for notification be on an
absolute scale. The earlier scale (in percent terms relative to
cur_freq) does not work well across the available range of frequencies.
For example, a 100% tunable value would work well for the lower range
of frequencies but not for the higher range. Having the tunables on an
absolute scale makes tuning more realistic.

Change-Id: I35a8c4e2f2e9da57f4ca4462072276d06ad386f1
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2016-03-23 20:00:51 -07:00
Srivatsa Vaddagiri
3a67b4ce87 sched: window-stats: Enhance cpu busy time accounting
rq->curr/prev_runnable_sum counters represent cpu demand from the
various tasks that have run on a cpu. Any task that runs on a cpu will
have a representation in rq->curr_runnable_sum: its partial_demand
value is included in rq->curr_runnable_sum. Since partial_demand is
derived from historical load samples for a task, rq->curr_runnable_sum
could represent "inflated/unrealistic" cpu usage. As an example, let's
say that a task with a partial_demand of 10ms runs for only 1ms on a
cpu. What is included in rq->curr_runnable_sum is 10ms (and not the
actual execution time of 1ms). This leads to cpu busy time being
over-reported, causing frequency to stay higher than necessary.

This patch fixes the cpu busy accounting scheme to strictly represent
actual usage. It also provides for conditional fixup of busy time upon
migration and upon heavy-task wakeup.

CRs-Fixed: 691443
Change-Id: Ic4092627668053934049af4dfef65d9b6b901e6b
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
[joonwoop@codeaurora.org: fixed conflict in init_task_load();
 se.avg.decay_count has been deprecated.]
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2016-03-23 20:00:50 -07:00
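
The example in the message, worked through in a few lines of C
(illustrative only):

    #include <stdio.h>

    int main(void)
    {
        unsigned int partial_demand_ms = 10;  /* historical demand estimate  */
        unsigned int exec_time_ms      = 1;   /* what the task actually ran  */

        /* pre-patch: busy time inflated by the 9ms the task never ran */
        unsigned int old_busy = partial_demand_ms;
        /* post-patch: busy time strictly reflects actual usage */
        unsigned int new_busy = exec_time_ms;

        printf("old: %u ms, new: %u ms\n", old_busy, new_busy);
        return 0;
    }
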
Srivatsa Vaddagiri
c9d0953c31 sched: improve logic for alerting governor
Currently we send notifications to the governor without taking note of
cpus that are synchronized with regard to their frequency. As a result,
the scheduler could send pointless notifications (notification spam!).

Avoid this by considering synchronized cpus and alerting the governor
only when the highest demand of any cpu within the cluster far exceeds
or falls behind the current frequency.

Change-Id: I74908b5a212404ca56b38eb94548f9b1fbcca33d
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2016-03-23 20:00:48 -07:00
Srivatsa Vaddagiri
d8932ae7df sched: window-stats: legacy mode
Support legacy mode, which results in the governor seeing busy time
close to what it would have seen via the existing APIs, i.e.
get_cpu_idle_time_us(), get_cpu_iowait_time_us() and
get_cpu_idle_time_jiffy(). In particular, legacy mode means that only
task execution time is counted in rq->curr_runnable_sum and
rq->prev_runnable_sum. Also, task migration does not result in
adjustment of those counters.

Change-Id: If374ccc084aa73f77374b6b3ab4cd0a4ca7b8c90
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2016-03-23 20:00:26 -07:00
Srivatsa Vaddagiri
e39131c3be sched: window-stats: Code cleanup
Remove code duplication associated with the update of various
window-stats related sysctl tunables.

Change-Id: I64e29ac065172464ba371a03758937999c42a71f
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2016-03-23 20:00:24 -07:00
Olav Haugan
8eede4a8d5 sched: Make RAVG_HIST_SIZE tunable
Make RAVG_HIST_SIZE available from /proc/sys/kernel/sched_ravg_hist_size
to allow tuning of the size of the history that is used in computation
of task demand.

CRs-fixed: 706138
Change-Id: Id54c1e4b6e974a62d787070a0af1b4e8ce3b4be6
Signed-off-by: Olav Haugan <ohaugan@codeaurora.org>
[joonwoop@codeaurora.org: fixed minor conflict in sysctl.h]
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2016-03-23 20:00:19 -07:00
Srivatsa Vaddagiri
4641b37da8 sched: window-stats: Allow acct_wait_time to be tuned
Add a sysctl interface to tune the sched_acct_wait_time variable at runtime.

Change-Id: I38339cdb388a507019e429709a7c28e80b5b3585
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2016-03-23 20:00:15 -07:00
Srivatsa Vaddagiri
1ffae4dc94 sched: window-stats: Handle policy change properly
sched_window_stat_policy influences task demand and thus various
statistics maintained per-cpu, like curr_runnable_sum. Changing the
policy non-atomically would lead to improper accounting. For example,
when a task is enqueued on a cpu's runqueue, the demand added to
rq->cumulative_runnable_avg could be based on the AVG policy, and when
it is dequeued the demand removed could be based on MAX, leading to
erroneous accounting.

This change makes the policy change "atomic", i.e. all cpus' rq->lock
are held and all tasks' window-stats are reset before the policy is
changed.

Change-Id: I6a3e4fb7bc299dfc5c367693b5717a1ef518c32d
CRs-Fixed: 687409
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
[joonwoop@codeaurora.org: fixed minor conflict in
 include/linux/sched/sysctl.h.]
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2016-03-23 20:00:03 -07:00
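
A pthread-based sketch of the "atomic" switch described above; the structure
(hold every runqueue lock, reset window stats, then flip the policy) follows
the commit, while all names and the mutexes standing in for rq->lock are
illustrative:

    #include <pthread.h>

    #define NR_CPUS 8

    static pthread_mutex_t rq_lock[NR_CPUS] = {
        [0 ... NR_CPUS - 1] = PTHREAD_MUTEX_INITIALIZER  /* GCC range init */
    };
    static int window_stat_policy;

    static void reset_all_window_stats(void)
    {
        /* walk all tasks and zero their window statistics */
    }

    static void set_window_stat_policy(int policy)
    {
        int cpu;

        for (cpu = 0; cpu < NR_CPUS; cpu++)
            pthread_mutex_lock(&rq_lock[cpu]);   /* hold every rq->lock */

        reset_all_window_stats();  /* no sample may mix AVG and MAX */
        window_stat_policy = policy;

        for (cpu = NR_CPUS - 1; cpu >= 0; cpu--)
            pthread_mutex_unlock(&rq_lock[cpu]);
    }

    int main(void)
    {
        set_window_stat_policy(1 /* e.g. MAX */);
        return 0;
    }
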
Srivatsa Vaddagiri
f27b626521 sched: remove sysctl control for HMP and power-aware task placement
There is no real need to control HMP and power-aware task placement at
runtime after the kernel has booted. Boot-time control should be
sufficient. Not allowing for runtime (sysctl) support simplifies the
code quite a bit.

Also rename sysctl_sched_enable_hmp_task_placement to something shorter.

Change-Id: I60cae51a173c6f73b79cbf90c50ddd41a27604aa
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
[joonwoop@codeaurora.org: fixed minor conflict.  p->nr_cpus_allowed == 1
 has moved to core.c]
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2016-03-23 19:59:55 -07:00
Syed Rameez Mustafa
754f666131 sched/fair: Introduce scheduler boost for low latency workloads
Certain low-latency bursty workloads require immediate use of the
highest-capacity CPUs in HMP systems. Existing load tracking mechanisms
may be unable to respond to the sudden surge in system load within the
latency requirements. Introduce the scheduler boost feature for such
workloads. While boost is in effect the scheduler bypasses regular
load-based task placement and prefers the highest-capacity CPUs in the
system for all non-small fair sched class tasks. Provide both a kernel
and a userspace API for software that may have a priori knowledge about
the system workload.

Change-Id: I783f585d1f8c97219e629d9c54f712318821922f
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
[joonwoop@codeaurora.org: fixed minor conflict in
 include/linux/sched/sysctl.h.]
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2016-03-23 19:59:45 -07:00
Steve Muckle
476ea8d45d sched: notify cpufreq on over/underprovisioned CPUs
After a migration occurs, the source and destination CPUs may
not be running at frequencies which match the new task load on
those CPUs.

Previously, the scheduler notified cpufreq anytime a task
greater than a certain size migrated. This is suboptimal, however,
since it does not take into account the CPU's current
frequency and other task activity that may be present.

Change-Id: I5092bda3a517e1343f97e5a455957c25ee19b549
Signed-off-by: Steve Muckle <smuckle@codeaurora.org>
2016-03-23 19:59:32 -07:00
Syed Rameez Mustafa
a536cf8ac8 sched: Introduce spill threshold tunables to manage overcommitment
When the number of tasks intended for a cluster exceeds the number of
mostly idle CPUs in that cluster, the scheduler currently freely uses
CPUs in other clusters if possible. While this is optimal for
performance, the power trade-off can be quite significant. Introduce
spill threshold tunables that govern the extent to which the scheduler
should attempt to contain tasks within a cluster.

Change-Id: I797e6c6b2aa0c3a376dad93758abe1d587663624
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
[rameezmustafa@codeaurora.org: Port to msm-3.18]
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
[joonwoop@codeaurora.org: fixed conflict in nohz_kick_needed()]
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2016-03-23 19:59:32 -07:00
Steve Muckle
f469bce8e2 sched: add migration load change notifier for frequency guidance
When a task moves between CPUs in two different frequency domains,
the cpufreq governor may wish to immediately modify the frequency
of both the source and destination CPUs of the migrating task.

A tunable is provided to establish what size task is considered
"significant" enough to warrant notifying cpufreq.

Also fix a bug that would cause load to not be accounted properly
during wakeup migrations.

Change-Id: Ie8f6b1cc4d43a602840dac18590b42a81327c95a
Signed-off-by: Steve Muckle <smuckle@codeaurora.org>
[rameezmustafa@codeaurora.org: Add double rq locking for set_task_cpu()]
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
2016-03-23 19:59:29 -07:00
Steve Muckle
ec7d8cc076 sched: add power aware scheduling sysctl
The sched_enable_power_aware sysctl will control whether
or not scheduling decisions are influenced by the power
consumption of individual CPUs.

Change-Id: I312f892cf76a3fccc4ecc8aa6703908b205267f0
Signed-off-by: Steve Muckle <smuckle@codeaurora.org>
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2016-03-23 19:59:21 -07:00
Srivatsa Vaddagiri
7cec8569e3 sched: Basic task placement support for HMP systems
HMP systems have cpus with different power and performance
characteristics. Some cpus could offer better power at the cost of
lower performance while other cpus could offer better performance at
the cost of higher power. As a result, the bandwidth consumed by a task
to do some "fixed" amount of work could vary across cpus.

Optimal task placement on HMP would involve placing a task on a cpu
where it can meet its performance goals at the lowest power cost. Since
the kernel has little to no awareness of the performance goals of
applications, we guesstimate whether a task is meeting its performance
goals or not by looking at its cpu bandwidth consumption. High
bandwidth consumption could imply that the task's performance can
improve by running on cpus with better capacity/performance
characteristics.

This patch makes the basic changes to support HMP. It provides a
configurable threshold, and any task consuming bandwidth in excess of
the threshold will be placed on a cpu with better capacity.

Change-Id: I3fd98edd430f73342fbef06411e8b2d1cf2f56fa
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
[rameezmustafa@codeaurora.org: Port to msm-3.18]
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
[joonwoop@codeaurora.org: fixed conflict about members of p->se which
 are not available anymore.]
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2016-03-23 19:59:03 -07:00
Srivatsa Vaddagiri
a25a5c1c30 sched: window-based load stats improvements
The following cleanups and improvements are made to the window-based
load stats feature:

* Add a sysctl to pick max, avg or most recent samples as a task's
  demand.

* Fix an overflow possibility in the calculation of the sum for the
  average policy.

* Use unscaled statistics when a task is running on a CPU which is
  thermally throttled.

Change-Id: I8293565ca0c2a785dadf8adb6c67f579a445ed29
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2016-03-23 19:58:58 -07:00
Srivatsa Vaddagiri
3967da2dd1 sched: Window-based load stat improvements
Some tasks can have a sporadic load pattern such that they suddenly
start running for longer intervals of time after running for shorter
durations. To recognize such a sharp increase in a task's demand, the
max of the average of 5 window load samples and the most recent sample
is chosen as the task demand.

Make the window size (sched_ravg_window) configurable at boot
time. To prevent users from setting inappropriate values for the window
size, min and max limits are defined. As the 'ravg' struct tracks load
for both real-time and non-real-time tasks, it is moved out of the
sched_entity struct.

In order to prevent changing the function signatures of move_tasks()
and move_one_task(), per-cpu variables are defined to track the total
load moved. In case multiple tasks are selected to migrate in one load
balance operation, loads > 100 could be sent through migration
notifiers. Prevent this scenario by setting mnd.load to 100 in such
cases.

Define wrapper functions to compute cpu demands for tasks and to change
rq->cumulative_runnable_avg.

Change-Id: I9abfbf3b5fe23ae615a6acd3db9580cfdeb515b4
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
Signed-off-by: Rohit Gupta <rohgup@codeaurora.org>
[rameezmustafa@codeaurora.org: Port to msm-3.18 and squash "dcf7256
 sched: window-stats: Fix overflow bug" into this patch.]
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
[joonwoop@codeaurora.org: fixed conflict in __migrate_task().]
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2016-03-23 19:58:53 -07:00
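
The demand heuristic from the first paragraph, as a self-contained C sketch
(the names and the hist[] layout are assumptions):

    #include <stdint.h>
    #include <stdio.h>

    #define RAVG_HIST_SIZE 5

    static uint32_t task_demand(const uint32_t hist[RAVG_HIST_SIZE])
    {
        uint64_t sum = 0;  /* 64-bit running total guards against overflow */
        uint32_t avg, recent = hist[0];  /* assume hist[0] is the newest */

        for (int i = 0; i < RAVG_HIST_SIZE; i++)
            sum += hist[i];
        avg = (uint32_t)(sum / RAVG_HIST_SIZE);

        /* max(avg of 5 samples, most recent) catches sudden load spikes */
        return avg > recent ? avg : recent;
    }

    int main(void)
    {
        uint32_t hist[RAVG_HIST_SIZE] = { 90, 10, 10, 10, 10 };  /* sharp rise */
        printf("demand = %u\n", task_demand(hist));  /* 90, not the avg of 26 */
        return 0;
    }
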
Rohit Gupta
e3fe80da05 sched: Call the notify_on_migrate notifier chain for wakeups as well
Add a change to send notify_on_migrate hints on wakeups of
foreground tasks from the scheduler if their load is above
sched_wakeup_load_threshold (default value 60).
These hints can be used to choose an appropriate CPU frequency
corresponding to the load of the task being woken up.

By default sched_wakeup_load_threshold is set to 60 and therefore
wakeup hints are sent out for those tasks whose loads are higher than
that value. This might cause unnecessary wakeup boosts to happen
when load-based syncing is turned ON for cpu-boost.
Disable the wakeup hints by setting sched_wakeup_load_threshold
to a value higher than 100 so that wakeup boost doesn't happen unless
it is explicitly turned ON from the adb shell.

Change-Id: Ieca413c1a8bd2b14a15a7591e8e15d22925c42ca
Signed-off-by: Rohit Gupta <rohgup@codeaurora.org>
[rameezmustafa@codeaurora.org: Squash "a26fcce sched: Disable wakeup
 hints for foreground tasks by default" into this patch and update
 commit text.]
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
2016-03-23 19:58:52 -07:00
Srivatsa Vaddagiri
74463329e4 sched: window-based load stats for tasks
Provide a metric per task that specifies how cpu-bound a task is. Task
execution is monitored over several time windows and the fraction of
the window for which the task was found to be executing or wanting to
run is recorded as the task's demand. Windows over which the task was
sleeping are ignored. We track the last 5 recent windows for every task
and the maximum demand seen in any of those 5 windows (where the task
had some activity) drives the frequency demand for every task.

A per-cpu metric (rq->cumulative_runnable_avg) is also provided, which
is an aggregation of the cpu demand of all tasks currently enqueued on
the cpu. rq->cumulative_runnable_avg is useful for knowing whether the
cpu frequency needs to be changed to match task demand.

Change-Id: Ib83207b9ba8683cd3304ee8a2290695c34f08fe2
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
[rameezmustafa@codeaurora.org: Port to msm-3.18]
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
[joonwoop@codeaurora.org: fixed conflict in ttwu_do_wakeup() to
 incorporate with changed trace_sched_wakeup() location.]
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2016-03-23 19:58:39 -07:00
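
A sketch of the per-task windowed tracking and the per-cpu aggregate
described above; all identifiers are illustrative:

    #include <stdint.h>
    #include <stdio.h>

    #define NUM_WINDOWS 5

    struct task {
        uint32_t busy_pct[NUM_WINDOWS];  /* % of each window spent runnable */
        uint32_t demand;
    };

    /* Demand is the maximum over the last 5 windows with activity. */
    static void update_demand(struct task *p)
    {
        uint32_t max = 0;
        for (int i = 0; i < NUM_WINDOWS; i++)
            if (p->busy_pct[i] > max)
                max = p->busy_pct[i];
        p->demand = max;
    }

    int main(void)
    {
        struct task a = { .busy_pct = { 20, 75, 30, 10, 5 } };
        struct task b = { .busy_pct = { 40, 40, 40, 40, 40 } };

        update_demand(&a);
        update_demand(&b);
        /* rough analogue of rq->cumulative_runnable_avg with both enqueued */
        printf("cumulative demand: %u\n", a.demand + b.demand);  /* 115 */
        return 0;
    }
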
Steve Muckle
03ab01a318 sched: add sysctl for controlling task migrations on wake
The PF_WAKE_UP_IDLE per-task flag made it impossible to enable
the old behavior of SD_SHARE_PKG_RESOURCES, where every task
migrates to an idle CPU on wakeup.

The sched_wake_to_idle sysctl value, when made nonzero, will cause
all tasks to migrate to an idle CPU if one is available when the
task is woken up. This is regardless of how PF_WAKE_UP_IDLE is
configured for tasks in the system. Similar to PF_WAKE_UP_IDLE,
the SD_SHARE_PKG_RESOURCES scheduler domain flag must be enabled
for the sysctl value to have an effect.

Change-Id: I23bed846d26502c7aed600bfcf1c13053a7e5f61
Signed-off-by: Steve Muckle <smuckle@codeaurora.org>
(cherry picked from commit 9d5b38dc0025d19df5b756b16024b4269e73f282)
2016-03-23 19:58:30 -07:00
Juri Lelli
2726d6ce38 sched/deadline: Unify dl_time_before() usage
Move the dl_time_before() static definition into
include/linux/sched/deadline.h so that it can be used by different
parties without being re-defined.

Reported-by: Luca Abeni <luca.abeni@unitn.it>
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1441188096-23021-3-git-send-email-juri.lelli@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-23 09:51:25 +02:00
Thomas Gleixner
bc7a34b8b9 timer: Reduce timer migration overhead if disabled
Eric reported that the timer_migration sysctl is not really nice
performance-wise, as it needs to check at every timer insertion whether
the feature is enabled or not. Further, the check does not live in the
timer code, so we have an extra function call which checks an extra
cache line to figure out that it is disabled.

We can do better and store that information in the per cpu (hr)timer
bases. I pondered using a static key, but that's a nightmare to
update from the nohz code, and the timer base cache line is hot anyway
when we select a timer base.

The old logic enabled the timer migration unconditionally if
CONFIG_NO_HZ was set even if nohz was disabled on the kernel command
line.

With this modification, we start off with migration disabled. The user
visible sysctl is still set to enabled. If the kernel switches to NOHZ,
migration is enabled unless the user disabled it via the sysctl prior
to the switch. If nohz=off is on the kernel command line, migration
stays disabled no matter what.

Before:
  47.76%  hog       [.] main
  14.84%  [kernel]  [k] _raw_spin_lock_irqsave
   9.55%  [kernel]  [k] _raw_spin_unlock_irqrestore
   6.71%  [kernel]  [k] mod_timer
   6.24%  [kernel]  [k] lock_timer_base.isra.38
   3.76%  [kernel]  [k] detach_if_pending
   3.71%  [kernel]  [k] del_timer
   2.50%  [kernel]  [k] internal_add_timer
   1.51%  [kernel]  [k] get_nohz_timer_target
   1.28%  [kernel]  [k] __internal_add_timer
   0.78%  [kernel]  [k] timerfn
   0.48%  [kernel]  [k] wake_up_nohz_cpu

After:
  48.10%  hog       [.] main
  15.25%  [kernel]  [k] _raw_spin_lock_irqsave
   9.76%  [kernel]  [k] _raw_spin_unlock_irqrestore
   6.50%  [kernel]  [k] mod_timer
   6.44%  [kernel]  [k] lock_timer_base.isra.38
   3.87%  [kernel]  [k] detach_if_pending
   3.80%  [kernel]  [k] del_timer
   2.67%  [kernel]  [k] internal_add_timer
   1.33%  [kernel]  [k] __internal_add_timer
   0.73%  [kernel]  [k] timerfn
   0.54%  [kernel]  [k] wake_up_nohz_cpu


Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Joonwoo Park <joonwoop@codeaurora.org>
Cc: Wenbo Wang <wenbo.wang@memblaze.com>
Link: http://lkml.kernel.org/r/20150526224512.127050787@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-06-19 15:18:28 +02:00
Thomas Gleixner
0782e63bc6 sched: Handle priority boosted tasks proper in setscheduler()
Ronny reported that the following scenario is not handled correctly:

	T1 (prio = 10)
	   lock(rtmutex);

	T2 (prio = 20)
	   lock(rtmutex)
	      boost T1

	T1 (prio = 20)
	   sys_set_scheduler(prio = 30)
	   T1 prio = 30
	   ....
	   sys_set_scheduler(prio = 10)
	   T1 prio = 30

The last step is wrong as T1 should now be back at prio 20.

Commit c365c292d0 ("sched: Consider pi boosting in setscheduler()")
only handles the case where a boosted task tries to lower its
priority.

Fix it by taking the new effective priority into account for the
decision whether a change of the priority is required.

Reported-by: Ronny Meeus <ronny.meeus@gmail.com>
Tested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: <stable@vger.kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Fixes: c365c292d0 ("sched: Consider pi boosting in setscheduler()")
Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1505051806060.4225@nanos
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-08 11:53:55 +02:00
Kirill A. Shutemov
3fb1c8dcfc mm: update comment for DEFAULT_MAX_MAP_COUNT
With ELF extended numbering, the 16-bit bound is not a hard limit any more.

[akpm@linux-foundation.org: fix typo]
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04 16:54:05 -07:00
Dongsheng Yang
7aa2c016db sched: Consolidate open coded implementations of nice level frobbing into nice_to_rlimit() and rlimit_to_nice()
Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/a568a1e3cc8e78648f41b5035fa5e381d36274da.1399532322.git.yangds.fnst@cn.fujitsu.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-05-22 11:16:36 +02:00
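
The consolidated helpers are short; they amount to the following (mirroring
include/linux/sched/prio.h, where RLIMIT_NICE uses the range [1, 40]):

    /*
     * Convert nice value [19,-20] to rlimit style value [1,40]
     * and back.
     */
    static inline long nice_to_rlimit(long nice)
    {
        return (MAX_NICE - nice + 1);
    }

    static inline long rlimit_to_nice(long prio)
    {
        return (MAX_NICE - prio + 1);
    }
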
Dongsheng Yang
3ee237dddc sched/prio: Add 3 macros of MAX_NICE, MIN_NICE and NICE_WIDTH in prio.h
Currently there is a lot of hard coding of 19 and -20 to represent the
maximum and minimum nice values.

This patch adds three macros in prio.h for the maximum, minimum and
width of the nice value range, and uses them to remove hardcoded values
in prio.h.

Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/3994e89327b2b15f992277cdf9f409c516f87d1b.1392103744.git.yangds.fnst@cn.fujitsu.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
[ Collapsed two small patches. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-22 18:14:13 +01:00
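
The three macros amount to the following (as in include/linux/sched/prio.h):

    #define MAX_NICE        19
    #define MIN_NICE        -20
    #define NICE_WIDTH      (MAX_NICE - MIN_NICE + 1)   /* 40 */
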
Dongsheng Yang
7e298d60f7 sched/prio: Use DEFAULT_PRIO to define NICE_TO_PRIO() and PRIO_TO_NICE()
There is already a macro named DEFAULT_PRIO in prio.h; we can use it
to define NICE_TO_PRIO and PRIO_TO_NICE rather than hard coding
(MAX_RT_PRIO + 20).

Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/4e28ec36fb49e8906027cbbdd900ab26a149905e.1392103744.git.yangds.fnst@cn.fujitsu.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-22 18:11:29 +01:00
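
With DEFAULT_PRIO sitting at the middle of the nice range
(MAX_RT_PRIO + NICE_WIDTH / 2, i.e. MAX_RT_PRIO + 20), the two macros become:

    #define NICE_TO_PRIO(nice)  ((nice) + DEFAULT_PRIO)
    #define PRIO_TO_NICE(prio)  ((prio) - DEFAULT_PRIO)

so that NICE_TO_PRIO(0) == DEFAULT_PRIO and the hardcoded arithmetic
disappears.
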
Thomas Gleixner
c365c292d0 sched: Consider pi boosting in setscheduler()
If a PI-boosted task's policy/priority is modified by a setscheduler()
call we unconditionally dequeue and requeue the task if it is on the
runqueue even if the new priority is lower than the current effective
boosted priority. This can result in undesired reordering of the
priority bucket list.

If the new priority is less than or equal to the current effective
priority we just store the new parameters in the task struct and leave
the scheduler class and the runqueue untouched. This is handled when
the task deboosts itself. Only if the new priority is higher than the
effective boosted priority do we apply the change immediately.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
[ Rebase ontop of v3.14-rc1. ]
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1391803122-4425-7-git-send-email-bigeasy@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-22 18:10:04 +01:00
Dongsheng Yang
d0ea026808 sched: Implement task_nice() as static inline function
As patch "sched: Move the priority specific bits into a new header file" exposes
the priority related macros in linux/sched/prio.h, we don't have to implement
task_nice() in kernel/sched/core.c any more.

This patch implements it in linux/sched/sched.h as static inline function,
saving the kernel stack and enhancing performance a bit.

Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
Cc: clark.williams@gmail.com
Cc: rostedt@goodmis.org
Cc: raistlin@linux.it
Cc: juri.lelli@gmail.com
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1390878045-7096-1-git-send-email-yangds.fnst@cn.fujitsu.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-09 15:28:23 +01:00
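
The resulting helper is essentially a one-liner:

    static inline int task_nice(const struct task_struct *p)
    {
        return PRIO_TO_NICE((p)->static_prio);
    }
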
Dongsheng Yang
6b6350f155 sched: Expose some macros related to priority
Some macros in kernel/sched/sched.h about priority are
private to kernel/sched. But they are useful to other
parts of the core kernel.

This patch moves these macros from kernel/sched/sched.h to
include/linux/sched/prio.h so that they are available to
other subsystems.

Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
Cc: raistlin@linux.it
Cc: juri.lelli@gmail.com
Cc: clark.williams@gmail.com
Cc: rostedt@goodmis.org
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/2b022810905b52d13238466807f4b2a691577180.1390859827.git.yangds.fnst@cn.fujitsu.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-09 13:31:51 +01:00
Dongsheng Yang
5c228079ce sched: Move the priority specific bits into a new header file
Some bits about priority are defined in linux/sched/rt.h, but some of
them, such as MAX_PRIO, are not only for the rt scheduler.

This patch moves them all into a new header file, linux/sched/prio.h.

Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
Cc: clark.williams@gmail.com
Cc: rostedt@goodmis.org
Cc: raistlin@linux.it
Cc: juri.lelli@gmail.com
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/f7549508a1588da2c613d601748ca9de30fa5dcf.1390859827.git.yangds.fnst@cn.fujitsu.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-09 13:31:49 +01:00
Linus Torvalds
ab5318788c Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core debug changes from Ingo Molnar:
 "This contains mostly kernel debugging related updates:

   - make hung_task detection more configurable to distros
   - add final bits for x86 UV NMI debugging, with related KGDB changes
   - update the mailing-list of MAINTAINERS entries I'm involved with"

* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  hung_task: Display every hung task warning
  sysctl: Add neg_one as a standard constraint
  x86/uv/nmi, kgdb/kdb: Fix UV NMI handler when KDB not configured
  x86/uv/nmi: Fix Sparse warnings
  kgdb/kdb: Fix no KDB config problem
  MAINTAINERS: Restore "L: linux-kernel@vger.kernel.org" entries
2014-01-31 08:59:46 -08:00
Aaron Tomlin
270750dbc1 hung_task: Display every hung task warning
When khungtaskd detects hung tasks, it prints out
backtraces from a number of those tasks.

Limiting the number of backtraces being printed
out can result in the user not seeing the information
necessary to debug the issue. The hung_task_warnings
sysctl controls this feature.

This patch makes it possible for hung_task_warnings
to accept a special value to print an unlimited
number of backtraces when khungtaskd detects hung
tasks.

The special value is -1. To use this value it is
necessary to change the type from ulong to int.

Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: oleg@redhat.com
Link: http://lkml.kernel.org/r/1390239253-24030-3-git-send-email-atomlin@redhat.com
[ Build warning fix. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-25 12:13:33 +01:00
Andi Kleen
54a43d5498 numa: add a sysctl for numa_balancing
Add a working sysctl to enable/disable automatic numa memory balancing
at runtime.

This allows us to track down performance problems with this feature and
is generally a good idea.

This was possible earlier through debugfs, but only with special
debugging options set.  Also fix the boot message.

[akpm@linux-foundation.org: s/sched_numa_balancing/sysctl_numa_balancing/]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23 16:36:51 -08:00
Peter Zijlstra
1724813d9f sched/deadline: Remove the sysctl_sched_dl knobs
Remove the deadline specific sysctls for now. The problem with them is
that the interaction with the existing rt knobs is nearly impossible
to get right.

The current (as per before this patch) situation is that the rt and dl
bandwidth is completely separate and we enforce rt+dl < 100%. This is
undesirable because this means that the rt default of 95% leaves us
hardly any room, even though dl tasks are safer than rt tasks.

Another proposed solution was (a discarded patch) to have the dl
bandwidth be a fraction of the rt bandwidth. This is highly
confusing imo.

Furthermore neither proposal is consistent with the situation we
actually want; which is rt tasks run from a dl server. In which case
the rt bandwidth is a direct subset of dl.

So whichever way we go, the introduction of dl controls at this point
is painful. Therefore remove them and instead share the rt budget.

This means that for now the rt knobs are used for dl admission control
and the dl runtime is accounted against the rt runtime. I realise that
this isn't entirely desirable either; but whatever we do we appear to
need to change the interface later, so better have a small interface
for now.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-zpyqbqds1r0vyxtxza1e7rdc@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-13 13:47:23 +01:00