evie/android_kernel_oneplus_msm8998


sched: Make separate sched.c translation units Since one needs to do something at conferences and fixing compile warnings doesn't actually require much if any attention I decided to break up the sched.c #include ".c" fest. This further modularizes the scheduler code. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-x0fcd3mnp8f9c99grcpewmhi@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu> 2011-10-25 10:00:11 +02:00
			`#include <linux/sched.h>`
sched: Move sched.h sysctl bits into separate header Move the sysctl-related bits from include/linux/sched.h into a new file: include/linux/sched/sysctl.h. Then update source files requiring access to those bits by including the new header file. Signed-off-by: Clark Williams <williams@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/20130207094659.06dced96@riff.lan Signed-off-by: Ingo Molnar <mingo@kernel.org> 2013-02-07 09:46:59 -06:00			`#include <linux/sched/sysctl.h>`
sched/rt: Move rt specific bits into new header file Move rt scheduler definitions out of include/linux/sched.h into new file include/linux/sched/rt.h Signed-off-by: Clark Williams <williams@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/20130207094707.7b9f825f@riff.lan Signed-off-by: Ingo Molnar <mingo@kernel.org> 2013-02-07 09:47:07 -06:00			`#include <linux/sched/rt.h>`
sched/deadline: Add SCHED_DEADLINE structures & implementation Introduces the data structures, constants and symbols needed for SCHED_DEADLINE implementation. Core data structures of SCHED_DEADLINE are defined, along with their initializers. Hooks for checking if a task belongs to the new policy are also added where they are needed. Adds a scheduling class, in sched/dl.c and a new policy called SCHED_DEADLINE. It is an implementation of the Earliest Deadline First (EDF) scheduling algorithm, augmented with a mechanism (called Constant Bandwidth Server, CBS) that makes it possible to isolate the behaviour of tasks between each other. The typical -deadline task will be made up of a computation phase (instance) which is activated in a periodic or sporadic fashion. The expected (maximum) duration of such computation is called the task's runtime; the time interval by which each instance needs to be completed is called the task's relative deadline. The task's absolute deadline is dynamically calculated as the time instant a task (better, an instance) activates plus the relative deadline. The EDF algorithm selects the task with the smallest absolute deadline as the one to be executed first, while the CBS ensures each task runs for at most its runtime every (relative) deadline length time interval, avoiding any interference between different tasks (bandwidth isolation). Thanks to this feature, also tasks that do not strictly comply with the computational model sketched above can effectively use the new policy. To summarize, this patch: - introduces the data structures, constants and symbols needed; - implements the core logic of the scheduling algorithm in the new scheduling class file; - provides all the glue code between the new scheduling class and the core scheduler and refines the interactions between sched/dl and the other existing scheduling classes. 
Signed-off-by: Dario Faggioli <raistlin@linux.it> Signed-off-by: Michael Trimarchi <michael@amarulasolutions.com> Signed-off-by: Fabio Checconi <fchecconi@gmail.com> Signed-off-by: Juri Lelli <juri.lelli@gmail.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1383831828-15501-4-git-send-email-juri.lelli@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> 2013-11-28 11:14:43 +01:00			`#include <linux/sched/deadline.h>`
			`#include <linux/mutex.h>`
			`#include <linux/spinlock.h>`
			`#include <linux/stop_machine.h>`
sched/rt: Use IPI to trigger RT task push migration instead of pulling When debugging the latencies on a 40 core box, where we hit 300 to 500 microsecond latencies, I found there was a huge contention on the runqueue locks. Investigating it further, running ftrace, I found that it was due to the pulling of RT tasks. The test that was run was the following: cyclictest --numa -p95 -m -d0 -i100 This created a thread on each CPU, that would set its wakeup in iterations of 100 microseconds. The -d0 means that all the threads had the same interval (100us). Each thread sleeps for 100us and wakes up and measures its latencies. cyclictest is maintained at: git://git.kernel.org/pub/scm/linux/kernel/git/clrkwllms/rt-tests.git What happened was another RT task would be scheduled on one of the CPUs that was running our test, when the other CPU tests went to sleep and scheduled idle. This caused the "pull" operation to execute on all these CPUs. Each one of these saw the RT task that was overloaded on the CPU of the test that was still running, and each one tried to grab that task in a thundering herd way. To grab the task, each thread would do a double rq lock grab, grabbing its own lock as well as the rq of the overloaded CPU. As the sched domains on this box were rather flat for its size, I saw up to 12 CPUs block on this lock at once. This caused a ripple effect with the rq locks especially since the taking was done via a double rq lock, which means that several of the CPUs had their own rq locks held while trying to take this rq lock. As these locks were blocked, any wakeups or load balancing on these CPUs would also block on these locks, and the wait time escalated. I've tried various methods to lessen the load, but things like an atomic counter to only let one CPU grab the task won't work, because the task may have a limited affinity, and we may pick the wrong CPU to take that lock and do the pull, to only find out that the CPU we picked isn't in the task's affinity. 
Instead of doing the PULL, I now have the CPUs that want the pull to send over an IPI to the overloaded CPU, and let that CPU pick what CPU to push the task to. No more need to grab the rq lock, and the push/pull algorithm still works fine. With this patch, the latency dropped to just 150us over a 20 hour run. Without the patch, the huge latencies would trigger in seconds. I've created a new sched feature called RT_PUSH_IPI, which is enabled by default. When RT_PUSH_IPI is not enabled, the old method of grabbing the rq locks and having the pulling CPU do the work is implemented. When RT_PUSH_IPI is enabled, the IPI is sent to the overloaded CPU to do a push. To enable or disable this at run time: # mount -t debugfs nodev /sys/kernel/debug # echo RT_PUSH_IPI > /sys/kernel/debug/sched_features or # echo NO_RT_PUSH_IPI > /sys/kernel/debug/sched_features Update: This original patch would send an IPI to all CPUs in the RT overload list. But that could theoretically cause the reverse issue. That is, there could be lots of overloaded RT queues and one CPU lowers its priority. It would then send an IPI to all the overloaded RT queues and they could then all try to grab the rq lock of the CPU lowering its priority, and then we have the same problem. The latest design sends out only one IPI to the first overloaded CPU. It tries to push any tasks that it can, and then looks for the next overloaded CPU that can push to the source CPU. The IPIs stop when all overloaded CPUs that have pushable tasks that have priorities greater than the source CPU are covered. In case the source CPU lowers its priority again, a flag is set to tell the IPI traversal to restart with the first RT overloaded CPU after the source CPU. 
Parts-suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Joern Engel <joern@purestorage.com> Cc: Clark Williams <williams@redhat.com> Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20150318144946.2f3cc982@gandalf.local.home Signed-off-by: Ingo Molnar <mingo@kernel.org> 2015-03-18 14:49:46 -04:00			`#include <linux/irq_work.h>`
sched: Kick full dynticks CPU that have more than one task enqueued. Kick the tick on full dynticks CPUs when they get more than one task running on their queue. This makes sure that local fairness is maintained by the tick on the destination. This is done regardless of these tasks' class. We should be able to be more clever in the future depending on these. eg: a CPU that runs a SCHED_FIFO task doesn't need to maintain fairness against local pending tasks of the fair class. But keep things simple for now. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Christoph Lameter <cl@linux.com> Cc: Geoff Levand <geoff@infradead.org> Cc: Gilad Ben Yossef <gilad@benyossef.com> Cc: Hakan Akkan <hakanakkan@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Kevin Hilman <khilman@linaro.org> Cc: Li Zhong <zhong@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> 2013-04-20 14:35:09 +02:00			`#include <linux/tick.h>`
sched/numa: Track NUMA hinting faults on per-node basis This patch tracks what nodes numa hinting faults were incurred on. This information is later used to schedule a task on the node storing the pages most frequently faulted by the task. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-20-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org> 2013-10-07 11:28:57 +01:00			`#include <linux/slab.h>`
sched: Move all scheduler bits into kernel/sched/ There's too many sched*.[ch] files in kernel/, give them their own directory. (No code changed, other than Makefile glue added.) Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu> 2011-11-15 17:14:39 +01:00			`#include "cpupri.h"`
sched/deadline: speed up SCHED_DEADLINE pushes with a push-heap Data from tests confirmed that the original active load balancing logic didn't scale neither in the number of CPU nor in the number of tasks (as sched_rt does). Here we provide a global data structure to keep track of deadlines of the running tasks in the system. The structure is composed by a bitmask showing the free CPUs and a max-heap, needed when the system is heavily loaded. The implementation and concurrent access scheme are kept simple by design. However, our measurements show that we can compete with sched_rt on large multi-CPUs machines [1]. Only the push path is addressed, the extension to use this structure also for pull decisions is straightforward. However, we are currently evaluating different (in order to decrease/avoid contention) data structures to solve possibly both problems. We are also going to re-run tests considering recent changes inside cpupri [2]. [1] http://retis.sssup.it/~jlelli/papers/Ospert11Lelli.pdf [2] http://www.spinics.net/lists/linux-rt-users/msg06778.html Signed-off-by: Juri Lelli <juri.lelli@gmail.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1383831828-15501-14-git-send-email-juri.lelli@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> 2013-11-07 14:43:47 +01:00			`#include "cpudeadline.h"`
sched: Split cpuacct code out of sched.h Add cpuacct.h and let sched.h include it. Signed-off-by: Li Zefan <lizefan@huawei.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/5155367B.2060506@huawei.com Signed-off-by: Ingo Molnar <mingo@kernel.org> 2013-03-29 14:36:43 +08:00			`#include "cpuacct.h"`
sched: Factor out load calculation code from sched/core.c --> sched/proc.c This large chunk of load calculation code can be easily divorced from the main core.c scheduler file, with only a couple prototypes and externs added to a kernel/sched header. Some recent commits expanded the code and the documentation of it, making it large enough to warrant separation. For example, see: 556061b, "sched/nohz: Fix rq->cpu_load[] calculations" 5aaa0b7, "sched/nohz: Fix rq->cpu_load calculations some more" 5167e8d, "sched/nohz: Rewrite and fix load-avg computation -- again" More importantly, it helps reduce the size of the main sched/core.c by yet another significant amount (~600 lines). Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Frederic Weisbecker <fweisbec@gmail.com> Link: http://lkml.kernel.org/r/1366398650-31599-2-git-send-email-paul.gortmaker@windriver.com Signed-off-by: Ingo Molnar <mingo@kernel.org> 2013-04-19 15:10:49 -04:00			`struct rq;`
sched: Let the scheduler see CPU idle states When the cpu enters idle, it stores the cpuidle state pointer in its struct rq instance which in turn could be used to make a better decision when balancing tasks. As soon as the cpu exits its idle state, the struct rq reference is cleared. There are a couple of situations where the idle state pointer could be changed while it is being consulted: 1. For x86/acpi with dynamic c-states, when a laptop switches from battery to AC that could result on removing the deeper idle state. The acpi driver triggers: 'acpi_processor_cst_has_changed' 'cpuidle_pause_and_lock' 'cpuidle_uninstall_idle_handler' 'kick_all_cpus_sync'. All cpus will exit their idle state and the pointed object will be set to NULL. 2. The cpuidle driver is unloaded. Logically that could happen but not in practice because the drivers are always compiled in and 95% of them are not coded to unregister themselves. In any case, the unloading code must call 'cpuidle_unregister_device', that calls 'cpuidle_pause_and_lock' leading to 'kick_all_cpus_sync' as mentioned above. A race can happen if we use the pointer and then one of these two scenarios occurs at the same moment. In order to be safe, the idle state pointer stored in the rq must be used inside a rcu_read_lock section where we are protected with the 'rcu_barrier' in the 'cpuidle_uninstall_idle_handler' function. The idle_get_state() and idle_put_state() accessors should be used to that effect. Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Signed-off-by: Nicolas Pitre <nico@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: linux-pm@vger.kernel.org Cc: linaro-kernel@lists.linaro.org Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/n/tip-@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> 2014-09-04 11:32:09 -04:00			`struct cpuidle_state;`
sched: Add wrapper for checking task_struct::on_rq Implement task_on_rq_queued() and use it everywhere instead of on_rq check. No functional changes. The only exception is we do not use the wrapper in check_for_tasks(), because it requires to export task_on_rq_queued() in global header files. Next patch in series would return it back, so we do not twist it from here to there. Signed-off-by: Kirill Tkhai <ktkhai@parallels.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Paul Turner <pjt@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Kirill Tkhai <tkhai@yandex.ru> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Nicolas Pitre <nicolas.pitre@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1408528052.23412.87.camel@tkhai Signed-off-by: Ingo Molnar <mingo@kernel.org> 2014-08-20 13:47:32 +04:00			`/* task_struct::on_rq states: */`
			`#define TASK_ON_RQ_QUEUED 1`
sched: Teach scheduler to understand TASK_ON_RQ_MIGRATING state This is a new p->on_rq state which will be used to indicate that a task is in a process of migrating between two RQs. It allows to get rid of double_rq_lock(), which we used to use to change a rq of a queued task before. Let's consider an example. To move a task between src_rq and dst_rq we will do the following: raw_spin_lock(&src_rq->lock); /* p is a task which is queued on src_rq */ p = ...; dequeue_task(src_rq, p, 0); p->on_rq = TASK_ON_RQ_MIGRATING; set_task_cpu(p, dst_cpu); raw_spin_unlock(&src_rq->lock); /* * Both RQs are unlocked here. * Task p is dequeued from src_rq * but its on_rq value is not zero. */ raw_spin_lock(&dst_rq->lock); p->on_rq = TASK_ON_RQ_QUEUED; enqueue_task(dst_rq, p, 0); raw_spin_unlock(&dst_rq->lock); While p->on_rq is TASK_ON_RQ_MIGRATING, task is considered as "migrating", and other parallel scheduler actions with it are not available to parallel callers. The parallel caller is spinning till migration is completed. The unavailable actions are changing of cpu affinity, changing of priority etc, in other words all the functionality which used to require task_rq(p)->lock before (and related to the task). To implement TASK_ON_RQ_MIGRATING support we primarily are using the following fact. Most of scheduler users (from which we are protecting a migrating task) use task_rq_lock() and __task_rq_lock() to get the lock of task_rq(p). These primitives know that task's cpu may change, and they are spinning while the lock of the right RQ is not held. We add one more condition into them, so they will be also spinning until the migration is finished. 
Signed-off-by: Kirill Tkhai <ktkhai@parallels.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Paul Turner <pjt@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Kirill Tkhai <tkhai@yandex.ru> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Nicolas Pitre <nicolas.pitre@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1408528062.23412.88.camel@tkhai Signed-off-by: Ingo Molnar <mingo@kernel.org> 2014-08-20 13:47:42 +04:00			`#define TASK_ON_RQ_MIGRATING 2`
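The two on_rq states and the handoff described in the commit message can be sketched in userspace roughly as follows. This is only an illustration under stated assumptions: `struct demo_task` and every `demo_*` function are hypothetical stand-ins rather than kernel code, and the runqueue locking that the real sequence depends on is reduced to comments.

```c
#include <assert.h>

/* Same values as the defines in the header. */
#define TASK_ON_RQ_QUEUED	1
#define TASK_ON_RQ_MIGRATING	2

/* Hypothetical stand-in for task_struct; only the fields used here. */
struct demo_task {
	int on_rq;	/* 0, TASK_ON_RQ_QUEUED or TASK_ON_RQ_MIGRATING */
	int cpu;	/* CPU whose runqueue the task belongs to */
};

static int demo_on_rq_queued(const struct demo_task *p)
{
	return p->on_rq == TASK_ON_RQ_QUEUED;
}

static int demo_on_rq_migrating(const struct demo_task *p)
{
	return p->on_rq == TASK_ON_RQ_MIGRATING;
}

/*
 * First half of the move. In the commit message this runs with the
 * source runqueue lock held; locking is omitted in this sketch.
 */
static void demo_migrate_begin(struct demo_task *p, int dst_cpu)
{
	assert(demo_on_rq_queued(p));
	p->on_rq = TASK_ON_RQ_MIGRATING;
	p->cpu = dst_cpu;
}

/*
 * Second half of the move. In the commit message this runs with the
 * destination runqueue lock held; locking is omitted in this sketch.
 */
static void demo_migrate_finish(struct demo_task *p)
{
	assert(demo_on_rq_migrating(p));
	p->on_rq = TASK_ON_RQ_QUEUED;
}
```

Between `demo_migrate_begin()` and `demo_migrate_finish()` the task is neither queued nor fully off a runqueue, which is the window in which, per the commit message, other callers spin until the migration completes.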
			`extern __read_mostly int scheduler_running;`

			`extern unsigned long calc_load_update;`
			`extern atomic_long_t calc_load_tasks;`

sched: Move the loadavg code to a more obvious location I could not find the loadavg code.. turns out it was hidden in a file called proc.c. It further got mingled up with the cruft per rq load indexes (which we really want to get rid of). Move the per rq load indexes into the fair.c load-balance code (that's the only thing that uses them) and rename proc.c to loadavg.c so we can find it again. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Thomas Gleixner <tglx@linutronix.de> [ Did minor cleanups to the code. ] Signed-off-by: Ingo Molnar <mingo@kernel.org> 2015-04-14 13:19:42 +02:00			`extern void calc_global_load_tick(struct rq *this_rq);`
sched: take into account of governor's frequency max load At present HMP scheduler packs tasks to busy CPU till the CPU's load is 100% to avoid waking up of idle CPU as much as possible. Such aggressive packing leads unintended CPU frequency raise as governor raises the busy CPU's frequency when its load is more than configured frequency max load which can be less than 100%. Fix to take into account of governor's frequency max load and pack tasks only when the CPU's projected load is less than max load to avoid unnecessary frequency raise. Change-Id: I4447e5e0c2fa5214ae7a9128f04fd7585ed0dcac [joonwoop@codeaurora.org: fixed minor conflict in kernel/sched/sched.h] Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> 2015-07-27 16:52:12 -07:00
			`extern long calc_load_fold_active(struct rq *this_rq);`
			`#ifdef CONFIG_SMP`
			`extern void update_cpu_load_active(struct rq *this_rq);`
			`#else`
			`static inline void update_cpu_load_active(struct rq *this_rq) { }`
			`#endif`
			`/*`
			`* Helpers for converting nanosecond timing to jiffy resolution`
			`*/`
			`#define NS_TO_JIFFIES(TIME) ((unsigned long)(TIME) / (NSEC_PER_SEC / HZ))`
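To make the rounding behaviour of the macro above concrete, here is a small userspace sketch. The values of `HZ` and `NSEC_PER_SEC` are assumptions chosen for illustration (in the kernel, `HZ` is a build-time setting), and `demo_ns_to_jiffies()` is a hypothetical wrapper that does not exist in the header.

```c
#include <assert.h>

/* Assumed values for illustration; in the kernel HZ is a build-time setting. */
#define HZ		100
#define NSEC_PER_SEC	1000000000L

/* Same definition as in the header. */
#define NS_TO_JIFFIES(TIME)	((unsigned long)(TIME) / (NSEC_PER_SEC / HZ))

/* Hypothetical wrapper so the macro can be called like a function. */
static unsigned long demo_ns_to_jiffies(unsigned long long ns)
{
	/* An HZ larger than NSEC_PER_SEC would make the divisor zero. */
	assert(NSEC_PER_SEC / HZ > 0);
	return NS_TO_JIFFIES(ns);
}
```

With `HZ` assumed to be 100, one jiffy is 10,000,000 ns, so `demo_ns_to_jiffies(25000000ULL)` yields 2 and anything below one jiffy yields 0: the division truncates toward zero. The cast to `unsigned long` is applied before the division, so on a platform where `unsigned long` is 32 bits a large nanosecond value would be truncated first.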

sched: Move SCHED_LOAD_SHIFT macros to kernel/sched/sched.h They are used internally only. Signed-off-by: Li Zefan <lizefan@huawei.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/5135A771.4070104@huawei.com Signed-off-by: Ingo Molnar <mingo@kernel.org> 2013-03-05 16:06:09 +08:00			`/*`
			`* Increase resolution of nice-level calculations for 64-bit architectures.`
			`* The extra resolution improves shares distribution and load balancing of`
			`* low-weight task groups (eg. nice +19 on an autogroup), deeper taskgroup`
			`* hierarchies, especially on larger systems. This is not a user-visible change`
			`* and does not change the user-interface for setting shares/weights.`
			`*`
			`* We increase resolution only if we have enough bits to allow this increased`
			`* resolution (i.e. BITS_PER_LONG > 32). The costs for increasing resolution`
			`* when BITS_PER_LONG <= 32 are pretty high and the returns do not justify the`
			`* increased costs.`
			`*/`
			`#if 0 /* BITS_PER_LONG > 32 -- currently broken: it increases power usage under light load */`
			`# define SCHED_LOAD_RESOLUTION 10`
			`# define scale_load(w) ((w) << SCHED_LOAD_RESOLUTION)`
			`# define scale_load_down(w) ((w) >> SCHED_LOAD_RESOLUTION)`
			`#else`
			`# define SCHED_LOAD_RESOLUTION 0`
			`# define scale_load(w) (w)`
			`# define scale_load_down(w) (w)`
			`#endif`

			`#define SCHED_LOAD_SHIFT (10 + SCHED_LOAD_RESOLUTION)`
			`#define SCHED_LOAD_SCALE (1L << SCHED_LOAD_SHIFT)`
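The effect of the two `SCHED_LOAD_RESOLUTION` settings can be sketched in userspace by turning the resolution into a function parameter. All `demo_*` names are hypothetical, and the weight 15 used below is an illustrative low weight (I believe it is the value the kernel's weight table gives nice +19, but treat that as an assumption).

```c
#include <assert.h>

#define DEMO_BASE_SHIFT	10	/* the constant 10 in SCHED_LOAD_SHIFT */

/* scale_load() with the resolution passed in instead of compiled in. */
static long demo_scale_load(long w, int res)
{
	assert(res >= 0);
	return w << res;
}

/* scale_load_down() with the resolution passed in. */
static long demo_scale_load_down(long w, int res)
{
	assert(res >= 0);
	return w >> res;
}

/* SCHED_LOAD_SCALE for a given resolution. */
static long demo_load_scale(int res)
{
	assert(res >= 0);
	return 1L << (DEMO_BASE_SHIFT + res);
}

/* Split a weight evenly across ncpus, in scaled units. */
static long demo_share(long w, int ncpus, int res)
{
	assert(ncpus > 0);
	return demo_scale_load(w, res) / ncpus;
}
```

With resolution 0 (the branch currently compiled in), `demo_load_scale(0)` is 1024 and splitting weight 15 across 4 CPUs gives 3, losing the fractional 0.75. With resolution 10 the scale is 1048576 and the same split gives 3840, which is exactly 3.75 in scaled units. That is the extra precision for low-weight groups that the comment describes.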

			`#define NICE_0_LOAD SCHED_LOAD_SCALE`
			`#define NICE_0_SHIFT SCHED_LOAD_SHIFT`

sched/deadline: Add bandwidth management for SCHED_DEADLINE tasks In order of deadline scheduling to be effective and useful, it is important that some method of having the allocation of the available CPU bandwidth to tasks and task groups under control. This is usually called "admission control" and if it is not performed at all, no guarantee can be given on the actual scheduling of the -deadline tasks. Since when RT-throttling has been introduced each task group have a bandwidth associated to itself, calculated as a certain amount of runtime over a period. Moreover, to make it possible to manipulate such bandwidth, readable/writable controls have been added to both procfs (for system wide settings) and cgroupfs (for per-group settings). Therefore, the same interface is being used for controlling the bandwidth distrubution to -deadline tasks and task groups, i.e., new controls but with similar names, equivalent meaning and with the same usage paradigm are added. However, more discussion is needed in order to figure out how we want to manage SCHED_DEADLINE bandwidth at the task group level. Therefore, this patch adds a less sophisticated, but actually very sensible, mechanism to ensure that a certain utilization cap is not overcome per each root_domain (the single rq for !SMP configurations). Another main difference between deadline bandwidth management and RT-throttling is that -deadline tasks have bandwidth on their own (while -rt ones doesn't!), and thus we don't need an higher level throttling mechanism to enforce the desired bandwidth. 
This patch, therefore: - adds system wide deadline bandwidth management by means of: * /proc/sys/kernel/sched_dl_runtime_us, * /proc/sys/kernel/sched_dl_period_us, that determine (i.e., runtime / period) the total bandwidth available on each CPU of each root_domain for -deadline tasks; - couples the RT and deadline bandwidth management, i.e., enforces that the sum of how much bandwidth is being devoted to -rt -deadline tasks to stay below 100%. This means that, for a root_domain comprising M CPUs, -deadline tasks can be created until the sum of their bandwidths stay below: M * (sched_dl_runtime_us / sched_dl_period_us) It is also possible to disable this bandwidth management logic, and be thus free of oversubscribing the system up to any arbitrary level. Signed-off-by: Dario Faggioli <raistlin@linux.it> Signed-off-by: Juri Lelli <juri.lelli@gmail.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1383831828-15501-12-git-send-email-juri.lelli@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> 2013-11-07 14:43:45 +01:00			`/*`
			`* Single value that decides SCHED_DEADLINE internal math precision.`
			`* 10 -> just above 1us`
			`* 9 -> just above 0.5us`
			`*/`
			`#define DL_SCALE (10)`
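The two values in the comment above follow from simple powers of two, assuming the deadline code keeps its times in nanoseconds. The sketch below is a user-space illustration with hypothetical `demo_` names, not the kernel's own arithmetic.

```c
#include <assert.h>
#include <stdint.h>

/*
 * User-space sketch, assuming times are expressed in nanoseconds.
 * Dropping `scale` low bits leaves a granularity of 2^scale ns.
 */
static inline uint64_t demo_dl_granularity_ns(unsigned int scale)
{
	return UINT64_C(1) << scale;
}

/* Value that remains after the low `scale` bits are truncated. */
static inline uint64_t demo_dl_truncate(uint64_t ns, unsigned int scale)
{
	return ns >> scale;
}
```

A scale of 10 gives 1024 ns, just above 1us, and a scale of 9 gives 512 ns, just above 0.5us. Any interval shorter than the granularity truncates to zero.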

sched: Make separate sched.c translation units Since once needs to do something at conferences and fixing compile warnings doesn't actually require much if any attention I decided to break up the sched.c #include ".c" fest. This further modularizes the scheduler code. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-x0fcd3mnp8f9c99grcpewmhi@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu> 2011-10-25 10:00:11 +02:00			`/*`
			`* These are the 'tuning knobs' of the scheduler:`
			`*/`

			`/*`
			`* Single value that denotes runtime == period, i.e. unlimited time.`
			`*/`
			`#define RUNTIME_INF ((u64)~0ULL)`

sched/core: Make policy-testing consistent Most of the policy-tests are done via the <class>_policy() helpers with the notable exception of idle. A new wrapper for valid_policy() has also been added to improve readability in set_load_weight(). This commit does not change the logical behavior of the scheduler core. Signed-off-by: Henrik Austad <henrik@austad.us> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/1441810841-4756-1-git-send-email-henrik@austad.us Signed-off-by: Ingo Molnar <mingo@kernel.org> 2015-09-09 17:00:41 +02:00			`static inline int idle_policy(int policy)`
			`{`
			`return policy == SCHED_IDLE;`
			`}`
sched: Add new scheduler syscalls to support an extended scheduling parameters ABI Add the syscalls needed for supporting scheduling algorithms with extended scheduling parameters (e.g., SCHED_DEADLINE). In general, it makes possible to specify a periodic/sporadic task, that executes for a given amount of runtime at each instance, and is scheduled according to the urgency of their own timing constraints, i.e.: - a (maximum/typical) instance execution time, - a minimum interval between consecutive instances, - a time constraint by which each instance must be completed. Thus, both the data structure that holds the scheduling parameters of the tasks and the system calls dealing with it must be extended. Unfortunately, modifying the existing struct sched_param would break the ABI and result in potentially serious compatibility issues with legacy binaries. For these reasons, this patch: - defines the new struct sched_attr, containing all the fields that are necessary for specifying a task in the computational model described above; - defines and implements the new scheduling related syscalls that manipulate it, i.e., sched_setattr() and sched_getattr(). Syscalls are introduced for x86 (32 and 64 bits) and ARM only, as a proof of concept and for developing and testing purposes. Making them available on other architectures is straightforward. Since no "user" for these new parameters is introduced in this patch, the implementation of the new system calls is just identical to their already existing counterpart. Future patches that implement scheduling policies able to exploit the new data structure must also take care of modifying the sched_*attr() calls accordingly with their own purposes. Signed-off-by: Dario Faggioli <raistlin@linux.it> [ Rewrote to use sched_attr. ] Signed-off-by: Juri Lelli <juri.lelli@gmail.com> [ Removed sched_setscheduler2() for now. 
] Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1383831828-15501-3-git-send-email-juri.lelli@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> 2013-11-07 14:43:36 +01:00			`static inline int fair_policy(int policy)`
			`{`
			`return policy == SCHED_NORMAL || policy == SCHED_BATCH;`
			`}`

sched: Make separate sched.c translation units Since once needs to do something at conferences and fixing compile warnings doesn't actually require much if any attention I decided to break up the sched.c #include ".c" fest. This further modularizes the scheduler code. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-x0fcd3mnp8f9c99grcpewmhi@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu> 2011-10-25 10:00:11 +02:00			`static inline int rt_policy(int policy)`
			`{`
sched: Add new scheduler syscalls to support an extended scheduling parameters ABI Add the syscalls needed for supporting scheduling algorithms with extended scheduling parameters (e.g., SCHED_DEADLINE). In general, it makes possible to specify a periodic/sporadic task, that executes for a given amount of runtime at each instance, and is scheduled according to the urgency of their own timing constraints, i.e.: - a (maximum/typical) instance execution time, - a minimum interval between consecutive instances, - a time constraint by which each instance must be completed. Thus, both the data structure that holds the scheduling parameters of the tasks and the system calls dealing with it must be extended. Unfortunately, modifying the existing struct sched_param would break the ABI and result in potentially serious compatibility issues with legacy binaries. For these reasons, this patch: - defines the new struct sched_attr, containing all the fields that are necessary for specifying a task in the computational model described above; - defines and implements the new scheduling related syscalls that manipulate it, i.e., sched_setattr() and sched_getattr(). Syscalls are introduced for x86 (32 and 64 bits) and ARM only, as a proof of concept and for developing and testing purposes. Making them available on other architectures is straightforward. Since no "user" for these new parameters is introduced in this patch, the implementation of the new system calls is just identical to their already existing counterpart. Future patches that implement scheduling policies able to exploit the new data structure must also take care of modifying the sched_*attr() calls accordingly with their own purposes. Signed-off-by: Dario Faggioli <raistlin@linux.it> [ Rewrote to use sched_attr. ] Signed-off-by: Juri Lelli <juri.lelli@gmail.com> [ Removed sched_setscheduler2() for now. 
] Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1383831828-15501-3-git-send-email-juri.lelli@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> 2013-11-07 14:43:36 +01:00			`return policy == SCHED_FIFO || policy == SCHED_RR;`
sched: Make separate sched.c translation units Since once needs to do something at conferences and fixing compile warnings doesn't actually require much if any attention I decided to break up the sched.c #include ".c" fest. This further modularizes the scheduler code. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-x0fcd3mnp8f9c99grcpewmhi@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu> 2011-10-25 10:00:11 +02:00			`}`

sched/deadline: Add SCHED_DEADLINE structures & implementation Introduces the data structures, constants and symbols needed for SCHED_DEADLINE implementation. Core data structure of SCHED_DEADLINE are defined, along with their initializers. Hooks for checking if a task belong to the new policy are also added where they are needed. Adds a scheduling class, in sched/dl.c and a new policy called SCHED_DEADLINE. It is an implementation of the Earliest Deadline First (EDF) scheduling algorithm, augmented with a mechanism (called Constant Bandwidth Server, CBS) that makes it possible to isolate the behaviour of tasks between each other. The typical -deadline task will be made up of a computation phase (instance) which is activated on a periodic or sporadic fashion. The expected (maximum) duration of such computation is called the task's runtime; the time interval by which each instance need to be completed is called the task's relative deadline. The task's absolute deadline is dynamically calculated as the time instant a task (better, an instance) activates plus the relative deadline. The EDF algorithms selects the task with the smallest absolute deadline as the one to be executed first, while the CBS ensures each task to run for at most its runtime every (relative) deadline length time interval, avoiding any interference between different tasks (bandwidth isolation). Thanks to this feature, also tasks that do not strictly comply with the computational model sketched above can effectively use the new policy. To summarize, this patch: - introduces the data structures, constants and symbols needed; - implements the core logic of the scheduling algorithm in the new scheduling class file; - provides all the glue code between the new scheduling class and the core scheduler and refines the interactions between sched/dl and the other existing scheduling classes. 
Signed-off-by: Dario Faggioli <raistlin@linux.it> Signed-off-by: Michael Trimarchi <michael@amarulasolutions.com> Signed-off-by: Fabio Checconi <fchecconi@gmail.com> Signed-off-by: Juri Lelli <juri.lelli@gmail.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1383831828-15501-4-git-send-email-juri.lelli@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> 2013-11-28 11:14:43 +01:00			`static inline int dl_policy(int policy)`
			`{`
			`return policy == SCHED_DEADLINE;`
			`}`
sched/core: Make policy-testing consistent Most of the policy-tests are done via the <class>_policy() helpers with the notable exception of idle. A new wrapper for valid_policy() has also been added to improve readability in set_load_weight(). This commit does not change the logical behavior of the scheduler core. Signed-off-by: Henrik Austad <henrik@austad.us> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/1441810841-4756-1-git-send-email-henrik@austad.us Signed-off-by: Ingo Molnar <mingo@kernel.org> 2015-09-09 17:00:41 +02:00			`static inline bool valid_policy(int policy)`
			`{`
			`return idle_policy(policy) || fair_policy(policy) ||`
			`rt_policy(policy) || dl_policy(policy);`
			`}`
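The policy helpers above can be exercised outside the kernel with a small sketch. The `DEMO_SCHED_*` constants are local stand-ins that carry the numeric values of the Linux UAPI policy constants (value 4 is reserved there and unused); the `demo_` function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Local stand-ins carrying the Linux UAPI policy numbers. */
#define DEMO_SCHED_NORMAL	0
#define DEMO_SCHED_FIFO		1
#define DEMO_SCHED_RR		2
#define DEMO_SCHED_BATCH	3
#define DEMO_SCHED_IDLE		5
#define DEMO_SCHED_DEADLINE	6

static inline int demo_idle_policy(int policy)
{
	return policy == DEMO_SCHED_IDLE;
}

static inline int demo_fair_policy(int policy)
{
	return policy == DEMO_SCHED_NORMAL || policy == DEMO_SCHED_BATCH;
}

static inline int demo_rt_policy(int policy)
{
	return policy == DEMO_SCHED_FIFO || policy == DEMO_SCHED_RR;
}

static inline int demo_dl_policy(int policy)
{
	return policy == DEMO_SCHED_DEADLINE;
}

/* A policy is valid when exactly one of the class helpers claims it. */
static inline bool demo_valid_policy(int policy)
{
	return demo_idle_policy(policy) || demo_fair_policy(policy) ||
	       demo_rt_policy(policy) || demo_dl_policy(policy);
}
```

Because every class owns a disjoint set of policy numbers, any value outside those sets (such as 4 or a negative number) is rejected.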
sched/deadline: Add SCHED_DEADLINE structures & implementation Introduces the data structures, constants and symbols needed for SCHED_DEADLINE implementation. Core data structure of SCHED_DEADLINE are defined, along with their initializers. Hooks for checking if a task belong to the new policy are also added where they are needed. Adds a scheduling class, in sched/dl.c and a new policy called SCHED_DEADLINE. It is an implementation of the Earliest Deadline First (EDF) scheduling algorithm, augmented with a mechanism (called Constant Bandwidth Server, CBS) that makes it possible to isolate the behaviour of tasks between each other. The typical -deadline task will be made up of a computation phase (instance) which is activated on a periodic or sporadic fashion. The expected (maximum) duration of such computation is called the task's runtime; the time interval by which each instance need to be completed is called the task's relative deadline. The task's absolute deadline is dynamically calculated as the time instant a task (better, an instance) activates plus the relative deadline. The EDF algorithms selects the task with the smallest absolute deadline as the one to be executed first, while the CBS ensures each task to run for at most its runtime every (relative) deadline length time interval, avoiding any interference between different tasks (bandwidth isolation). Thanks to this feature, also tasks that do not strictly comply with the computational model sketched above can effectively use the new policy. To summarize, this patch: - introduces the data structures, constants and symbols needed; - implements the core logic of the scheduling algorithm in the new scheduling class file; - provides all the glue code between the new scheduling class and the core scheduler and refines the interactions between sched/dl and the other existing scheduling classes. 
Signed-off-by: Dario Faggioli <raistlin@linux.it> Signed-off-by: Michael Trimarchi <michael@amarulasolutions.com> Signed-off-by: Fabio Checconi <fchecconi@gmail.com> Signed-off-by: Juri Lelli <juri.lelli@gmail.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1383831828-15501-4-git-send-email-juri.lelli@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> 2013-11-28 11:14:43 +01:00
sched: Make separate sched.c translation units Since once needs to do something at conferences and fixing compile warnings doesn't actually require much if any attention I decided to break up the sched.c #include ".c" fest. This further modularizes the scheduler code. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-x0fcd3mnp8f9c99grcpewmhi@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu> 2011-10-25 10:00:11 +02:00			`static inline int task_has_rt_policy(struct task_struct *p)`
			`{`
			`return rt_policy(p->policy);`
			`}`

sched/deadline: Add SCHED_DEADLINE structures & implementation Introduces the data structures, constants and symbols needed for SCHED_DEADLINE implementation. Core data structure of SCHED_DEADLINE are defined, along with their initializers. Hooks for checking if a task belong to the new policy are also added where they are needed. Adds a scheduling class, in sched/dl.c and a new policy called SCHED_DEADLINE. It is an implementation of the Earliest Deadline First (EDF) scheduling algorithm, augmented with a mechanism (called Constant Bandwidth Server, CBS) that makes it possible to isolate the behaviour of tasks between each other. The typical -deadline task will be made up of a computation phase (instance) which is activated on a periodic or sporadic fashion. The expected (maximum) duration of such computation is called the task's runtime; the time interval by which each instance need to be completed is called the task's relative deadline. The task's absolute deadline is dynamically calculated as the time instant a task (better, an instance) activates plus the relative deadline. The EDF algorithms selects the task with the smallest absolute deadline as the one to be executed first, while the CBS ensures each task to run for at most its runtime every (relative) deadline length time interval, avoiding any interference between different tasks (bandwidth isolation). Thanks to this feature, also tasks that do not strictly comply with the computational model sketched above can effectively use the new policy. To summarize, this patch: - introduces the data structures, constants and symbols needed; - implements the core logic of the scheduling algorithm in the new scheduling class file; - provides all the glue code between the new scheduling class and the core scheduler and refines the interactions between sched/dl and the other existing scheduling classes. 
Signed-off-by: Dario Faggioli <raistlin@linux.it> Signed-off-by: Michael Trimarchi <michael@amarulasolutions.com> Signed-off-by: Fabio Checconi <fchecconi@gmail.com> Signed-off-by: Juri Lelli <juri.lelli@gmail.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1383831828-15501-4-git-send-email-juri.lelli@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> 2013-11-28 11:14:43 +01:00			`static inline int task_has_dl_policy(struct task_struct *p)`
			`{`
			`return dl_policy(p->policy);`
			`}`

sched/deadline: Add SCHED_DEADLINE inheritance logic Some method to deal with rt-mutexes and make sched_dl interact with the current PI-coded is needed, raising all but trivial issues, that needs (according to us) to be solved with some restructuring of the pi-code (i.e., going toward a proxy execution-ish implementation). This is under development, in the meanwhile, as a temporary solution, what this commits does is: - ensure a pi-lock owner with waiters is never throttled down. Instead, when it runs out of runtime, it immediately gets replenished and it's deadline is postponed; - the scheduling parameters (relative deadline and default runtime) used for that replenishments --during the whole period it holds the pi-lock-- are the ones of the waiting task with earliest deadline. Acting this way, we provide some kind of boosting to the lock-owner, still by using the existing (actually, slightly modified by the previous commit) pi-architecture. We would stress the fact that this is only a surely needed, all but clean solution to the problem. In the end it's only a way to re-start discussion within the community. So, as always, comments, ideas, rants, etc.. are welcome! :-) Signed-off-by: Dario Faggioli <raistlin@linux.it> Signed-off-by: Juri Lelli <juri.lelli@gmail.com> [ Added !RT_MUTEXES build fix. ] Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1383831828-15501-11-git-send-email-juri.lelli@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> 2013-11-07 14:43:44 +01:00			`/*`
			`* Tells if entity @a should preempt entity @b.`
			`*/`
sched/deadline: Add bandwidth management for SCHED_DEADLINE tasks In order of deadline scheduling to be effective and useful, it is important that some method of having the allocation of the available CPU bandwidth to tasks and task groups under control. This is usually called "admission control" and if it is not performed at all, no guarantee can be given on the actual scheduling of the -deadline tasks. Since when RT-throttling has been introduced each task group have a bandwidth associated to itself, calculated as a certain amount of runtime over a period. Moreover, to make it possible to manipulate such bandwidth, readable/writable controls have been added to both procfs (for system wide settings) and cgroupfs (for per-group settings). Therefore, the same interface is being used for controlling the bandwidth distrubution to -deadline tasks and task groups, i.e., new controls but with similar names, equivalent meaning and with the same usage paradigm are added. However, more discussion is needed in order to figure out how we want to manage SCHED_DEADLINE bandwidth at the task group level. Therefore, this patch adds a less sophisticated, but actually very sensible, mechanism to ensure that a certain utilization cap is not overcome per each root_domain (the single rq for !SMP configurations). Another main difference between deadline bandwidth management and RT-throttling is that -deadline tasks have bandwidth on their own (while -rt ones doesn't!), and thus we don't need an higher level throttling mechanism to enforce the desired bandwidth. 
This patch, therefore: - adds system wide deadline bandwidth management by means of: * /proc/sys/kernel/sched_dl_runtime_us, * /proc/sys/kernel/sched_dl_period_us, that determine (i.e., runtime / period) the total bandwidth available on each CPU of each root_domain for -deadline tasks; - couples the RT and deadline bandwidth management, i.e., enforces that the sum of how much bandwidth is being devoted to -rt -deadline tasks to stay below 100%. This means that, for a root_domain comprising M CPUs, -deadline tasks can be created until the sum of their bandwidths stay below: M * (sched_dl_runtime_us / sched_dl_period_us) It is also possible to disable this bandwidth management logic, and be thus free of oversubscribing the system up to any arbitrary level. Signed-off-by: Dario Faggioli <raistlin@linux.it> Signed-off-by: Juri Lelli <juri.lelli@gmail.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1383831828-15501-12-git-send-email-juri.lelli@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> 2013-11-07 14:43:45 +01:00			`static inline bool`
			`dl_entity_preempt(struct sched_dl_entity *a, struct sched_dl_entity *b)`
sched/deadline: Add SCHED_DEADLINE inheritance logic Some method to deal with rt-mutexes and make sched_dl interact with the current PI-coded is needed, raising all but trivial issues, that needs (according to us) to be solved with some restructuring of the pi-code (i.e., going toward a proxy execution-ish implementation). This is under development, in the meanwhile, as a temporary solution, what this commits does is: - ensure a pi-lock owner with waiters is never throttled down. Instead, when it runs out of runtime, it immediately gets replenished and it's deadline is postponed; - the scheduling parameters (relative deadline and default runtime) used for that replenishments --during the whole period it holds the pi-lock-- are the ones of the waiting task with earliest deadline. Acting this way, we provide some kind of boosting to the lock-owner, still by using the existing (actually, slightly modified by the previous commit) pi-architecture. We would stress the fact that this is only a surely needed, all but clean solution to the problem. In the end it's only a way to re-start discussion within the community. So, as always, comments, ideas, rants, etc.. are welcome! :-) Signed-off-by: Dario Faggioli <raistlin@linux.it> Signed-off-by: Juri Lelli <juri.lelli@gmail.com> [ Added !RT_MUTEXES build fix. ] Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1383831828-15501-11-git-send-email-juri.lelli@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> 2013-11-07 14:43:44 +01:00			`{`
			`return dl_time_before(a->deadline, b->deadline);`
			`}`
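The preemption test reduces to a comparison of two absolute deadlines. The sketch below is a user-space illustration that assumes `dl_time_before()` is a wraparound-safe comparison implemented as a signed difference; `struct demo_dl_entity` and the `demo_` functions are hypothetical stand-ins, not the kernel definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in holding only the field the comparison needs. */
struct demo_dl_entity {
	uint64_t deadline;	/* absolute deadline */
};

/*
 * Assumed behaviour of dl_time_before(): true when `a` is earlier than `b`,
 * still correct if the 64-bit counter wrapped between the two values.
 * Relies on the usual two's-complement conversion to int64_t.
 */
static inline bool demo_dl_time_before(uint64_t a, uint64_t b)
{
	return (int64_t)(a - b) < 0;
}

/* Entity @a should preempt entity @b when its deadline is earlier. */
static inline bool demo_dl_entity_preempt(const struct demo_dl_entity *a,
					  const struct demo_dl_entity *b)
{
	return demo_dl_time_before(a->deadline, b->deadline);
}
```

A plain `a < b` would give the wrong answer once one deadline has wrapped past zero; the signed difference keeps the ordering correct as long as the two values are less than half the counter range apart.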

sched: Make separate sched.c translation units Since once needs to do something at conferences and fixing compile warnings doesn't actually require much if any attention I decided to break up the sched.c #include ".c" fest. This further modularizes the scheduler code. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-x0fcd3mnp8f9c99grcpewmhi@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu> 2011-10-25 10:00:11 +02:00			`/*`
			`* This is the priority-queue data structure of the RT scheduling class:`
			`*/`
			`struct rt_prio_array {`
			`DECLARE_BITMAP(bitmap, MAX_RT_PRIO+1); /* include 1 bit for delimiter */`
			`struct list_head queue[MAX_RT_PRIO];`
			`};`
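How a bitmap with a delimiter bit can drive priority selection is sketched below in user-space C. It is a simplified illustration: it keeps only the bitmap half of the structure, assumes `MAX_RT_PRIO` is 100, assumes a lower index means a higher priority, and uses a linear scan where the kernel would use its own bit-search helpers. All `demo_`/`DEMO_` names are hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

#define DEMO_MAX_RT_PRIO	100	/* assumed value of MAX_RT_PRIO */
#define DEMO_BITS_PER_LONG	(sizeof(unsigned long) * CHAR_BIT)
#define DEMO_BITMAP_LONGS \
	((DEMO_MAX_RT_PRIO + 1 + DEMO_BITS_PER_LONG - 1) / DEMO_BITS_PER_LONG)

/* One bit per priority level, plus one delimiter bit at DEMO_MAX_RT_PRIO. */
struct demo_prio_bitmap {
	unsigned long bits[DEMO_BITMAP_LONGS];
};

static inline void demo_set_prio(struct demo_prio_bitmap *m, unsigned int prio)
{
	m->bits[prio / DEMO_BITS_PER_LONG] |= 1UL << (prio % DEMO_BITS_PER_LONG);
}

static inline void demo_clear_prio(struct demo_prio_bitmap *m, unsigned int prio)
{
	m->bits[prio / DEMO_BITS_PER_LONG] &= ~(1UL << (prio % DEMO_BITS_PER_LONG));
}

/* Empty array: only the delimiter bit is set, so a search always terminates. */
static inline void demo_init(struct demo_prio_bitmap *m)
{
	memset(m, 0, sizeof(*m));
	demo_set_prio(m, DEMO_MAX_RT_PRIO);
}

/* Lowest set index; DEMO_MAX_RT_PRIO means no priority level has queued tasks. */
static inline unsigned int demo_first_prio(const struct demo_prio_bitmap *m)
{
	unsigned int prio;

	for (prio = 0; prio < DEMO_MAX_RT_PRIO; prio++)
		if (m->bits[prio / DEMO_BITS_PER_LONG] &
		    (1UL << (prio % DEMO_BITS_PER_LONG)))
			return prio;
	return DEMO_MAX_RT_PRIO;
}
```

In the full structure each set bit would correspond to a non-empty `queue[prio]` list, so finding the first set bit identifies the list to pick the next task from.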

			`struct rt_bandwidth {`
			`/* nests inside the rq lock: */`
			`raw_spinlock_t rt_runtime_lock;`
			`ktime_t rt_period;`
			`u64 rt_runtime;`
			`struct hrtimer rt_period_timer;`
sched,perf: Fix periodic timers In the below two commits (see Fixes) we have periodic timers that can stop themselves when they're no longer required, but need to be (re)-started when their idle condition changes. Further complications is that we want the timer handler to always do the forward such that it will always correctly deal with the overruns, and we do not want to race such that the handler has already decided to stop, but the (external) restart sees the timer still active and we end up with a 'lost' timer. The problem with the current code is that the re-start can come before the callback does the forward, at which point the forward from the callback will WARN about forwarding an enqueued timer. Now, conceptually its easy to detect if you're before or after the fwd by comparing the expiration time against the current time. Of course, that's expensive (and racy) because we don't have the current time. Alternatively one could cache this state inside the timer, but then everybody pays the overhead of maintaining this extra state, and that is undesired. The only other option that I could see is the external timer_active variable, which I tried to kill before. I would love a nicer interface for this seemingly simple 'problem' but alas. Fixes: 272325c4821f ("perf: Fix mux_interval hrtimer wreckage") Fixes: 77a4d1a1b9a1 ("sched: Cleanup bandwidth timers") Cc: pjt@google.com Cc: tglx@linutronix.de Cc: klamm@yandex-team.ru Cc: mingo@kernel.org Cc: bsegall@google.com Cc: hpa@zytor.com Cc: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/20150514102311.GX21418@twins.programming.kicks-ass.net 2015-05-14 12:23:11 +02:00			`unsigned int rt_period_active;`
sched: Make separate sched.c translation units Since once needs to do something at conferences and fixing compile warnings doesn't actually require much if any attention I decided to break up the sched.c #include ".c" fest. This further modularizes the scheduler code. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-x0fcd3mnp8f9c99grcpewmhi@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu> 2011-10-25 10:00:11 +02:00			`};`
sched/deadline: Clear dl_entity params when setscheduling to different class When a task is using SCHED_DEADLINE and the user setschedules it to a different class its sched_dl_entity static parameters are not cleaned up. This causes a bug if the user sets it back to SCHED_DEADLINE with the same parameters again. The problem resides in the check we perform at the very beginning of dl_overflow(): if (new_bw == p->dl.dl_bw) return 0; This condition is met in the case depicted above, so the function returns and dl_b->total_bw is not updated (the p->dl.dl_bw is not added to it). After this, admission control is broken. This patch fixes the thing, properly clearing static parameters for a task that ceases to use SCHED_DEADLINE. Reported-by: Daniele Alessandrelli <daniele.alessandrelli@gmail.com> Reported-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Reported-by: Vincent Legout <vincent@legout.info> Tested-by: Luca Abeni <luca.abeni@unitn.it> Tested-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Tested-by: Vincent Legout <vincent@legout.info> Signed-off-by: Juri Lelli <juri.lelli@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Fabio Checconi <fchecconi@gmail.com> Cc: Dario Faggioli <raistlin@linux.it> Cc: Michael Trimarchi <michael@amarulasolutions.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1411118561-26323-2-git-send-email-juri.lelli@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> 2014-09-19 10:22:39 +01:00
			`void __dl_clear_params(struct task_struct *p);`

sched/deadline: Add bandwidth management for SCHED_DEADLINE tasks In order of deadline scheduling to be effective and useful, it is important that some method of having the allocation of the available CPU bandwidth to tasks and task groups under control. This is usually called "admission control" and if it is not performed at all, no guarantee can be given on the actual scheduling of the -deadline tasks. Since when RT-throttling has been introduced each task group have a bandwidth associated to itself, calculated as a certain amount of runtime over a period. Moreover, to make it possible to manipulate such bandwidth, readable/writable controls have been added to both procfs (for system wide settings) and cgroupfs (for per-group settings). Therefore, the same interface is being used for controlling the bandwidth distrubution to -deadline tasks and task groups, i.e., new controls but with similar names, equivalent meaning and with the same usage paradigm are added. However, more discussion is needed in order to figure out how we want to manage SCHED_DEADLINE bandwidth at the task group level. Therefore, this patch adds a less sophisticated, but actually very sensible, mechanism to ensure that a certain utilization cap is not overcome per each root_domain (the single rq for !SMP configurations). Another main difference between deadline bandwidth management and RT-throttling is that -deadline tasks have bandwidth on their own (while -rt ones doesn't!), and thus we don't need an higher level throttling mechanism to enforce the desired bandwidth. 
This patch, therefore: - adds system wide deadline bandwidth management by means of: * /proc/sys/kernel/sched_dl_runtime_us, * /proc/sys/kernel/sched_dl_period_us, that determine (i.e., runtime / period) the total bandwidth available on each CPU of each root_domain for -deadline tasks; - couples the RT and deadline bandwidth management, i.e., enforces that the sum of how much bandwidth is being devoted to -rt -deadline tasks to stay below 100%. This means that, for a root_domain comprising M CPUs, -deadline tasks can be created until the sum of their bandwidths stay below: M * (sched_dl_runtime_us / sched_dl_period_us) It is also possible to disable this bandwidth management logic, and be thus free of oversubscribing the system up to any arbitrary level. Signed-off-by: Dario Faggioli <raistlin@linux.it> Signed-off-by: Juri Lelli <juri.lelli@gmail.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1383831828-15501-12-git-send-email-juri.lelli@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> 2013-11-07 14:43:45 +01:00			`/*`
			`* To keep the bandwidth of -deadline tasks and groups under control`
			`* we need some place to:`
			`* - store the maximum -deadline bandwidth of the system (the group);`
			`* - cache the fraction of that bandwidth that is currently allocated.`
			`*`
			`* This is all done in the data structure below. It is similar to the`
			`* one used for RT-throttling (rt_bandwidth), with the main difference`
			`* that, since here we are only interested in admission control, we`
			`* do not decrease any runtime while the group "executes", nor do we`
			`* need a timer to replenish it.`
			`*`
			`* With respect to SMP, the bandwidth is given on a per-CPU basis,`
			`* meaning that:`
			`* - dl_bw (< 100%) is the bandwidth of the system (group) on each CPU;`
			`* - dl_total_bw array contains, in the i-th element, the currently`
			`* allocated bandwidth on the i-th CPU.`
			`* Moreover, groups consume bandwidth on each CPU, while tasks only`
			`* consume bandwidth on the CPU they're running on.`
			`* Finally, dl_total_bw_cpu is used to cache the index of dl_total_bw`
			`* that will be shown the next time the proc or cgroup controls are`
			`* read. It can in turn be changed by writing to its own`
			`* control.`
			`*/`
			`struct dl_bandwidth {`
			`raw_spinlock_t dl_runtime_lock;`
			`u64 dl_runtime;`
			`u64 dl_period;`
			`};`

			`static inline int dl_bandwidth_enabled(void)`
			`{`
sched/deadline: Remove the sysctl_sched_dl knobs Remove the deadline specific sysctls for now. The problem with them is that the interaction with the existing rt knobs is nearly impossible to get right. The current (as per before this patch) situation is that the rt and dl bandwidth is completely separate and we enforce rt+dl < 100%. This is undesirable because this means that the rt default of 95% leaves us hardly any room, even though dl tasks are safer than rt tasks. Another proposed solution was (a discarded patch) to have the dl bandwidth be a fraction of the rt bandwidth. This is highly confusing imo. Furthermore neither proposal is consistent with the situation we actually want; which is rt tasks run from a dl server. In which case the rt bandwidth is a direct subset of dl. So whichever way we go, the introduction of dl controls at this point is painful. Therefore remove them and instead share the rt budget. This means that for now the rt knobs are used for dl admission control and the dl runtime is accounted against the rt runtime. I realise that this isn't entirely desirable either; but whatever we do we appear to need to change the interface later, so better have a small interface for now. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-zpyqbqds1r0vyxtxza1e7rdc@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> 2013-12-17 12:44:49 +01:00			`return sysctl_sched_rt_runtime >= 0;`
			`}`

			`extern struct dl_bw *dl_bw_of(int i);`

			`struct dl_bw {`
			`raw_spinlock_t lock;`
			`u64 bw, total_bw;`
			`};`

sched/deadline: Fix bandwidth check/update when migrating tasks between exclusive cpusets Exclusive cpusets are the only way users can restrict SCHED_DEADLINE tasks affinity (performing what is commonly called clustered scheduling). Unfortunately, this is currently broken for two reasons: - No check is performed when the user tries to attach a task to an exclusive cpuset (recall that exclusive cpusets have an associated maximum allowed bandwidth). - Bandwidths of source and destination cpusets are not correctly updated after a task is migrated between them. This patch fixes both things at once, as they are opposite faces of the same coin. The check is performed in cpuset_can_attach(), as there aren't any points of failure after that function. The update is split in two halves. We first reserve bandwidth in the destination cpuset, after we pass the check in cpuset_can_attach(). And we then release bandwidth from the source cpuset when the task's affinity is actually changed. Even if there can be time windows when sched_setattr() may erroneously fail in the source cpuset, we are fine with it, as we can't perform an atomic update of both cpusets at once. Reported-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Reported-by: Vincent Legout <vincent@legout.info> Signed-off-by: Juri Lelli <juri.lelli@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Dario Faggioli <raistlin@linux.it> Cc: Michael Trimarchi <michael@amarulasolutions.com> Cc: Fabio Checconi <fchecconi@gmail.com> Cc: michael@amarulasolutions.com Cc: luca.abeni@unitn.it Cc: Li Zefan <lizefan@huawei.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: cgroups@vger.kernel.org Link: http://lkml.kernel.org/r/1411118561-26323-3-git-send-email-juri.lelli@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> 2014-09-19 10:22:40 +01:00			`static inline`
			`void __dl_clear(struct dl_bw *dl_b, u64 tsk_bw)`
			`{`
			`dl_b->total_bw -= tsk_bw;`
			`}`

			`static inline`
			`void __dl_add(struct dl_bw *dl_b, u64 tsk_bw)`
			`{`
			`dl_b->total_bw += tsk_bw;`
			`}`

			`static inline`
			`bool __dl_overflow(struct dl_bw *dl_b, int cpus, u64 old_bw, u64 new_bw)`
			`{`
			`return dl_b->bw != -1 &&`
			`dl_b->bw * cpus < dl_b->total_bw - old_bw + new_bw;`
			`}`
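
The admission test in `__dl_overflow()` can be tried out in user space. The following is a minimal sketch, not the kernel's implementation: every name ending in `_sketch`, the `BW_SHIFT` fixed-point scale of 20 bits, and the `BW_UNLIMITED` constant are assumptions made for illustration, and locking is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BW_SHIFT 20                 /* assumed fixed-point scale: 1.0 == 1 << 20 */
#define BW_UNLIMITED ((uint64_t)-1) /* stands in for the "bw == -1" case: no admission control */

/* Hypothetical user-space stand-in for struct dl_bw, without the lock. */
struct dl_bw_sketch {
	uint64_t bw;       /* per-CPU bandwidth cap, fixed point */
	uint64_t total_bw; /* bandwidth already admitted in this root domain */
};

/* Convert runtime/period (same time unit) into a fixed-point utilization. */
static uint64_t to_ratio_sketch(uint64_t period, uint64_t runtime)
{
	assert(period != 0);
	return (runtime << BW_SHIFT) / period;
}

/* Same comparison as __dl_overflow(): true when the request does not fit. */
static bool dl_overflow_sketch(const struct dl_bw_sketch *b, int cpus,
			       uint64_t old_bw, uint64_t new_bw)
{
	return b->bw != BW_UNLIMITED &&
	       b->bw * cpus < b->total_bw - old_bw + new_bw;
}

/* Admit a new task if it fits, then account for it the way __dl_add() does. */
static bool dl_admit_sketch(struct dl_bw_sketch *b, int cpus, uint64_t new_bw)
{
	if (dl_overflow_sketch(b, cpus, 0, new_bw))
		return false;
	b->total_bw += new_bw;
	return true;
}
```

With a 95% cap (`to_ratio_sketch(1000000, 950000)`) on a 2-CPU root domain, three tasks that each need 10 time units every 20 fit (150% against a limit of 190%), and a fourth is rejected because the total would reach 200%.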

			`extern struct mutex sched_domains_mutex;`

			`#ifdef CONFIG_CGROUP_SCHED`

			`#include <linux/cgroup.h>`

			`struct cfs_rq;`
			`struct rt_rq;`

sched,cgroup: Fix up task_groups list With multiple instances of task_groups, for_each_rt_rq() is a noop, no task groups having been added to the rt.c list instance. This renders __enable/disable_runtime() and print_rt_stats() noop, the user (non) visible effect being that rt task groups are missing in /proc/sched_debug. Signed-off-by: Mike Galbraith <efault@gmx.de> Cc: stable@kernel.org # v3.3+ Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1344308413.6846.7.camel@marge.simpson.net Signed-off-by: Thomas Gleixner <tglx@linutronix.de> 2012-08-07 05:00:13 +02:00			`extern struct list_head task_groups;`
			`struct cfs_bandwidth {`
			`#ifdef CONFIG_CFS_BANDWIDTH`
			`raw_spinlock_t lock;`
			`ktime_t period;`
			`u64 quota, runtime;`
sched: Clean up some typos and grammatical errors in code/comments Signed-off-by: Zhihui Zhang <zzhsuny@gmail.com> Cc: peterz@infradead.org Link: http://lkml.kernel.org/r/1411262676-19928-1-git-send-email-zzhsuny@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> 2014-09-20 21:24:36 -04:00			`s64 hierarchical_quota;`
			`u64 runtime_expires;`

sched,perf: Fix periodic timers In the below two commits (see Fixes) we have periodic timers that can stop themselves when they're no longer required, but need to be (re)-started when their idle condition changes. Further complications is that we want the timer handler to always do the forward such that it will always correctly deal with the overruns, and we do not want to race such that the handler has already decided to stop, but the (external) restart sees the timer still active and we end up with a 'lost' timer. The problem with the current code is that the re-start can come before the callback does the forward, at which point the forward from the callback will WARN about forwarding an enqueued timer. Now, conceptually its easy to detect if you're before or after the fwd by comparing the expiration time against the current time. Of course, that's expensive (and racy) because we don't have the current time. Alternatively one could cache this state inside the timer, but then everybody pays the overhead of maintaining this extra state, and that is undesired. The only other option that I could see is the external timer_active variable, which I tried to kill before. I would love a nicer interface for this seemingly simple 'problem' but alas. Fixes: 272325c4821f ("perf: Fix mux_interval hrtimer wreckage") Fixes: 77a4d1a1b9a1 ("sched: Cleanup bandwidth timers") Cc: pjt@google.com Cc: tglx@linutronix.de Cc: klamm@yandex-team.ru Cc: mingo@kernel.org Cc: bsegall@google.com Cc: hpa@zytor.com Cc: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/20150514102311.GX21418@twins.programming.kicks-ass.net 2015-05-14 12:23:11 +02:00			`int idle, period_active;`
			`struct hrtimer period_timer, slack_timer;`
			`struct list_head throttled_cfs_rq;`

			`/* statistics */`
			`int nr_periods, nr_throttled;`
			`u64 throttled_time;`
			`#endif`
			`};`

			`/* task group related information */`
			`struct task_group {`
			`struct cgroup_subsys_state css;`

sched: Add cgroup-based criteria for upmigration It may be desirable to discourage upmigration of tasks belonging to some cgroups. Add a per-cgroup flag (upmigrate_discourage) that discourages upmigration of tasks of a cgroup. Tasks of the cgroup are allowed to upmigrate only under overcommitted scenario. Change-Id: I1780e420af1b6865c5332fb55ee1ee408b74d8ce Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org> [rameezmustafa@codeaurora.org: Use new cgroup APIs] Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> 2015-02-06 18:05:53 +05:30			`#ifdef CONFIG_SCHED_HMP`
			`bool upmigrate_discouraged;`
			`#endif`
sched: provide per cpu-cgroup option to notify on migrations On systems where CPUs may run asynchronously, task migrations between CPUs running at grossly different speeds can cause problems. This change provides a mechanism to notify a subsystem in the kernel if a task in a particular cgroup migrates to a different CPU. Other subsystems (such as cpufreq) may then register for this notifier to take appropriate action when such a task is migrated. The cgroup attribute to set for this behavior is "notify_on_migrate" . Change-Id: Ie1868249e53ef901b89c837fdc33b0ad0c0a4590 Signed-off-by: Steve Muckle <smuckle@codeaurora.org> [rameezmustafa@codeaurora.org: Use new cgroup APIs, fix 64-bit compilation issues and resolve some merge conflicts. Also squash "2bd8075 sched: remove migration notification from RT class" into this patch.] Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> [joonwoop@codeaurora.org: Incorporated with new __migrate_task().] Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> 2013-03-11 16:33:42 -07:00
			`#ifdef CONFIG_FAIR_GROUP_SCHED`
			`/* schedulable entities of this group on each cpu */`
			`struct sched_entity **se;`
			`/* runqueue "owned" by this group on each cpu */`
			`struct cfs_rq **cfs_rq;`
			`unsigned long shares;`

sched: Move a few runnable tg variables into CONFIG_SMP The following 2 variables are only used under CONFIG_SMP, so it's better to move their definition into CONFIG_SMP too. atomic64_t load_avg; atomic_t runnable_avg; Signed-off-by: Alex Shi <alex.shi@intel.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1371694737-29336-3-git-send-email-alex.shi@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org> 2013-06-20 10:18:46 +08:00			`#ifdef CONFIG_SMP`
sched/tg: Use 'unsigned long' for load variable in task group Since tg->load_avg is smaller than tg->load_weight, we don't need an atomic64_t variable for load_avg on 32-bit machines. The same reasoning applies to cfs_rq->tg_load_contrib. The atomic_long_t/unsigned long variable types are more efficient and convenient for them. Signed-off-by: Alex Shi <alex.shi@intel.com> Tested-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1371694737-29336-11-git-send-email-alex.shi@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org> 2013-06-20 10:18:54 +08:00			`atomic_long_t load_avg;`
			`#endif`
			`#endif`
			`#ifdef CONFIG_RT_GROUP_SCHED`
			`struct sched_rt_entity **rt_se;`
			`struct rt_rq **rt_rq;`

			`struct rt_bandwidth rt_bandwidth;`
			`#endif`

			`struct rcu_head rcu;`
			`struct list_head list;`

			`struct task_group *parent;`
			`struct list_head siblings;`
			`struct list_head children;`

			`#ifdef CONFIG_SCHED_AUTOGROUP`
			`struct autogroup *autogroup;`
			`#endif`

			`struct cfs_bandwidth cfs_bandwidth;`
			`};`

			`#ifdef CONFIG_FAIR_GROUP_SCHED`
			`#define ROOT_TASK_GROUP_LOAD NICE_0_LOAD`

			`/*`
			`* A weight of 0 or 1 can cause arithmetic problems.`
			`* The weight of a cfs_rq is the sum of the weights of the entities`
			`* queued on this cfs_rq, so the weight of an entity should not be`
			`* too large, and neither should the shares value of a task group.`
			`* (The default weight is 1024 - so there's no practical`
			`* limitation from this.)`
			`*/`
			`#define MIN_SHARES (1UL << 1)`
			`#define MAX_SHARES (1UL << 18)`
			`#endif`

			`typedef int (*tg_visitor)(struct task_group *, void *);`

			`extern int walk_tg_tree_from(struct task_group *from,`
			`tg_visitor down, tg_visitor up, void *data);`

			`/*`
			`* Iterate the full tree, calling @down when first entering a node and @up when`
			`* leaving it for the final time.`
			`*`
			`* Caller must hold rcu_lock or sufficient equivalent.`
			`*/`
			`static inline int walk_tg_tree(tg_visitor down, tg_visitor up, void *data)`
			`{`
			`return walk_tg_tree_from(&root_task_group, down, up, data);`
			`}`
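
The down/up visitor pattern described above can be illustrated outside the kernel. This is a minimal sketch under stated assumptions: `struct node_sketch` is a hypothetical first-child/next-sibling tree rather than `struct task_group`, the walk is recursive for brevity (it is not claimed to match how `walk_tg_tree_from()` traverses the tree), and a non-zero return from either visitor is assumed to abort the walk.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tree node; the real struct task_group uses list_head links. */
struct node_sketch {
	int id;
	struct node_sketch *child;   /* first child, or NULL */
	struct node_sketch *sibling; /* next sibling, or NULL */
};

typedef int (*visitor_sketch)(struct node_sketch *, void *);

/*
 * Call @down when first entering a node and @up when leaving it for the
 * final time. A non-zero return value from either visitor stops the walk
 * and is passed back to the caller.
 */
static int walk_tree_sketch(struct node_sketch *from,
			    visitor_sketch down, visitor_sketch up, void *data)
{
	struct node_sketch *c;
	int ret;

	assert(from != NULL);
	ret = down(from, data);
	if (ret)
		return ret;
	for (c = from->child; c; c = c->sibling) {
		ret = walk_tree_sketch(c, down, up, data);
		if (ret)
			return ret;
	}
	return up(from, data);
}

/* Visitor that does nothing, in the spirit of tg_nop(). */
static int nop_sketch(struct node_sketch *n, void *data)
{
	(void)n;
	(void)data;
	return 0;
}

/* Visitor that records the order in which nodes are seen. */
struct trace_sketch {
	int order[16];
	int len;
};

static int record_sketch(struct node_sketch *n, void *data)
{
	struct trace_sketch *t = data;

	if (t->len < 16)
		t->order[t->len++] = n->id;
	return 0;
}
```

Passing `record_sketch` as the down visitor yields a pre-order listing (parents before children), while passing it as the up visitor yields a post-order listing (children before parents), which is why callers that only need one direction can pass a no-op for the other.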

			`extern int tg_nop(struct task_group *tg, void *data);`

			`extern void free_fair_sched_group(struct task_group *tg);`
			`extern int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent);`
sched/cgroup: Fix cgroup entity load tracking tear-down When a cgroup's CPU runqueue is destroyed, it should remove its remaining load accounting from its parent cgroup. The current site for doing so it unsuited because its far too late and unordered against other cgroup removal (->css_free() will be, but we're also in an RCU callback). Put it in the ->css_offline() callback, which is the start of cgroup destruction, right after the group has been made unavailable to userspace. The ->css_offline() callbacks are called in hierarchical order after the following v4.4 commit: aa226ff4a1ce ("cgroup: make sure a parent css isn't offlined before its children") Change-Id: Ice7cbd71d9e545da84d61686aa46c7213607bb9d Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Li Zefan <lizefan@huawei.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20160121212416.GL6357@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org> Git-commit: 6fe1f348b3dd1f700f9630562b7d38afd6949568 Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git Signed-off-by: Satya Durga Srinivasu Prabhala <satyap@codeaurora.org> 2016-01-21 22:24:16 +01:00			`extern void unregister_fair_sched_group(struct task_group *tg);`
			`extern void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,`
			`struct sched_entity *se, int cpu,`
			`struct sched_entity *parent);`
			`extern void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b);`
			`extern int sched_group_set_shares(struct task_group *tg, unsigned long shares);`

			`extern void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b);`
sched: Cleanup bandwidth timers Roman reported a 3 cpu lockup scenario involving __start_cfs_bandwidth(). The more I look at that code the more I'm convinced its crack, that entire __start_cfs_bandwidth() thing is brain melting, we don't need to cancel a timer before starting it, hrtimer_start() will happily remove the timer for you if its still enqueued. Removing that, removes a big part of the problem, no more ugly cancel loop to get stuck in. So now, if I understand things right, the entire reason you have this cfs_b->lock guarded ->timer_active nonsense is to make sure we don't accidentally lose the timer. It appears to me that it should be possible to guarantee that same by unconditionally (re)starting the timer when !queued. Because regardless what hrtimer::function will return, if we beat it to (re)enqueue the timer, it doesn't matter. Now, because hrtimers don't come with any serialization guarantees we must ensure both handler and (re)start loop serialize their access to the hrtimer to avoid both trying to forward the timer at the same time. Update the rt bandwidth timer to match. This effectively reverts: 09dc4ab03936 ("sched/fair: Fix tg_set_cfs_bandwidth() deadlock on rq->lock"). Reported-by: Roman Gushchin <klamm@yandex-team.ru> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ben Segall <bsegall@google.com> Cc: Paul Turner <pjt@google.com> Link: http://lkml.kernel.org/r/20150415095011.804589208@infradead.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de> 2015-04-15 11:41:57 +02:00			`extern void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b);`
			`extern void unthrottle_cfs_rq(struct cfs_rq *cfs_rq);`

			`extern void free_rt_sched_group(struct task_group *tg);`
			`extern int alloc_rt_sched_group(struct task_group *tg, struct task_group *parent);`
			`extern void init_tg_rt_entry(struct task_group *tg, struct rt_rq *rt_rq,`
			`struct sched_rt_entity *rt_se, int cpu,`
			`struct sched_rt_entity *parent);`

sched: Move group scheduling functions out of include/linux/sched.h - Make sched_group_{set_,}runtime(), sched_group_{set_,}period() and sched_rt_can_attach() static. - Move sched_{create,destroy,online,offline}_group() to kernel/sched/sched.h. - Remove declaration of sched_group_shares(). Signed-off-by: Li Zefan <lizefan@huawei.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/5135A7C5.3000708@huawei.com Signed-off-by: Ingo Molnar <mingo@kernel.org> 2013-03-05 16:07:33 +08:00			`extern struct task_group *sched_create_group(struct task_group *parent);`
			`extern void sched_online_group(struct task_group *tg,`
			`struct task_group *parent);`
			`extern void sched_destroy_group(struct task_group *tg);`
			`extern void sched_offline_group(struct task_group *tg);`

			`extern void sched_move_task(struct task_struct *tsk);`

			`#ifdef CONFIG_FAIR_GROUP_SCHED`
			`extern int sched_group_set_shares(struct task_group *tg, unsigned long shares);`
BACKPORT: sched/fair: Make it possible to account fair load avg consistently While set_task_rq_fair() is introduced in mainline by commit ad936d8658fd ("sched/fair: Make it possible to account fair load avg consistently"), the function results to be introduced here by the backport of commit 09a43ace1f98 ("sched/fair: Propagate load during synchronous attach/detach"). The problem (apart from the confusion introduced by the backport) is actually that set_task_rq_fair() is currently not called at all. Fix the problem by backporting again commit ad936d8658fd ("sched/fair: Make it possible to account fair load avg consistently"). Original change log: The current code accounts for the time a task was absent from the fair class (per ATTACH_AGE_LOAD). However it does not work correctly when a task got migrated or moved to another cgroup while outside of the fair class. This patch tries to address that by aging on migration. We locklessly read the 'last_update_time' stamp from both the old and new cfs_rq, ages the load upto the old time, and sets it to the new time. These timestamps should in general not be more than 1 tick apart from one another, so there is a definite bound on things. Signed-off-by: Byungchul Park <byungchul.park@lge.com> [ Changelog, a few edits and !SMP build fix ] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1445616981-29904-2-git-send-email-byungchul.park@lge.com Signed-off-by: Ingo Molnar <mingo@kernel.org> (cherry-picked from ad936d8658fd348338cb7d42c577dac77892b074) Signed-off-by: Juri Lelli <juri.lelli@arm.com> Signed-off-by: Chris Redpath <chris.redpath@arm.com> Change-Id: I17294ab0ada3901d35895014715fd60952949358 Signed-off-by: Brendan Jackman <brendan.jackman@arm.com> 2017-05-30 14:51:53 +01:00
			`#ifdef CONFIG_SMP`
			`extern void set_task_rq_fair(struct sched_entity *se,`
			`struct cfs_rq *prev, struct cfs_rq *next);`
			`#else /* !CONFIG_SMP */`
			`static inline void set_task_rq_fair(struct sched_entity *se,`
			`struct cfs_rq *prev, struct cfs_rq *next) { }`
			`#endif /* CONFIG_SMP */`
			`#endif /* CONFIG_FAIR_GROUP_SCHED */`
sched: Move most HMP specific code to a separate file. Most code pertaining to CONFIG_SCHED_HMP has been moved to a separate file "hmp.c" in order to facilitate kernel upgrades. Fewer changes in the original scheduler files means fewer conflicts. Some parts of code, however, could not be moved to the separate file either because of dependencies with other non-HMP code or because the changes are specific only to the scheduling classes where the code resides. Change-Id: Ib067ac75e5a494008dcb3c67586b622c1b3962ce Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> 2016-08-01 17:48:21 -07:00			`extern struct task_group *css_tg(struct cgroup_subsys_state *css);`
			`#else /* CONFIG_CGROUP_SCHED */`

			`struct cfs_bandwidth { };`

			`#endif /* CONFIG_CGROUP_SCHED */`

sched: Consolidate hmp stats into their own struct Key hmp stats (nr_big_tasks, nr_small_tasks and cumulative_runnable_average) are currently maintained per-cpu in 'struct rq'. Merge those stats in their own structure (struct hmp_sched_stats) and modify impacted functions to deal with the newly introduced structure. This cleanup is required for a subsequent patch which fixes various issues with use of CFS_BANDWIDTH feature in HMP scheduler. Change-Id: Ieffc10a3b82a102f561331bc385d042c15a33998 Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org> [rameezmustafa@codeaurora.org: Port to msm-3.18] Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> [joonwoop@codeaurora.org: fixed conflict in __update_load_avg().] Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> 2015-01-16 11:27:31 +05:30			`#ifdef CONFIG_SCHED_HMP`

sched: Add the mechanics of top task tracking for frequency guidance The previous patches in this rewrite of scheduler guided frequency selection reintroduce the part-picture problem that we addressed in our initial implementation. That is, when tasks migrate across CPUs within a cluster, we end up losing the complete picture of the sequential nature of the workload. This patch aims to solve that problem slightly differently. We track the top task on every CPU within a window. The top task is defined as the task that runs the most in a given window. This enhances our ability to detect the sequential nature of workloads. A single migrating task executing for an entire window will cause 100% load to be reported for frequency guidance instead of the maximum footprint left on any individual CPU in the task's trail. There are cases that this new approach does not address, namely cases where the sum of two or more tasks accurately reflects the true sequential nature of the workload. Future optimizations might aim to tackle that problem. To track top tasks, we first realize that there is no strict need to maintain the task struct itself as long as we know the load exerted by the top task. We also realize that to maintain top tasks on every CPU we have to track the execution of every single task that runs during the window. The load associated with a task needs to be migrated when the task migrates from one CPU to another. When the top task migrates away, we need to locate the second top task and so on. Given the above realizations, we use hashmaps to track top task load both for the current and the previous window. This hashmap is implemented as an array of fixed size. The key of the hashmap is given by task_execution_time_in_a_window / array_size. The size of the array (number of buckets in the hashmap) dictates the load granularity of each bucket. The value stored in each bucket is a refcount of all the tasks that executed long enough to be in that bucket. 
This approach has a few benefits. Firstly, any top task stats update now takes O(1) time. While task migration is also O(1), it can still involve scanning up to the full size of the array to find the second top task. Further patches will aim to optimize this behavior. Secondly, and more importantly, not having to store the task struct itself saves a lot of memory usage in that 1) there is no need to retrieve task structs later, causing cache misses, and 2) we don't have to unnecessarily hold up task memory for up to 2 full windows by calling get_task_struct() after a task exits. Change-Id: I004dba474f41590db7d3f40d9deafe86e71359ac Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> 2016-05-31 16:40:45 -07:00			`#define NUM_TRACKED_WINDOWS 2`
			`#define NUM_LOAD_INDICES 1000`
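The bucketed refcount table described in the top task tracking commit can be sketched in user space as follows. This is a minimal illustration, not the kernel implementation: `WINDOW_NS`, `struct top_task_table` and the three helper functions are hypothetical names, and the bucket width is assumed to be the window length divided by `NUM_LOAD_INDICES`, which is one reading of "the size of the array dictates the load granularity of each bucket".

```c
#include <stdint.h>

#define NUM_LOAD_INDICES 1000

/* Hypothetical window length in nanoseconds (an assumption for this sketch). */
#define WINDOW_NS 20000000ULL

/* One refcount per bucket: how many tasks have a load that falls in it. */
struct top_task_table {
	unsigned int refcount[NUM_LOAD_INDICES];
	int top_index;	/* highest non-empty bucket, 0 when the table is empty */
};

/* Map a task's execution time within the window to a bucket index. */
static int load_to_index(uint64_t load_ns)
{
	uint64_t granule = WINDOW_NS / NUM_LOAD_INDICES;
	uint64_t idx = load_ns / granule;

	return idx >= NUM_LOAD_INDICES ? NUM_LOAD_INDICES - 1 : (int)idx;
}

/* O(1): account a task's load and raise the top index if needed. */
static void add_task_load(struct top_task_table *t, uint64_t load_ns)
{
	int idx = load_to_index(load_ns);

	t->refcount[idx]++;
	if (idx > t->top_index)
		t->top_index = idx;
}

/*
 * Remove a task's load, for example when it migrates away. If the top
 * bucket becomes empty, scan downward for the next non-empty bucket;
 * this scan is the part that can touch up to the whole array.
 */
static void remove_task_load(struct top_task_table *t, uint64_t load_ns)
{
	int idx = load_to_index(load_ns);

	if (t->refcount[idx])
		t->refcount[idx]--;
	while (t->top_index > 0 && !t->refcount[t->top_index])
		t->top_index--;
}
```

Only the load bucket is stored, never a task pointer, which is why no task struct has to be pinned across windows.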
sched: Enhance the scheduler migration load fixup feature In the current frequency guidance implementation the scheduler migrates task load from the source CPU to the destination CPU when a task migrates. The underlying assumption is that a task will stay on the destination CPU following the migration. Hence a CPU's load should reflect the sum of all tasks that last ran on that CPU prior to window expiration even if these tasks executed on some other CPU in that window prior to being migrated. However, given the ubiquitous nature of migrations, the above assumption is flawed, causing the scheduler to often add up load on a single CPU that in reality ran concurrently on multiple CPUs and will continue to run concurrently in subsequent windows. This leads to load over reporting on a single CPU, which in turn causes CPU frequency to be higher than necessary. This is the first patch in a series of patches that attempts to change how load fixups are done upon migration to prevent load over reporting. In this patch, we stop doing migration fixups for intra-cluster migrations. Inter-cluster migration fixups are still retained. In order to achieve the above, we make use of the per CPU footprint of each task introduced in the previous patch. Upon inter cluster migration, we go through every CPU in the source cluster to subtract the migrating task's contribution to the busy time on each one of those CPUs. The sum of the contributions is then added to the destination CPU, allowing it to ramp up to the appropriate frequency for that task. Subtracting load from each of the source CPUs is not trivial, however, as it would require all runqueue locks to be held. To get around this we introduce a deferred load subtraction mechanism whereby subtracting load from each of the source CPUs is deferred until an opportune moment. This opportune moment is when the governor comes asking the scheduler for load. At that time, all necessary runqueue locks are already held. 
There are a few cases to consider when doing deferred subtraction. Since we are not holding all runqueue locks, other CPUs in the source cluster can be in a different window than the source CPU where the task is migrating from. Case 1: The other CPU in the source cluster is in the same window. No special consideration is needed. Case 2: The other CPU in the source cluster is ahead by 1 window. In this case, we will be doing redundant updates to subtraction load for the prev window. There is no way to avoid this redundant update, though, without holding the rq lock. Case 3: The other CPU in the source cluster is trailing by 1 window. In this case, we might end up overwriting old data for that CPU. But this is not a problem, as when the other CPU calls update_task_ravg() it will move to the same window. This relies on maintaining synchronized windows between CPUs, which is true today. Finally, we must deal with frequency aggregation. When frequency aggregation is in effect, there is little point in dealing with per CPU footprint since the load of all related tasks has to be reported on a single CPU. Therefore when a task enters a related group we clear out all per CPU contributions and add them to the task CPU's cpu_time struct. From that point onwards we stop managing per CPU contributions upon inter cluster migrations since that work is redundant. Finally, when a task exits a related group we must walk every CPU and reset all CPU contributions. We then set the task CPU contribution to the respective curr/prev sum values and add that sum to the task CPU rq runnable sum. Change-Id: I1f8d596e6c930f3f6f00e24109ddbe8b121f8d6b Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> 2016-05-19 17:06:47 -07:00
sched: Consolidate hmp stats into their own struct 2015-01-16 11:27:31 +05:30			`struct hmp_sched_stats {`
sched: remove the notion of small tasks and small task packing Task packing will now be determined solely on the basis of the power cost of task placement. All tasks are eligible for packing. Remove the notion of "small" tasks from the scheduler. Change-Id: I72d52d04b2677c6a8d0bc6aa7d50ff0f1a4f5ebb Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> 2015-06-19 12:28:24 -07:00			`int nr_big_tasks;`
sched: Consolidate hmp stats into their own struct 2015-01-16 11:27:31 +05:30			`u64 cumulative_runnable_avg;`
sched: Add separate load tracking histogram to predict loads Current window based load tracking only saves history for five windows. A historically heavy task's heavy load will be completely forgotten after five windows of light load. Even before the five windows expire, a heavy task that wakes up on the same CPU it used to run on won't trigger any frequency change until the end of the window. It would starve for the entire window. It also adds one "small" load window to history because it's accumulating load at a low frequency, further reducing the tracked load for this heavy task. Ideally, the scheduler should be able to identify such tasks and notify the governor to increase frequency immediately after the task wakes up. Add a histogram for each task to track a much longer load history. A prediction will be made based on the runtime of the previous or current window, histogram data and load tracked in recent windows. The predictions of all tasks that are currently running or runnable on a CPU are aggregated and reported to the CPUFreq governor in sched_get_cpus_busy(). sched_get_cpus_busy() now returns predicted busy time in addition to previous window busy time and new task busy time, scaled to the CPU maximum possible frequency. Tunables: - /proc/sys/kernel/sched_gov_alert_freq (KHz) This tunable can be used to further filter the notifications. A frequency alert notification is sent only when the predicted load exceeds the previous window load by sched_gov_alert_freq converted to load. Change-Id: If29098cd2c5499163ceaff18668639db76ee8504 Suggested-by: Saravana Kannan <skannan@codeaurora.org> Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org> Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> Signed-off-by: Junjie Wu <junjiew@codeaurora.org> [joonwoop@codeaurora.org: fixed merge conflicts around __migrate_task() and removed changes for CONFIG_SCHED_QHMP.] 2015-06-08 09:08:47 +05:30			`u64 pred_demands_sum;`
sched: Consolidate hmp stats into their own struct 2015-01-16 11:27:31 +05:30			`};`

sched: Enhance the scheduler migration load fixup feature 2016-05-19 17:06:47 -07:00			`struct load_subtractions {`
			`u64 window_start;`
			`u64 subs;`
			`u64 new_subs;`
			`};`
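The deferred subtraction mechanism from the migration load fixup commit can be sketched as below. This is a user-space illustration under stated assumptions: the helper names are hypothetical, `new_subs` is assumed to hold the part of the subtraction that belongs to new tasks, and recycling the slot with the oldest `window_start` is an assumed policy for handling CPUs that are one window ahead or behind.

```c
#include <stdint.h>

#define NUM_TRACKED_WINDOWS 2

struct load_subtractions {
	uint64_t window_start;
	uint64_t subs;
	uint64_t new_subs;
};

/* Find the slot for window ws, recycling the oldest slot if none matches. */
static int get_subtraction_index(struct load_subtractions *ls, uint64_t ws)
{
	int i, oldest = 0;

	for (i = 0; i < NUM_TRACKED_WINDOWS; i++) {
		if (ls[i].window_start == ws)
			return i;
		if (ls[i].window_start < ls[oldest].window_start)
			oldest = i;
	}

	ls[oldest].window_start = ws;
	ls[oldest].subs = 0;
	ls[oldest].new_subs = 0;
	return oldest;
}

/* Migration path: only record the pending subtraction, touch no counters. */
static void defer_subtraction(struct load_subtractions *ls, uint64_t ws,
			      uint64_t load, int new_task)
{
	int i = get_subtraction_index(ls, ws);

	ls[i].subs += load;
	if (new_task)
		ls[i].new_subs += load;
}

/*
 * Governor query path (all runqueue locks assumed held): apply the pending
 * subtraction for window ws to a busy-time value and clear the slot.
 */
static uint64_t apply_subtraction(struct load_subtractions *ls, uint64_t ws,
				  uint64_t busy)
{
	int i;

	for (i = 0; i < NUM_TRACKED_WINDOWS; i++) {
		if (ls[i].window_start == ws) {
			uint64_t sub = ls[i].subs;

			ls[i].subs = 0;
			ls[i].new_subs = 0;
			return busy > sub ? busy - sub : 0;
		}
	}
	return busy;
}
```

Keying each slot by `window_start` is what lets a source CPU that is in a different window from the migrating CPU still receive the subtraction against the right window.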

sched: maintain group busy time counters in runqueue There is no advantage of tracking busy time counters per related thread group. We need busy time across all groups for either a CPU or a frequency domain. Hence maintain group busy time counters in the runqueue itself. When CPU window is rolled over, the group busy counters are also rolled over. This eliminates the overhead of individual group's window_start maintenance. As we are preallocating related thread group now, this patch saves 40 * nr_cpu_ids * (nr_grp - 1) bytes memory. Change-Id: Ieaaccea483b377f54ea1761e6939ee23a78a5e9c Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org> 2017-01-09 13:56:33 +05:30			`struct group_cpu_time {`
			`u64 curr_runnable_sum;`
			`u64 prev_runnable_sum;`
			`u64 nt_curr_runnable_sum;`
			`u64 nt_prev_runnable_sum;`
			`};`
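The commit above states that the group busy counters roll over together with the CPU window. A minimal sketch of such a rollover follows; the function name and the `full_window` parameter (nonzero when at least one whole window passed with no update) are assumptions made for illustration, not taken from the source.

```c
#include <stdint.h>

struct group_cpu_time {
	uint64_t curr_runnable_sum;
	uint64_t prev_runnable_sum;
	uint64_t nt_curr_runnable_sum;
	uint64_t nt_prev_runnable_sum;
};

/*
 * Roll the counters over at a window boundary. The current sums become
 * the previous sums, unless a full window was skipped, in which case
 * the previous window saw no activity at all.
 */
static void rollover_group_cpu_time(struct group_cpu_time *g, int full_window)
{
	g->prev_runnable_sum = full_window ? 0 : g->curr_runnable_sum;
	g->nt_prev_runnable_sum = full_window ? 0 : g->nt_curr_runnable_sum;
	g->curr_runnable_sum = 0;
	g->nt_curr_runnable_sum = 0;
}
```

Because the counters live in the runqueue, they can share the runqueue's window start and need no `window_start` field of their own.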

sched: Introduce the concept CPU clusters in the scheduler A cluster is a set of CPUs sharing some power controls and an L2 cache. This patch builds a list of clusters at bootup, sorted by their max_power_cost. Many cluster-shared attributes such as cur_freq and max_freq are currently maintained needlessly in the per-cpu 'struct rq'. Consolidate them in a cluster structure. Change-Id: I0567672ad5fb67d211d9336181ceb53b9f6023af Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org> Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> [joonwoop@codeaurora.org: fixed minor conflict in arch/arm64/kernel/topology.c. fixed conflict due to omitted changes for CONFIG_SCHED_QHMP.] Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> 2015-04-20 12:35:48 +05:30			`struct sched_cluster {`
sched: Enhance the scheduler migration load fixup feature 2016-05-19 17:06:47 -07:00			`raw_spinlock_t load_lock;`
sched: Introduce the concept CPU clusters in the scheduler 2015-04-20 12:35:48 +05:30			`struct list_head list;`
			`struct cpumask cpus;`
			`int id;`
			`int max_power_cost;`
sched: Take cluster's minimum power into account for optimizing sbc() The select_best_cpu() algorithm iterates over all the clusters and selects the most power efficient CPU that satisfies the task needs. During the search, skip the next cluster if its minimum power cost is higher than the power cost of an eligible CPU found in the previous cluster. In a b.L system, if the BIG cluster minimum power cost is higher than the maximum power cost of the little cluster, this optimization avoids searching the BIG cluster if an eligible CPU is found in the little cluster. Change-Id: I5e3755f107edb6c72180edbec2a658be931c276d Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org> 2015-12-14 14:23:24 +05:30			`int min_power_cost;`
sched: Introduce the concept CPU clusters in the scheduler 2015-04-20 12:35:48 +05:30			`int max_possible_capacity;`
			`int capacity;`
			`int efficiency; /* Differentiate cpus with different IPC capability */`
			`int load_scale_factor;`
sched: kill unnecessary divisions on fast path The max_possible_efficiency and the CPU's efficiency are fixed values which are determined at cluster allocation time. Avoid division on the fast path by using a precomputed scale factor. Also, update_cpu_busy_time() doesn't need to know how many full windows have elapsed, so replace the unneeded division with a simple comparison. Change-Id: I2be1aad3fb9b895e4f0917d05bd8eade985bbccf Suggested-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> 2016-06-17 15:15:04 -07:00			`unsigned int exec_scale_factor;`
sched: Introduce the concept CPU clusters in the scheduler 2015-04-20 12:35:48 +05:30			`/*`
sched: take into account of limited CPU min and max frequencies A CPU's actual min and max frequencies can be limited by hardware components without the governor being aware of it. Provide an API for those components to notify the scheduler, so that it knows the accurate current operating frequency boundaries, which helps it make better task placement decisions. CRs-fixed: 1006303 Change-Id: I608f5fa8b0baff8d9e998731dcddec59c9073d20 Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> 2016-03-28 14:22:52 -07:00			`* max_freq = user maximum`
			`* max_mitigated_freq = thermal defined maximum`
sched: Introduce the concept CPU clusters in the scheduler 2015-04-20 12:35:48 +05:30			`* max_possible_freq = maximum supported by hardware`
			`*/`
sched: take into account of limited CPU min and max frequencies 2016-03-28 14:22:52 -07:00			`unsigned int cur_freq, max_freq, max_mitigated_freq, min_freq;`
			`unsigned int max_possible_freq;`
sched: Introduce the concept CPU clusters in the scheduler 2015-04-20 12:35:48 +05:30			`bool freq_init_done;`
			`int dstate, dstate_wakeup_latency, dstate_wakeup_energy;`
			`unsigned int static_cluster_pwr_cost;`
sched: handle frequency alert notifications better The load reporting during frequency alert notifications is broken under load aggregation. When aggregation is enabled, the total group busy time is accounted towards the maximum busy CPU of a frequency domain. If this CPU has a notification pending, its group busy time alone is accounted and the other CPUs' group busy time is completely ignored. Similarly, if any CPU other than the maximum busy CPU has a pending notification, its group busy time is accounted twice. Maintain the frequency alert notification flag per frequency domain. When the notification is pending, don't clip the load to 100% @ fcur for any of the CPUs in the frequency domain. Change-Id: Iebc7d74d6fafa20430fa1c7d80f34a6ab198832d Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org> 2016-08-12 16:12:53 +05:30			`int notifier_sent;`
sched: Convert the global wake_up_idle flag to a per cluster flag Since clusters can vary significantly in their power and performance characteristics, there may be a need to have different CPU selection policies based on which cluster a task is being placed on. For example, the placement policy can be more aggressive in using idle CPUs on clusters that are power efficient and less aggressive on clusters that are geared towards performance. Add support for a per cluster wake_up_idle flag to allow greater flexibility in placement policies. Change-Id: I18cd3d907cd965db03a13f4655870dc10c07acfe Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> 2017-01-04 15:56:51 -08:00			`bool wake_up_idle;`
sched: hmp: Optimize cycle counter reads The cycle counter read is a bit of an expensive operation and requires locking across all CPUs in a cluster. Optimize this by returning the same value if the delta between two reads is zero (so if two reads are done in the same sched context) or if the last read was within a specific time period prior to the current read. Change-Id: I99da5a704d3652f53c8564ba7532783d3288f227 Signed-off-by: Vikram Mulukutla <markivx@codeaurora.org> 2017-05-30 14:38:55 -07:00			`atomic64_t last_cc_update;`
			`atomic64_t cycles;`
sched: Introduce the concept CPU clusters in the scheduler 2015-04-20 12:35:48 +05:30			`};`
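The `min_power_cost` field supports the select_best_cpu() optimization described in its commit: a cluster is skipped when its minimum power cost is already higher than the cost of an eligible CPU found earlier. The sketch below shows that pruning on a reduced, hypothetical view of a cluster. `struct cluster_view`, `eligible_cpu_cost` and `pick_cluster()` are invented for illustration, and the array is assumed to be ordered by ascending `max_power_cost`, as the cluster list is said to be.

```c
#include <stddef.h>

/* Reduced, hypothetical view of a cluster: only what this sketch needs. */
struct cluster_view {
	int id;
	int min_power_cost;
	int max_power_cost;
	int eligible_cpu_cost;	/* cost of the best eligible CPU, -1 if none */
};

/*
 * Walk clusters in list order and return the id of the cluster holding
 * the cheapest eligible CPU, or -1 if no cluster has one.
 */
static int pick_cluster(const struct cluster_view *clusters, size_t n)
{
	int best_id = -1;
	int best_cost = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		const struct cluster_view *c = &clusters[i];

		/* No CPU in this cluster can beat the candidate we have. */
		if (best_id >= 0 && c->min_power_cost > best_cost)
			continue;

		if (c->eligible_cpu_cost >= 0 &&
		    (best_id < 0 || c->eligible_cpu_cost < best_cost)) {
			best_id = c->id;
			best_cost = c->eligible_cpu_cost;
		}
	}
	return best_id;
}
```

On a big.LITTLE system where the big cluster's minimum cost exceeds the little cluster's maximum cost, this check means the big cluster is never searched once the little cluster yields an eligible CPU.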

			`extern unsigned long all_cluster_ids[];`

			`static inline int cluster_first_cpu(struct sched_cluster *cluster)`
			`{`
			`return cpumask_first(&cluster->cpus);`
			`}`

sched: colocate related threads Provide userspace interface for tasks to be grouped together as "related" threads. For example, all threads involved in updating display buffer could be tagged as related. Scheduler will attempt to provide special treatment for group of related threads such as: 1) Colocation of related threads in same "preferred" cluster 2) Aggregation of demand towards determination of cluster frequency This patch extends scheduler to provide best-effort colocation support for a group of related threads. Change-Id: Ic2cd769faf5da4d03a8f3cb0ada6224d0101a5f5 Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org> [joonwoop@codeaurora.org: fixed minor merge conflicts. removed ifdefry for CONFIG_SCHED_QHMP.] Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> 2015-04-24 15:44:31 +05:30			`struct related_thread_group {`
			`int id;`
			`raw_spinlock_t lock;`
			`struct list_head tasks;`
			`struct list_head list;`
			`struct sched_cluster *preferred_cluster;`
			`struct rcu_head rcu;`
			`u64 last_update;`
			`};`

sched: Update fair and rt placement logic to use scheduler clusters Make use of clusters in the fair and rt scheduling classes. This is needed as the freq domain mask can no longer be used to do correct task placement. The freq domain mask was being used to demarcate clusters. Change-Id: I57f74147c7006f22d6760256926c10fd0bf50cbd Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org> Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> [joonwoop@codeaurora.org: fixed merge conflicts due to omitted changes for CONFIG_SCHED_QHMP.] Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> 2015-04-22 17:12:09 +05:30			`extern struct list_head cluster_head;`
			`extern int num_clusters;`
			`extern struct sched_cluster *sched_cluster[NR_CPUS];`

sched: preserve CPU cycle counter in rq Preserve the cycle counter in rq in preparation for the fix to wait time accounting while the CPU is idle. Change-Id: I469263c90e12f39bb36bde5ed26298b7c1c77597 Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> 2016-04-28 15:22:12 -07:00			`struct cpu_cycle {`
			`u64 cycles;`
			`u64 time;`
			`};`

sched: Update fair and rt placement logic to use scheduler clusters 2015-04-22 17:12:09 +05:30			`#define for_each_sched_cluster(cluster) \`
			`list_for_each_entry_rcu(cluster, &cluster_head, list)`

sched: Remove all existence of CONFIG_SCHED_FREQ_INPUT CONFIG_SCHED_FREQ_INPUT was created to keep parts of the scheduler dealing with frequency separate from other parts of the scheduler that deal with task placement. However, over time the two features have become intricately linked, such that SCHED_FREQ_INPUT cannot be turned on without having SCHED_HMP turned on as well. Given this complex inter-dependency and the fact that all old, existing and future targets use both config options, remove this unnecessary feature separation. It will make kernel upgrades a lot simpler and faster. Change-Id: Ia20e40d8a088d50909cc28f5be758fa3e9a4af6f Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> 2016-07-28 19:18:08 -07:00			`#endif /* CONFIG_SCHED_HMP */`
sched: Consolidate hmp stats into their own struct Key hmp stats (nr_big_tasks, nr_small_tasks and cumulative_runnable_average) are currently maintained per-cpu in 'struct rq'. Merge those stats in their own structure (struct hmp_sched_stats) and modify impacted functions to deal with the newly introduced structure. This cleanup is required for a subsequent patch which fixes various issues with use of CFS_BANDWIDTH feature in HMP scheduler. Change-Id: Ieffc10a3b82a102f561331bc385d042c15a33998 Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org> [rameezmustafa@codeaurora.org: Port to msm-3.18] Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> [joonwoop@codeaurora.org: fixed conflict in __update_load_avg().] Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> 2015-01-16 11:27:31 +05:30
sched: Make separate sched.c translation units 2011-10-25 10:00:11 +02:00			`/* CFS-related fields in a runqueue */`
			`struct cfs_rq {`
			`struct load_weight load;`
sched: Change rq->nr_running to unsigned int Since there's a PID space limit of 30bits (see futex.h:FUTEX_TID_MASK) and allocating that many tasks (assuming a lower bound of 2 pages per task) would still take 8T of memory it seems reasonable to say that unsigned int is sufficient for rq->nr_running. When we do get anywhere near that amount of tasks I suspect other things would go funny, load-balancer load computations would really need to be hoisted to 128bit etc. So save a few bytes and convert rq->nr_running and friends to unsigned int. Suggested-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-y3tvyszjdmbibade5bw8zl81@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> 2012-04-26 13:12:27 +02:00			`unsigned int nr_running, h_nr_running;`
sched: Make separate sched.c translation units 2011-10-25 10:00:11 +02:00
			`u64 exec_clock;`
			`u64 min_vruntime;`
			`#ifndef CONFIG_64BIT`
			`u64 min_vruntime_copy;`
			`#endif`

			`struct rb_root tasks_timeline;`
			`struct rb_node *rb_leftmost;`

			`/*`
			`* 'curr' points to currently running entity on this cfs_rq.`
			`* It is set to NULL otherwise (i.e. when none are currently running).`
			`*/`
			`struct sched_entity *curr, *next, *last, *skip;`

			`#ifdef CONFIG_SCHED_DEBUG`
			`unsigned int nr_spread_over;`
			`#endif`

sched: Aggregate load contributed by task entities on parenting cfs_rq For a given task t, we can compute its contribution to load as: task_load(t) = runnable_avg(t) * weight(t) On a parenting cfs_rq we can then aggregate: runnable_load(cfs_rq) = \Sum task_load(t), for all runnable children t Maintain this bottom up, with task entities adding their contributed load to the parenting cfs_rq sum. When a task entity's load changes we add the same delta to the maintained sum. Signed-off-by: Paul Turner <pjt@google.com> Reviewed-by: Ben Segall <bsegall@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20120823141506.514678907@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org> 2012-10-04 13:18:30 +02:00			`#ifdef CONFIG_SMP`
			`/*`
sched/fair: Rewrite runnable load and utilization average tracking The idea of runnable load average (let runnable time contribute to weight) was proposed by Paul Turner and Ben Segall, and it is still followed by this rewrite. This rewrite aims to solve the following issues: 1. cfs_rq's load average (namely runnable_load_avg and blocked_load_avg) is updated at the granularity of an entity at a time, which results in the cfs_rq's load average is stale or partially updated: at any time, only one entity is up to date, all other entities are effectively lagging behind. This is undesirable. To illustrate, if we have n runnable entities in the cfs_rq, as time elapses, they certainly become outdated: t0: cfs_rq { e1_old, e2_old, ..., en_old } and when we update: t1: update e1, then we have cfs_rq { e1_new, e2_old, ..., en_old } t2: update e2, then we have cfs_rq { e1_old, e2_new, ..., en_old } ... We solve this by combining all runnable entities' load averages together in cfs_rq's avg, and update the cfs_rq's avg as a whole. This is based on the fact that if we regard the update as a function, then: w * update(e) = update(w * e) and update(e1) + update(e2) = update(e1 + e2), then w1 * update(e1) + w2 * update(e2) = update(w1 * e1 + w2 * e2) therefore, by this rewrite, we have an entirely updated cfs_rq at the time we update it: t1: update cfs_rq { e1_new, e2_new, ..., en_new } t2: update cfs_rq { e1_new, e2_new, ..., en_new } ... 2. cfs_rq's load average is different between top rq->cfs_rq and other task_group's per CPU cfs_rqs in whether or not blocked_load_average contributes to the load. The basic idea behind runnable load average (the same for utilization) is that the blocked state is taken into account as opposed to only accounting for the currently runnable state. Therefore, the average should include both the runnable/running and blocked load averages. This rewrite does that. 
In addition, we also combine runnable/running and blocked averages of all entities into the cfs_rq's average, and update it together at once. This is based on the fact that: update(runnable) + update(blocked) = update(runnable + blocked) This significantly reduces the code as we don't need to separately maintain/update runnable/running load and blocked load. 3. How task_group entities' share is calculated is complex and imprecise. We reduce the complexity in this rewrite to allow a very simple rule: the task_group's load_avg is aggregated from its per CPU cfs_rqs's load_avgs. Then group entity's weight is simply proportional to its own cfs_rq's load_avg / task_group's load_avg. To illustrate, if a task_group has { cfs_rq1, cfs_rq2, ..., cfs_rqn }, then, task_group_avg = cfs_rq1_avg + cfs_rq2_avg + ... + cfs_rqn_avg, then cfs_rqx's entity's share = cfs_rqx_avg / task_group_avg * task_group's share To sum up, this rewrite in principle is equivalent to the current one, but fixes the issues described above. Turns out, it significantly reduces the code complexity and hence increases clarity and efficiency. In addition, the new averages are more smooth/continuous (no spurious spikes and valleys) and updated more consistently and quickly to reflect the load dynamics. As a result, we have less load tracking overhead, better performance, and especially better power efficiency due to more balanced load. 
Signed-off-by: Yuyang Du <yuyang.du@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: arjan@linux.intel.com Cc: bsegall@google.com Cc: dietmar.eggemann@arm.com Cc: fengguang.wu@intel.com Cc: len.brown@intel.com Cc: morten.rasmussen@arm.com Cc: pjt@google.com Cc: rafael.j.wysocki@intel.com Cc: umgwanakikbuti@gmail.com Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/1436918682-4971-3-git-send-email-yuyang.du@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org> 2015-07-15 08:04:37 +08:00			`* CFS load tracking`
sched: Aggregate load contributed by task entities on parenting cfs_rq 2012-10-04 13:18:30 +02:00			`*/`
sched/fair: Rewrite runnable load and utilization average tracking 2015-07-15 08:04:37 +08:00			`struct sched_avg avg;`
sched/fair: Provide runnable_load_avg back to cfs_rq The cfs_rq's load_avg is composed of runnable_load_avg and blocked_load_avg. Before this series, sometimes the runnable_load_avg is used, and sometimes the load_avg is used. Completely replacing all uses of runnable_load_avg with load_avg may be too big a leap, i.e., the blocked_load_avg is concerned to result in overrated load. Therefore, we get runnable_load_avg back. The new cfs_rq's runnable_load_avg is improved to be updated with all of the runnable sched_entities at the same time, so the one sched_entity updated and the others stale problem is solved. Signed-off-by: Yuyang Du <yuyang.du@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: arjan@linux.intel.com Cc: bsegall@google.com Cc: dietmar.eggemann@arm.com Cc: fengguang.wu@intel.com Cc: len.brown@intel.com Cc: morten.rasmussen@arm.com Cc: pjt@google.com Cc: rafael.j.wysocki@intel.com Cc: umgwanakikbuti@gmail.com Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/1436918682-4971-7-git-send-email-yuyang.du@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org> 2015-07-15 08:04:41 +08:00			`u64 runnable_load_sum;`
			`unsigned long runnable_load_avg;`
sched: Aggregate total task_group load Maintain a global running sum of the average load seen on each cfs_rq belonging to each task group so that it may be used in calculating an appropriate shares:weight distribution. Signed-off-by: Paul Turner <pjt@google.com> Reviewed-by: Ben Segall <bsegall@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20120823141506.792901086@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org> 2012-10-04 13:18:30 +02:00			`#ifdef CONFIG_FAIR_GROUP_SCHED`
sched/fair: Rewrite runnable load and utilization average tracking 2015-07-15 08:04:37 +08:00			`unsigned long tg_load_avg_contrib;`
UPSTREAM: sched/fair: Propagate load during synchronous attach/detach When a task moves from/to a cfs_rq, we set a flag which is then used to propagate the change at parent level (sched_entity and cfs_rq) during next update. If the cfs_rq is throttled, the flag will stay pending until the cfs_rq is unthrottled. For propagating the utilization, we copy the utilization of group cfs_rq to the sched_entity. For propagating the load, we have to take into account the load of the whole task group in order to evaluate the load of the sched_entity. Similarly to what was done before the rewrite of PELT, we add a correction factor in case the task group's load is greater than its share so it will contribute the same load of a task of equal weight. Change-Id: Id34a9888484716961c9027299c0b4d82881a39d1 Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten.Rasmussen@arm.com Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: bsegall@google.com Cc: kernellwp@gmail.com Cc: pjt@google.com Cc: yuyang.du@intel.com Link: http://lkml.kernel.org/r/1478598827-32372-5-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org> (cherry picked from commit 09a43ace1f986b003c118fdf6ddf1fd685692d49) Signed-off-by: Chris Redpath <chris.redpath@arm.com> 2016-11-08 10:53:45 +01:00			`unsigned long propagate_avg;`
sched/fair: Rewrite runnable load and utilization average tracking 2015-07-15 08:04:37 +08:00			`#endif`
			`atomic_long_t removed_load_avg, removed_util_avg;`
			`#ifndef CONFIG_64BIT`
			`u64 load_last_update_time_copy;`
			`#endif`
sched: Replace update_shares weight distribution with per-entity computation Now that the machinery is in place to compute contributed load in a bottom up fashion; replace the shares distribution code within update_shares() accordingly. Signed-off-by: Paul Turner <pjt@google.com> Reviewed-by: Ben Segall <bsegall@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20120823141507.061208672@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org> 2012-10-04 13:18:31 +02:00
sched/fair: Rewrite runnable load and utilization average tracking 2015-07-15 08:04:37 +08:00			`#ifdef CONFIG_FAIR_GROUP_SCHED`
sched: Replace update_shares weight distribution with per-entity computation 2012-10-04 13:18:31 +02:00			`/*`
			`* h_load = weight * f(tg)`
			`*`
			`* Where f(tg) is the recursive weight fraction assigned to`
			`* this group.`
			`*/`
			`unsigned long h_load;`
sched: Move h_load calculation to task_h_load() The bad thing about update_h_load(), which computes hierarchical load factor for task groups, is that it is called for each task group in the system before every load balancer run, and since rebalance can be triggered very often, this function can eat really a lot of cpu time if there are many cpu cgroups in the system. Although the situation was improved significantly by commit a35b646 ('sched, cgroup: Reduce rq->lock hold times for large cgroup hierarchies'), the problem still can arise under some kinds of loads, e.g. when cpus are switching from idle to busy and back very frequently. For instance, when I start 1000 of processes that wake up every millisecond on my 8 cpus host, 'top' and 'perf top' show: Cpu(s): 17.8%us, 24.3%sy, 0.0%ni, 57.9%id, 0.0%wa, 0.0%hi, 0.0%si Events: 243K cycles 7.57% [kernel] [k] __schedule 7.08% [kernel] [k] timerqueue_add 6.13% libc-2.12.so [.] usleep Then if I create 10000 idle cpu cgroups (no processes in them), cpu usage increases significantly although the 'wakers' are still executing in the root cpu cgroup: Cpu(s): 19.1%us, 48.7%sy, 0.0%ni, 31.6%id, 0.0%wa, 0.0%hi, 0.7%si Events: 230K cycles 24.56% [kernel] [k] tg_load_down 5.76% [kernel] [k] __schedule This happens because this particular kind of load triggers 'new idle' rebalance very frequently, which requires calling update_h_load(), which, in turn, calls tg_load_down() for every idle cpu cgroup even though it is absolutely useless, because idle cpu cgroups have no tasks to pull. This patch tries to improve the situation by making h_load calculation proceed only when h_load is really necessary. To achieve this, it substitutes update_h_load() with update_cfs_rq_h_load(), which computes h_load only for a given cfs_rq and all its ascendants, and makes the load balancer call this function whenever it considers if a task should be pulled, i.e. it moves h_load calculations directly to task_h_load(). 
For h_load of the same cfs_rq not to be updated multiple times (in case several tasks in the same cgroup are considered during the same balance run), the patch keeps the time of the last h_load update for each cfs_rq and breaks calculation when it finds h_load to be uptodate. The benefit of it is that h_load is computed only for those cfs_rq's, which really need it, in particular all idle task groups are skipped. Although this, in fact, moves h_load calculation under rq lock, it should not affect latency much, because the amount of work done under rq lock while trying to pull tasks is limited by sched_nr_migrate. After the patch applied with the setup described above (1000 wakers in the root cgroup and 10000 idle cgroups), I get: Cpu(s): 16.9%us, 24.8%sy, 0.0%ni, 58.4%id, 0.0%wa, 0.0%hi, 0.0%si Events: 242K cycles 7.57% [kernel] [k] __schedule 6.70% [kernel] [k] timerqueue_add 5.93% libc-2.12.so [.] usleep Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1373896159-1278-1-git-send-email-vdavydov@parallels.com Signed-off-by: Ingo Molnar <mingo@kernel.org> 2013-07-15 17:49:19 +04:00			`u64 last_h_load_update;`
			`struct sched_entity *h_load_next;`
			`#endif /* CONFIG_FAIR_GROUP_SCHED */`
sched: Replace update_shares weight distribution with per-entity computation Now that the machinery in place is in place to compute contributed load in a bottom up fashion; replace the shares distribution code within update_shares() accordingly. Signed-off-by: Paul Turner <pjt@google.com> Reviewed-by: Ben Segall <bsegall@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20120823141507.061208672@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org> 2012-10-04 13:18:31 +02:00			`#endif /* CONFIG_SMP */`

			`#ifdef CONFIG_FAIR_GROUP_SCHED`
			`struct rq *rq; /* cpu runqueue to which this cfs_rq is attached */`

			`/*`
			`* leaf cfs_rqs are those that hold tasks (lowest schedulable entity in`
			`* a hierarchy). Non-leaf lrqs hold other higher schedulable entities`
			`* (like users, containers etc.)`
			`*`
			`* leaf_cfs_rq_list ties together list of leaf cfs_rq's in a cpu. This`
			`* list is used during load balance.`
			`*/`
			`int on_list;`
			`struct list_head leaf_cfs_rq_list;`
			`struct task_group *tg; /* group that "owns" this runqueue */`

			`#ifdef CONFIG_CFS_BANDWIDTH`
sched: Support CFS_BANDWIDTH feature in HMP scheduler CFS_BANDWIDTH feature is not currently well-supported by HMP scheduler. Issues encountered include a kernel panic when rq->nr_big_tasks count becomes negative. This patch fixes HMP scheduler code to better handle CFS_BANDWIDTH feature. The most prominent change introduced is maintenance of HMP stats (nr_big_tasks, nr_small_tasks, cumulative_runnable_avg) per 'struct cfs_rq' in addition to being maintained in each 'struct rq'. This allows HMP stats to be updated easily when a group is throttled on a cpu. Change-Id: Iad9f378b79ab5d9d76f86d1775913cc1941e266a Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org> [rameezmustafa@codeaurora.org: Port to msm-3.18] Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> [joonwoop@codeaurora.org: fixed minor conflict in dequeue_task_fair().] 2015-01-16 13:57:02 +05:30
			`#ifdef CONFIG_SCHED_HMP`
			`struct hmp_sched_stats hmp_stats;`
			`#endif`

			`int runtime_enabled;`
			`u64 runtime_expires;`
			`s64 runtime_remaining;`

sched: Maintain runnable averages across throttled periods With bandwidth control tracked entities may cease execution according to user specified bandwidth limits. Charging this time as either throttled or blocked however, is incorrect and would falsely skew in either direction. What we actually want is for any throttled periods to be "invisible" to load-tracking as they are removed from the system for that interval and contribute normally otherwise. Do this by moderating the progression of time to omit any periods in which the entity belonged to a throttled hierarchy. Signed-off-by: Paul Turner <pjt@google.com> Reviewed-by: Ben Segall <bsegall@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20120823141506.998912151@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org> 2012-10-04 13:18:31 +02:00			`u64 throttled_clock, throttled_clock_task;`
			`u64 throttled_clock_task_time;`
sched/fair: Initialize throttle_count for new task-groups lazily commit 094f469172e00d6ab0a3130b0e01c83b3cf3a98d upstream. Cgroup created inside throttled group must inherit current throttle_count. Broken throttle_count allows to nominate throttled entries as a next buddy, later this leads to null pointer dereference in pick_next_task_fair(). This patch initialize cfs_rq->throttle_count at first enqueue: laziness allows to skip locking all rq at group creation. Lazy approach also allows to skip full sub-tree scan at throttling hierarchy (not in this patch). Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: bsegall@google.com Link: http://lkml.kernel.org/r/146608182119.21870.8439834428248129633.stgit@buzz Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Ben Pineau <benjamin.pineau@mirakl.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> 2016-06-16 15:57:01 +03:00			`int throttled, throttle_count, throttle_uptodate;`
			`struct list_head throttled_list;`
			`#endif /* CONFIG_CFS_BANDWIDTH */`
			`#endif /* CONFIG_FAIR_GROUP_SCHED */`
			`};`
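The `last_h_load_update` field above supports the lazy h_load scheme described in the "Move h_load calculation to task_h_load()" commit message: h_load is computed only for one cfs_rq and its ancestors, and the walk stops as soon as it reaches a runqueue whose cached value is already current. The following is a minimal user-space sketch of that idea, not the kernel's code. The `toy_` names, the `se_load` field, the `updates` counter and the recursive walk are all assumptions made for illustration (the kernel walks iteratively via `h_load_next`).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for struct cfs_rq. */
struct toy_cfs_rq {
	struct toy_cfs_rq *parent;      /* NULL for the root runqueue */
	uint64_t load;                  /* total load queued on this runqueue */
	uint64_t se_load;               /* load this group's entity adds to its parent */
	uint64_t h_load;                /* cached hierarchical load */
	uint64_t last_h_load_update;    /* timestamp of the cached value, 0 = never */
	unsigned int updates;           /* demo counter: number of recomputations */
};

/*
 * Compute h_load for q and its ancestors only, stopping at the first
 * runqueue whose cached value already carries the timestamp `now`.
 */
static uint64_t toy_update_h_load(struct toy_cfs_rq *q, uint64_t now)
{
	assert(now != 0);               /* 0 is reserved for "never updated" */

	if (q->last_h_load_update == now)
		return q->h_load;       /* up to date: break off the walk */

	if (!q->parent) {
		q->h_load = q->load;
	} else {
		uint64_t parent_h_load = toy_update_h_load(q->parent, now);

		/* Scale the parent's h_load by this group's share of it. */
		q->h_load = parent_h_load * q->se_load / (q->parent->load + 1);
	}

	q->last_h_load_update = now;
	q->updates++;
	return q->h_load;
}
```

Because nothing ever calls the function for a group with no tasks to pull, idle groups are never visited, which is the saving the commit message reports for the 10000 idle cgroups case.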

			`static inline int rt_bandwidth_enabled(void)`
			`{`
			`return sysctl_sched_rt_runtime >= 0;`
			`}`
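`rt_bandwidth_enabled()` only tests the sign of `sysctl_sched_rt_runtime`: a negative value (conventionally -1) turns RT bandwidth limiting off. The sketch below models how such a check could gate a throttling decision. It is a user-space illustration under stated assumptions: the `toy_` names are hypothetical, the 950000 microsecond starting value is the usual default and may differ on a given kernel, and the real throttling path compares per-runqueue `rt_time` against `rt_runtime` under `rt_runtime_lock` rather than against the sysctl directly.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mirror of sysctl_sched_rt_runtime, in microseconds. */
static int toy_sysctl_sched_rt_runtime = 950000;

/* Update the toy sysctl; values below -1 are treated as invalid here. */
static void toy_set_rt_runtime(int runtime_us)
{
	assert(runtime_us >= -1);
	toy_sysctl_sched_rt_runtime = runtime_us;
}

/* Same sign test as rt_bandwidth_enabled() in the listing above. */
static inline int toy_rt_bandwidth_enabled(void)
{
	return toy_sysctl_sched_rt_runtime >= 0;
}

/* Returns 1 when the consumed RT time exceeds the budget, 0 otherwise. */
static int toy_rt_runtime_exceeded(uint64_t rt_time_us)
{
	if (!toy_rt_bandwidth_enabled())
		return 0;       /* no limit configured: never throttle */

	return rt_time_us > (uint64_t)toy_sysctl_sched_rt_runtime;
}
```

Note that a budget of 0 still counts as "enabled" in this model, so any RT time at all exceeds it; only a negative value disables the check.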

sched/rt: Use IPI to trigger RT task push migration instead of pulling When debugging the latencies on a 40 core box, where we hit 300 to 500 microsecond latencies, I found there was a huge contention on the runqueue locks. Investigating it further, running ftrace, I found that it was due to the pulling of RT tasks. The test that was run was the following: cyclictest --numa -p95 -m -d0 -i100 This created a thread on each CPU, that would set its wakeup in iterations of 100 microseconds. The -d0 means that all the threads had the same interval (100us). Each thread sleeps for 100us and wakes up and measures its latencies. cyclictest is maintained at: git://git.kernel.org/pub/scm/linux/kernel/git/clrkwllms/rt-tests.git What happened was another RT task would be scheduled on one of the CPUs that was running our test, when the other CPU tests went to sleep and scheduled idle. This caused the "pull" operation to execute on all these CPUs. Each one of these saw the RT task that was overloaded on the CPU of the test that was still running, and each one tried to grab that task in a thundering herd way. To grab the task, each thread would do a double rq lock grab, grabbing its own lock as well as the rq of the overloaded CPU. As the sched domains on this box was rather flat for its size, I saw up to 12 CPUs block on this lock at once. This caused a ripple affect with the rq locks especially since the taking was done via a double rq lock, which means that several of the CPUs had their own rq locks held while trying to take this rq lock. As these locks were blocked, any wakeups or load balanceing on these CPUs would also block on these locks, and the wait time escalated. I've tried various methods to lessen the load, but things like an atomic counter to only let one CPU grab the task wont work, because the task may have a limited affinity, and we may pick the wrong CPU to take that lock and do the pull, to only find out that the CPU we picked isn't in the task's affinity. 
Instead of doing the PULL, I now have the CPUs that want the pull to send over an IPI to the overloaded CPU, and let that CPU pick what CPU to push the task to. No more need to grab the rq lock, and the push/pull algorithm still works fine. With this patch, the latency dropped to just 150us over a 20 hour run. Without the patch, the huge latencies would trigger in seconds. I've created a new sched feature called RT_PUSH_IPI, which is enabled by default. When RT_PUSH_IPI is not enabled, the old method of grabbing the rq locks and having the pulling CPU do the work is implemented. When RT_PUSH_IPI is enabled, the IPI is sent to the overloaded CPU to do a push. To enabled or disable this at run time: # mount -t debugfs nodev /sys/kernel/debug # echo RT_PUSH_IPI > /sys/kernel/debug/sched_features or # echo NO_RT_PUSH_IPI > /sys/kernel/debug/sched_features Update: This original patch would send an IPI to all CPUs in the RT overload list. But that could theoretically cause the reverse issue. That is, there could be lots of overloaded RT queues and one CPU lowers its priority. It would then send an IPI to all the overloaded RT queues and they could then all try to grab the rq lock of the CPU lowering its priority, and then we have the same problem. The latest design sends out only one IPI to the first overloaded CPU. It tries to push any tasks that it can, and then looks for the next overloaded CPU that can push to the source CPU. The IPIs stop when all overloaded CPUs that have pushable tasks that have priorities greater than the source CPU are covered. In case the source CPU lowers its priority again, a flag is set to tell the IPI traversal to restart with the first RT overloaded CPU after the source CPU. 
Parts-suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Joern Engel <joern@purestorage.com> Cc: Clark Williams <williams@redhat.com> Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20150318144946.2f3cc982@gandalf.local.home Signed-off-by: Ingo Molnar <mingo@kernel.org> 2015-03-18 14:49:46 -04:00			`/* RT IPI pull logic requires IRQ_WORK */`
sched/rt: Simplify the IPI based RT balancing logic commit 4bdced5c9a2922521e325896a7bbbf0132c94e56 upstream. When a CPU lowers its priority (schedules out a high priority task for a lower priority one), a check is made to see if any other CPU has overloaded RT tasks (more than one). It checks the rto_mask to determine this and if so it will request to pull one of those tasks to itself if the non running RT task is of higher priority than the new priority of the next task to run on the current CPU. When we deal with large number of CPUs, the original pull logic suffered from large lock contention on a single CPU run queue, which caused a huge latency across all CPUs. This was caused by only having one CPU having overloaded RT tasks and a bunch of other CPUs lowering their priority. To solve this issue, commit: b6366f048e0c ("sched/rt: Use IPI to trigger RT task push migration instead of pulling") changed the way to request a pull. Instead of grabbing the lock of the overloaded CPU's runqueue, it simply sent an IPI to that CPU to do the work. Although the IPI logic worked very well in removing the large latency build up, it still could suffer from a large number of IPIs being sent to a single CPU. On a 80 CPU box, I measured over 200us of processing IPIs. Worse yet, when I tested this on a 120 CPU box, with a stress test that had lots of RT tasks scheduling on all CPUs, it actually triggered the hard lockup detector! One CPU had so many IPIs sent to it, and due to the restart mechanism that is triggered when the source run queue has a priority status change, the CPU spent minutes! processing the IPIs. Thinking about this further, I realized there's no reason for each run queue to send its own IPI. 
As all CPUs with overloaded tasks must be scanned regardless if there's one or many CPUs lowering their priority, because there's no current way to find the CPU with the highest priority task that can schedule to one of these CPUs, there really only needs to be one IPI being sent around at a time. This greatly simplifies the code! The new approach is to have each root domain have its own irq work, as the rto_mask is per root domain. The root domain has the following fields attached to it: rto_push_work - the irq work to process each CPU set in rto_mask rto_lock - the lock to protect some of the other rto fields rto_loop_start - an atomic that keeps contention down on rto_lock the first CPU scheduling in a lower priority task is the one to kick off the process. rto_loop_next - an atomic that gets incremented for each CPU that schedules in a lower priority task. rto_loop - a variable protected by rto_lock that is used to compare against rto_loop_next rto_cpu - The cpu to send the next IPI to, also protected by the rto_lock. When a CPU schedules in a lower priority task and wants to make sure overloaded CPUs know about it. It increments the rto_loop_next. Then it atomically sets rto_loop_start with a cmpxchg. If the old value is not "0", then it is done, as another CPU is kicking off the IPI loop. If the old value is "0", then it will take the rto_lock to synchronize with a possible IPI being sent around to the overloaded CPUs. If rto_cpu is greater than or equal to nr_cpu_ids, then there's either no IPI being sent around, or one is about to finish. Then rto_cpu is set to the first CPU in rto_mask and an IPI is sent to that CPU. If there's no CPUs set in rto_mask, then there's nothing to be done. When the CPU receives the IPI, it will first try to push any RT tasks that is queued on the CPU but can't run because a higher priority RT task is currently running on that CPU. Then it takes the rto_lock and looks for the next CPU in the rto_mask. 
If it finds one, it simply sends an IPI to that CPU and the process continues. If there's no more CPUs in the rto_mask, then rto_loop is compared with rto_loop_next. If they match, everything is done and the process is over. If they do not match, then a CPU scheduled in a lower priority task as the IPI was being passed around, and the process needs to start again. The first CPU in rto_mask is sent the IPI. This change removes this duplication of work in the IPI logic, and greatly lowers the latency caused by the IPIs. This removed the lockup happening on the 120 CPU machine. It also simplifies the code tremendously. What else could anyone ask for? Thanks to Peter Zijlstra for simplifying the rto_loop_start atomic logic and supplying me with the rto_start_trylock() and rto_start_unlock() helper functions. Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Clark Williams <williams@redhat.com> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: John Kacur <jkacur@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Scott Wood <swood@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20170424114732.1aac6dc4@gandalf.local.home Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> 2017-10-06 14:05:04 -04:00			`#if defined(CONFIG_IRQ_WORK) && defined(CONFIG_SMP)`
			`# define HAVE_RT_PUSH_IPI`
			`#endif`
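The "Simplify the IPI based RT balancing logic" commit message above describes a single IPI that is handed from one overloaded CPU to the next, with `rto_loop_next` acting as a generation counter that forces another pass if a CPU lowered its priority while the IPI was in flight. The following single-threaded sketch models only that bookkeeping. It is an illustration under assumptions: the `toy_` names, the 8-CPU bitmask and the use of -1 for "no IPI in flight" are hypothetical, `rto_lock` is omitted because nothing here runs concurrently, and no real IPI or irq work is sent.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define TOY_NR_CPUS 8

/* Hypothetical subset of the root domain fields named in the commit message. */
struct toy_root_domain {
	uint32_t rto_mask;          /* bit n set: CPU n has overloaded RT tasks */
	int rto_cpu;                /* CPU the IPI was last sent to, -1 if none */
	int rto_loop;               /* generation the walker has caught up with */
	atomic_int rto_loop_next;   /* bumped whenever a CPU lowers its priority */
	atomic_int rto_loop_start;  /* 0 = free, 1 = a CPU is starting the walk */
};

/* Next CPU set in mask after `after`, or TOY_NR_CPUS if there is none. */
static int toy_mask_next(uint32_t mask, int after)
{
	for (int cpu = after + 1; cpu < TOY_NR_CPUS; cpu++)
		if (mask & (1u << cpu))
			return cpu;
	return TOY_NR_CPUS;
}

/* Pick the next IPI target, restarting the pass if the generation moved. */
static int toy_rto_next_cpu(struct toy_root_domain *rd)
{
	assert(rd->rto_cpu >= -1 && rd->rto_cpu < TOY_NR_CPUS);

	for (;;) {
		int cpu = toy_mask_next(rd->rto_mask, rd->rto_cpu);

		rd->rto_cpu = cpu;
		if (cpu < TOY_NR_CPUS)
			return cpu;

		/* End of the mask: stop unless someone asked for another pass. */
		rd->rto_cpu = -1;
		int next = atomic_load(&rd->rto_loop_next);
		if (rd->rto_loop == next)
			return -1;
		rd->rto_loop = next;
	}
}

/* Called by a CPU that just lowered its priority; returns the first target. */
static int toy_tell_cpu_to_push(struct toy_root_domain *rd)
{
	int cpu = -1;
	int expected = 0;

	/* Keep an in-flight walk going for at least one more pass. */
	atomic_fetch_add(&rd->rto_loop_next, 1);

	/* Only one CPU at a time may kick off the walk. */
	if (!atomic_compare_exchange_strong(&rd->rto_loop_start, &expected, 1))
		return -1;

	if (rd->rto_cpu < 0)
		cpu = toy_rto_next_cpu(rd);

	atomic_store(&rd->rto_loop_start, 0);
	return cpu;
}
```

In this model each IPI handler would call `toy_rto_next_cpu()` after pushing its own tasks, so at most one IPI circulates per root domain regardless of how many CPUs call `toy_tell_cpu_to_push()`.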

			`/* Real-Time classes' related field in a runqueue: */`
			`struct rt_rq {`
			`struct rt_prio_array active;`
sched: Change rq->nr_running to unsigned int Since there's a PID space limit of 30bits (see futex.h:FUTEX_TID_MASK) and allocating that many tasks (assuming a lower bound of 2 pages per task) would still take 8T of memory it seems reasonable to say that unsigned int is sufficient for rq->nr_running. When we do get anywhere near that amount of tasks I suspect other things would go funny, load-balancer load computations would really need to be hoisted to 128bit etc. So save a few bytes and convert rq->nr_running and friends to unsigned int. Suggested-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-y3tvyszjdmbibade5bw8zl81@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> 2012-04-26 13:12:27 +02:00			`unsigned int rt_nr_running;`
			`#if defined CONFIG_SMP || defined CONFIG_RT_GROUP_SCHED`
			`struct {`
			`int curr; /* highest queued rt task prio */`
			`#ifdef CONFIG_SMP`
			`int next; /* next highest */`
			`#endif`
			`} highest_prio;`
			`#endif`
			`#ifdef CONFIG_SMP`
			`unsigned long rt_nr_migratory;`
			`unsigned long rt_nr_total;`
			`int overloaded;`
			`struct plist_head pushable_tasks;`
			`#endif /* CONFIG_SMP */`
sched/rt: Substract number of tasks of throttled queues from rq->nr_running Now rq->rt becomes to be able to be in dequeued or enqueued state. We add new member rt_rq->rt_queued, which is used to indicate this. The member is used only for top queue rq->rt_rq. The goal is to fit generic scheme which is used in deadline and fair classes, i.e. throttled rt_rq's rt_nr_running is beeing substracted from rq->nr_running. Signed-off-by: Kirill Tkhai <tkhai@yandex.ru> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1394835300.18748.33.camel@HP-250-G1-Notebook-PC Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> 2014-03-15 02:15:00 +04:00			`int rt_queued;`

			`int rt_throttled;`
			`u64 rt_time;`
			`u64 rt_runtime;`
			`/* Nests inside the rq lock: */`
			`raw_spinlock_t rt_runtime_lock;`

			`#ifdef CONFIG_RT_GROUP_SCHED`
			`unsigned long rt_nr_boosted;`

			`struct rq *rq;`
			`struct task_group *tg;`
			`#endif`
			`};`

			`/* Deadline class' related fields in a runqueue */`
			`struct dl_rq {`
			`/* runqueue is an rbtree, ordered by deadline */`
			`struct rb_root rb_root;`
			`struct rb_node *rb_leftmost;`

			`unsigned long dl_nr_running;`
sched/deadline: Add SCHED_DEADLINE SMP-related data structures & logic Introduces data structures relevant for implementing dynamic migration of -deadline tasks and the logic for checking if runqueues are overloaded with -deadline tasks and for choosing where a task should migrate, when it is the case. Adds also dynamic migrations to SCHED_DEADLINE, so that tasks can be moved among CPUs when necessary. It is also possible to bind a task to a (set of) CPU(s), thus restricting its capability of migrating, or forbidding migrations at all. The very same approach used in sched_rt is utilised: - -deadline tasks are kept into CPU-specific runqueues, - -deadline tasks are migrated among runqueues to achieve the following: * on an M-CPU system the M earliest deadline ready tasks are always running; * affinity/cpusets settings of all the -deadline tasks is always respected. Therefore, this very special form of "load balancing" is done with an active method, i.e., the scheduler pushes or pulls tasks between runqueues when they are woken up and/or (de)scheduled. IOW, every time a preemption occurs, the descheduled task might be sent to some other CPU (depending on its deadline) to continue executing (push). On the other hand, every time a CPU becomes idle, it might pull the second earliest deadline ready task from some other CPU. To enforce this, a pull operation is always attempted before taking any scheduling decision (pre_schedule()), as well as a push one after each scheduling decision (post_schedule()). In addition, when a task arrives or wakes up, the best CPU where to resume it is selected taking into account its affinity mask, the system topology, but also its deadline. E.g., from the scheduling point of view, the best CPU where to wake up (and also where to push) a task is the one which is running the task with the latest deadline among the M executing ones. 
In order to facilitate these decisions, per-runqueue "caching" of the deadlines of the currently running and of the first ready task is used. Queued but not running tasks are also parked in another rb-tree to speed-up pushes. Signed-off-by: Juri Lelli <juri.lelli@gmail.com> Signed-off-by: Dario Faggioli <raistlin@linux.it> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1383831828-15501-5-git-send-email-juri.lelli@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> 2013-11-07 14:43:38 +01:00
			`#ifdef CONFIG_SMP`
			`/*`
			`* Deadline values of the currently executing and the`
			`* earliest ready task on this rq. Caching these facilitates`
			`* the decision whether or not a ready but not running task`
			`* should migrate somewhere else.`
			`*/`
			`struct {`
			`u64 curr;`
			`u64 next;`
			`} earliest_dl;`

			`unsigned long dl_nr_migratory;`
			`int overloaded;`

			`/*`
			`* Tasks on this rq that can be pushed away. They are kept in`
			`* an rb-tree, ordered by tasks' deadlines, with caching`
			`* of the leftmost (earliest deadline) element.`
			`*/`
			`struct rb_root pushable_dl_tasks_root;`
			`struct rb_node *pushable_dl_tasks_leftmost;`
sched/deadline: Add bandwidth management for SCHED_DEADLINE tasks In order of deadline scheduling to be effective and useful, it is important that some method of having the allocation of the available CPU bandwidth to tasks and task groups under control. This is usually called "admission control" and if it is not performed at all, no guarantee can be given on the actual scheduling of the -deadline tasks. Since when RT-throttling has been introduced each task group have a bandwidth associated to itself, calculated as a certain amount of runtime over a period. Moreover, to make it possible to manipulate such bandwidth, readable/writable controls have been added to both procfs (for system wide settings) and cgroupfs (for per-group settings). Therefore, the same interface is being used for controlling the bandwidth distrubution to -deadline tasks and task groups, i.e., new controls but with similar names, equivalent meaning and with the same usage paradigm are added. However, more discussion is needed in order to figure out how we want to manage SCHED_DEADLINE bandwidth at the task group level. Therefore, this patch adds a less sophisticated, but actually very sensible, mechanism to ensure that a certain utilization cap is not overcome per each root_domain (the single rq for !SMP configurations). Another main difference between deadline bandwidth management and RT-throttling is that -deadline tasks have bandwidth on their own (while -rt ones doesn't!), and thus we don't need an higher level throttling mechanism to enforce the desired bandwidth. 
This patch, therefore: - adds system wide deadline bandwidth management by means of: * /proc/sys/kernel/sched_dl_runtime_us, * /proc/sys/kernel/sched_dl_period_us, that determine (i.e., runtime / period) the total bandwidth available on each CPU of each root_domain for -deadline tasks; - couples the RT and deadline bandwidth management, i.e., enforces that the sum of how much bandwidth is being devoted to -rt -deadline tasks to stay below 100%. This means that, for a root_domain comprising M CPUs, -deadline tasks can be created until the sum of their bandwidths stay below: M * (sched_dl_runtime_us / sched_dl_period_us) It is also possible to disable this bandwidth management logic, and be thus free of oversubscribing the system up to any arbitrary level. Signed-off-by: Dario Faggioli <raistlin@linux.it> Signed-off-by: Juri Lelli <juri.lelli@gmail.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1383831828-15501-12-git-send-email-juri.lelli@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> 2013-11-07 14:43:45 +01:00			`#else`
			`struct dl_bw dl_bw;`
			`#endif`
sched: deadline: use deadline bandwidth in scale_rt_capacity Instead of monitoring the exec time of deadline tasks to evaluate the CPU capacity consumed by deadline scheduler class, we can directly calculate it thanks to the sum of utilization of deadline tasks on the CPU. We can remove deadline tasks from rt_avg metric and directly use the average bandwidth of deadline scheduler in scale_rt_capacity. Based in part on a similar patch from Luca Abeni <luca.abeni@unitn.it>. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Steve Muckle <smuckle@linaro.org> 2015-11-03 10:39:01 +01:00			`/* This is the "average utilization" for this runqueue */`
			`s64 avg_bw;`
			`};`
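The admission test described in the bandwidth-management commit message above can be sketched in user-space C. This is a minimal illustration and not the kernel's code: `dl_bw_sketch`, `to_ratio_sketch`, `dl_admit` and the 20-bit fixed-point shift are names and choices assumed for the example. Only the bound M * (runtime / period) is taken from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption for this sketch: bandwidth is a fixed-point fraction with 20 bits. */
#define BW_SHIFT_SKETCH 20

/* Hypothetical stand-in for the dl_bw bookkeeping, not the kernel's struct dl_bw. */
struct dl_bw_sketch {
	uint64_t bw;		/* per-CPU cap: runtime / period, fixed point */
	uint64_t total_bw;	/* sum of the bandwidths already admitted */
};

/* Convert runtime / period into the fixed-point representation. */
static uint64_t to_ratio_sketch(uint64_t period, uint64_t runtime)
{
	assert(period != 0);
	return (runtime << BW_SHIFT_SKETCH) / period;
}

/*
 * Admit a task only if the total stays within cpus * cap, i.e. within
 * M * (runtime / period) for a root domain of M CPUs.
 */
static bool dl_admit(struct dl_bw_sketch *b, unsigned int cpus,
		     uint64_t runtime, uint64_t period)
{
	uint64_t new_bw = to_ratio_sketch(period, runtime);

	if (b->total_bw + new_bw > b->bw * cpus)
		return false;	/* would oversubscribe the root domain */

	b->total_bw += new_bw;
	return true;
}
```

With an illustrative cap of 950000/1000000 per CPU and two CPUs, three tasks that each need 10 ms every 20 ms are admitted and a fourth is refused, because 4 * 0.5 exceeds 2 * 0.95.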

			`#ifdef CONFIG_SMP`

sched: Update max cpu capacity in case of max frequency constraints Wakeup balancing uses cpu capacity awareness and needs to know the system-wide maximum cpu capacity. Patch "sched: Store system-wide maximum cpu capacity in root domain" finds the system-wide maximum cpu capacity during scheduler domain hierarchy setup. This is sufficient as long as maximum frequency invariance is not enabled. If it is enabled, the system-wide maximum cpu capacity can change between scheduler domain hierarchy setups due to frequency capping. The cpu capacity is changed in update_cpu_capacity() which is called in load balance on the lowest scheduler domain hierarchy level. To be able to know if a change in cpu capacity for a certain cpu also has an effect on the system-wide maximum cpu capacity it is normally necessary to iterate over all cpus. This would be way too costly. That's why this patch follows a different approach. The unsigned long max_cpu_capacity value in struct root_domain is replaced with a struct max_cpu_capacity, containing value (the max_cpu_capacity) and cpu (the cpu index of the cpu providing the maximum cpu_capacity). Changes to the system-wide maximum cpu capacity and the cpu index are made if: 1 System-wide maximum cpu capacity < cpu capacity 2 System-wide maximum cpu capacity > cpu capacity and cpu index == cpu There are no changes to the system-wide maximum cpu capacity in all other cases. Atomic read and write access to the pair (max_cpu_capacity.val, max_cpu_capacity.cpu) is enforced by max_cpu_capacity.lock. The access to max_cpu_capacity.val in task_fits_max() is still performed without taking the max_cpu_capacity.lock. The code to set max cpu capacity in build_sched_domains() has been removed because the whole functionality is now provided by update_cpu_capacity() instead. This approach can introduce errors temporarily, e.g. 
in case the cpu currently providing the max cpu capacity has its cpu capacity lowered due to frequency capping and calls update_cpu_capacity() before any cpu which might provide the max cpu now. There is also an outstanding question: Should the cpu capacity of a cpu going idle be set to a very small value? Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> 2015-09-26 18:19:54 +01:00			`struct max_cpu_capacity {`
			`raw_spinlock_t lock;`
			`unsigned long val;`
			`int cpu;`
			`};`
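The two update conditions listed in the commit message for `struct max_cpu_capacity` can be sketched as follows. This is a single-threaded illustration under stated assumptions: `max_cap_sketch` and `update_max_cap` are hypothetical names, and the lock that protects the (val, cpu) pair in the real structure is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, lock-free stand-in for struct max_cpu_capacity. */
struct max_cap_sketch {
	unsigned long val;	/* current system-wide maximum capacity */
	int cpu;		/* CPU that provides it, -1 if none yet */
};

/*
 * Update the cached maximum when:
 *   1. the reporting CPU's capacity is above the cached maximum, or
 *   2. the CPU that provides the cached maximum now reports a lower value.
 * Returns true if the cached pair changed.
 */
static bool update_max_cap(struct max_cap_sketch *m, int cpu,
			   unsigned long capacity)
{
	if (capacity > m->val || (capacity < m->val && m->cpu == cpu)) {
		m->val = capacity;
		m->cpu = cpu;
		return true;
	}
	return false;
}
```

The second condition is what makes the cached value temporarily inaccurate, as the commit message notes: if the CPU holding the maximum is capped, the cache drops to its new value until another CPU with a higher capacity reports in.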

			`/*`
			`* We add the notion of a root-domain which will be used to define per-domain`
			`* variables. Each exclusive cpuset essentially defines an island domain by`
			`* fully partitioning the member cpus from any other cpuset. Whenever a new`
			`* exclusive cpuset is created, we also create and attach a new root-domain`
			`* object.`
			`*`
			`*/`
			`struct root_domain {`
			`atomic_t refcount;`
			`atomic_t rto_count;`
			`struct rcu_head rcu;`
			`cpumask_var_t span;`
			`cpumask_var_t online;`

sched/fair: Implement fast idling of CPUs when the system is partially loaded When a system is lightly loaded (i.e. no more than 1 job per cpu), attempt to pull job to a cpu before putting it to idle is unnecessary and can be skipped. This patch adds an indicator so the scheduler can know when there's no more than 1 active job is on any CPU in the system to skip needless job pulls. On a 4 socket machine with a request/response kind of workload from clients, we saw about 0.13 msec delay when we go through a full load balance to try pull job from all the other cpus. While 0.1 msec was spent on processing the request and generating a response, the 0.13 msec load balance overhead was actually more than the actual work being done. This overhead can be skipped much of the time for lightly loaded systems. With this patch, we tested with a netperf request/response workload that has the server busy with half the cpus in a 4 socket system. We found the patch eliminated 75% of the load balance attempts before idling a cpu. The overhead of setting/clearing the indicator is low as we already gather the necessary info while we call add_nr_running() and update_sd_lb_stats.() We switch to full load balance load immediately if any cpu got more than one job on its run queue in add_nr_running. We'll clear the indicator to avoid load balance when we detect no cpu's have more than one job when we scan the work queues in update_sg_lb_stats(). We are aggressive in turning on the load balance and opportunistic in skipping the load balance. 
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Jason Low <jason.low2@hp.com> Cc: "Paul E.McKenney" <paulmck@linux.vnet.ibm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Davidlohr Bueso <davidlohr@hp.com> Cc: Alex Shi <alex.shi@linaro.org> Cc: Michel Lespinasse <walken@google.com> Cc: Peter Hurley <peter@hurleysoftware.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1403551009.2970.613.camel@schen9-DESK Signed-off-by: Ingo Molnar <mingo@kernel.org> 2014-06-23 12:16:49 -07:00			`/* Indicate more than one runnable task for any CPU */`
			`bool overload;`

sched: Add over-utilization/tipping point indicator Energy-aware scheduling is only meant to be active while the system is _not_ over-utilized. That is, there are spare cycles available to shift tasks around based on their actual utilization to get a more energy-efficient task distribution without depriving any tasks. When above the tipping point task placement is done the traditional way based on load_avg, spreading the tasks across as many cpus as possible based on priority scaled load to preserve smp_nice. Below the tipping point we want to use util_avg instead. We need to define a criteria for when we make the switch. The util_avg for each cpu converges towards 100% (1024) regardless of how many task additional task we may put on it. If we define over-utilized as: sum_{cpus}(rq.cfs.avg.util_avg) + margin > sum_{cpus}(rq.capacity) some individual cpus may be over-utilized running multiple tasks even when the above condition is false. That should be okay as long as we try to spread the tasks out to avoid per-cpu over-utilization as much as possible and if all tasks have the _same_ priority. If the latter isn't true, we have to consider priority to preserve smp_nice. For example, we could have n_cpus nice=-10 util_avg=55% tasks and n_cpus/2 nice=0 util_avg=60% tasks. Balancing based on util_avg we are likely to end up with nice=-10 tasks sharing cpus and nice=0 tasks getting their own as we 1.5*n_cpus tasks in total and 55%+55% is less over-utilized than 55%+60% for those cpus that have to be shared. The system utilization is only 85% of the system capacity, but we are breaking smp_nice. To be sure not to break smp_nice, we have defined over-utilization conservatively as when any cpu in the system is fully utilized at it's highest frequency instead: cpu_rq(any).cfs.avg.util_avg + margin > cpu_rq(any).capacity IOW, as soon as one cpu is (nearly) 100% utilized, we switch to load_avg to factor in priority to preserve smp_nice. 
With this definition, we can skip periodic load-balance as no cpu has an always-running task when the system is not over-utilized. All tasks will be periodic and we can balance them at wake-up. This conservative condition does however mean that some scenarios that could benefit from energy-aware decisions even if one cpu is fully utilized would not get those benefits. For system where some cpus might have reduced capacity on some cpus (RT-pressure and/or big.LITTLE), we want periodic load-balance checks as soon a just a single cpu is fully utilized as it might one of those with reduced capacity and in that case we want to migrate it. cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> 2015-05-09 16:49:57 +01:00			`/* Indicate one or more cpus over-utilized (tipping point) */`
			`bool overutilized;`

			`/*`
			`* The bit corresponding to a CPU gets set here if such CPU has more`
			`* than one runnable -deadline task (as it is below for RT tasks).`
			`*/`
			`cpumask_var_t dlo_mask;`
			`atomic_t dlo_count;`
			`struct dl_bw dl_bw;`
sched/deadline: speed up SCHED_DEADLINE pushes with a push-heap Data from tests confirmed that the original active load balancing logic didn't scale neither in the number of CPU nor in the number of tasks (as sched_rt does). Here we provide a global data structure to keep track of deadlines of the running tasks in the system. The structure is composed by a bitmask showing the free CPUs and a max-heap, needed when the system is heavily loaded. The implementation and concurrent access scheme are kept simple by design. However, our measurements show that we can compete with sched_rt on large multi-CPUs machines [1]. Only the push path is addressed, the extension to use this structure also for pull decisions is straightforward. However, we are currently evaluating different (in order to decrease/avoid contention) data structures to solve possibly both problems. We are also going to re-run tests considering recent changes inside cpupri [2]. [1] http://retis.sssup.it/~jlelli/papers/Ospert11Lelli.pdf [2] http://www.spinics.net/lists/linux-rt-users/msg06778.html Signed-off-by: Juri Lelli <juri.lelli@gmail.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1383831828-15501-14-git-send-email-juri.lelli@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> 2013-11-07 14:43:47 +01:00			`struct cpudl cpudl;`
sched/rt: Simplify the IPI based RT balancing logic commit 4bdced5c9a2922521e325896a7bbbf0132c94e56 upstream. When a CPU lowers its priority (schedules out a high priority task for a lower priority one), a check is made to see if any other CPU has overloaded RT tasks (more than one). It checks the rto_mask to determine this and if so it will request to pull one of those tasks to itself if the non running RT task is of higher priority than the new priority of the next task to run on the current CPU. When we deal with large number of CPUs, the original pull logic suffered from large lock contention on a single CPU run queue, which caused a huge latency across all CPUs. This was caused by only having one CPU having overloaded RT tasks and a bunch of other CPUs lowering their priority. To solve this issue, commit: b6366f048e0c ("sched/rt: Use IPI to trigger RT task push migration instead of pulling") changed the way to request a pull. Instead of grabbing the lock of the overloaded CPU's runqueue, it simply sent an IPI to that CPU to do the work. Although the IPI logic worked very well in removing the large latency build up, it still could suffer from a large number of IPIs being sent to a single CPU. On a 80 CPU box, I measured over 200us of processing IPIs. Worse yet, when I tested this on a 120 CPU box, with a stress test that had lots of RT tasks scheduling on all CPUs, it actually triggered the hard lockup detector! One CPU had so many IPIs sent to it, and due to the restart mechanism that is triggered when the source run queue has a priority status change, the CPU spent minutes! processing the IPIs. Thinking about this further, I realized there's no reason for each run queue to send its own IPI. 
As all CPUs with overloaded tasks must be scanned regardless if there's one or many CPUs lowering their priority, because there's no current way to find the CPU with the highest priority task that can schedule to one of these CPUs, there really only needs to be one IPI being sent around at a time. This greatly simplifies the code! The new approach is to have each root domain have its own irq work, as the rto_mask is per root domain. The root domain has the following fields attached to it: rto_push_work - the irq work to process each CPU set in rto_mask rto_lock - the lock to protect some of the other rto fields rto_loop_start - an atomic that keeps contention down on rto_lock the first CPU scheduling in a lower priority task is the one to kick off the process. rto_loop_next - an atomic that gets incremented for each CPU that schedules in a lower priority task. rto_loop - a variable protected by rto_lock that is used to compare against rto_loop_next rto_cpu - The cpu to send the next IPI to, also protected by the rto_lock. When a CPU schedules in a lower priority task and wants to make sure overloaded CPUs know about it. It increments the rto_loop_next. Then it atomically sets rto_loop_start with a cmpxchg. If the old value is not "0", then it is done, as another CPU is kicking off the IPI loop. If the old value is "0", then it will take the rto_lock to synchronize with a possible IPI being sent around to the overloaded CPUs. If rto_cpu is greater than or equal to nr_cpu_ids, then there's either no IPI being sent around, or one is about to finish. Then rto_cpu is set to the first CPU in rto_mask and an IPI is sent to that CPU. If there's no CPUs set in rto_mask, then there's nothing to be done. When the CPU receives the IPI, it will first try to push any RT tasks that is queued on the CPU but can't run because a higher priority RT task is currently running on that CPU. Then it takes the rto_lock and looks for the next CPU in the rto_mask. 
If it finds one, it simply sends an IPI to that CPU and the process continues. If there's no more CPUs in the rto_mask, then rto_loop is compared with rto_loop_next. If they match, everything is done and the process is over. If they do not match, then a CPU scheduled in a lower priority task as the IPI was being passed around, and the process needs to start again. The first CPU in rto_mask is sent the IPI. This change removes this duplication of work in the IPI logic, and greatly lowers the latency caused by the IPIs. This removed the lockup happening on the 120 CPU machine. It also simplifies the code tremendously. What else could anyone ask for? Thanks to Peter Zijlstra for simplifying the rto_loop_start atomic logic and supplying me with the rto_start_trylock() and rto_start_unlock() helper functions. Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Clark Williams <williams@redhat.com> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: John Kacur <jkacur@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Scott Wood <swood@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20170424114732.1aac6dc4@gandalf.local.home Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> 2017-10-06 14:05:04 -04:00			`#ifdef HAVE_RT_PUSH_IPI`
			`/*`
			`* For IPI pull requests, loop across the rto_mask.`
			`*/`
			`struct irq_work rto_push_work;`
			`raw_spinlock_t rto_lock;`
			`/* These are only updated and read within rto_lock */`
			`int rto_loop;`
			`int rto_cpu;`
			`/* These atomics are updated outside of a lock */`
			`atomic_t rto_loop_next;`
			`atomic_t rto_loop_start;`
			`#endif`
			`/*`
			`* The "RT overload" flag: it gets set if a CPU has more than`
			`* one runnable RT task.`
			`*/`
			`cpumask_var_t rto_mask;`
			`struct cpupri cpupri;`
sched: Store system-wide maximum cpu capacity in root domain To be able to compare the capacity of the target cpu with the highest cpu capacity of the system in the wakeup path, store the system-wide maximum cpu capacity in the root domain. cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> 2015-05-07 18:46:15 +01:00
			`/* Maximum cpu capacity in the system. */`
sched: Update max cpu capacity in case of max frequency constraints Wakeup balancing uses cpu capacity awareness and needs to know the system-wide maximum cpu capacity. Patch "sched: Store system-wide maximum cpu capacity in root domain" finds the system-wide maximum cpu capacity during scheduler domain hierarchy setup. This is sufficient as long as maximum frequency invariance is not enabled. If it is enabled, the system-wide maximum cpu capacity can change between scheduler domain hierarchy setups due to frequency capping. The cpu capacity is changed in update_cpu_capacity() which is called in load balance on the lowest scheduler domain hierarchy level. To be able to know if a change in cpu capacity for a certain cpu also has an effect on the system-wide maximum cpu capacity it is normally necessary to iterate over all cpus. This would be way too costly. That's why this patch follows a different approach. The unsigned long max_cpu_capacity value in struct root_domain is replaced with a struct max_cpu_capacity, containing value (the max_cpu_capacity) and cpu (the cpu index of the cpu providing the maximum cpu_capacity). Changes to the system-wide maximum cpu capacity and the cpu index are made if: 1 System-wide maximum cpu capacity < cpu capacity 2 System-wide maximum cpu capacity > cpu capacity and cpu index == cpu There are no changes to the system-wide maximum cpu capacity in all other cases. Atomic read and write access to the pair (max_cpu_capacity.val, max_cpu_capacity.cpu) is enforced by max_cpu_capacity.lock. The access to max_cpu_capacity.val in task_fits_max() is still performed without taking the max_cpu_capacity.lock. The code to set max cpu capacity in build_sched_domains() has been removed because the whole functionality is now provided by update_cpu_capacity() instead. This approach can introduce errors temporarily, e.g. 
in case the cpu currently providing the max cpu capacity has its cpu capacity lowered due to frequency capping and calls update_cpu_capacity() before any cpu which might provide the max cpu now. There is also an outstanding question: Should the cpu capacity of a cpu going idle be set to a very small value? Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> 2015-09-26 18:19:54 +01:00			`struct max_cpu_capacity max_cpu_capacity;`
sched/core: Add first cpu w/ max/min orig capacity to root domain This will allow to start iterating from a cpu with max or min original capacity in the wakeup path regardless on which cpu the scheduler is currently running (smp_processor_id()) or the previous cpu of the task (task_cpu(p)). This iteration has to happen on a sched_domain spanning all cpus in the order of the sched_groups of this sched_domain seen by the starting cpu. In case of an SMP system the first cpu with max orig capacity and the the one with min orig capacity is the same. This can temporally happen on a big.LITTLE system with hotplug as well. E.g. the different order of cpu iteration can be used to map schedtune task parameter 'boosted' into the cpu iteration order in find_best_target(). Use of READ_ONCE()/WRITE_ONCE() to avoid load/store tearing. Change-Id: I812fbd9c7e5f506617e456c0eec3edcd2c016e92 Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> (cherry picked from commit fd6e9543c1fd8971a5e2e68e39b2f6e591d46114) Signed-off-by: Chris Redpath <chris.redpath@arm.com> 2017-01-08 16:16:59 +00:00
			`/* First cpu with maximum and minimum original capacity */`
			`int max_cap_orig_cpu, min_cap_orig_cpu;`
			`};`
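The two update conditions listed in the "Update max cpu capacity in case of max frequency constraints" commit message can be sketched in user-space C. This is only an illustration: `struct max_cap` and `max_cap_update()` are hypothetical names, and the sketch leaves out the `max_cpu_capacity.lock` that the commit message says protects the (value, cpu) pair.

```c
#include <assert.h>

/* Hypothetical stand-in for struct max_cpu_capacity (lock omitted). */
struct max_cap {
	unsigned long val;	/* current system-wide maximum capacity */
	int cpu;		/* CPU providing val, -1 if none recorded yet */
};

/*
 * Apply the two rules from the commit message:
 *   1. the recorded maximum is smaller than this CPU's capacity, or
 *   2. the recorded maximum is larger, but this CPU is the one that
 *      provided it (its capacity was lowered, e.g. by frequency capping).
 * Returns 1 if the pair was changed, 0 otherwise.
 */
static int max_cap_update(struct max_cap *mc, int cpu, unsigned long capacity)
{
	assert(mc);

	if (mc->val < capacity || (mc->val > capacity && mc->cpu == cpu)) {
		mc->val = capacity;
		mc->cpu = cpu;
		return 1;
	}
	return 0;
}
```

Rule 2 is also where the temporary error mentioned in the commit message comes from: once the providing CPU lowers the value, another CPU with a higher capacity is only picked up when that CPU next runs its own update.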

			`extern struct root_domain def_root_domain;`
sched/rt: Up the root domain ref count when passing it around via IPIs commit 364f56653708ba8bcdefd4f0da2a42904baa8eeb upstream. When issuing an IPI RT push, where an IPI is sent to each CPU that has more than one RT task scheduled on it, it references the root domain's rto_mask, that contains all the CPUs within the root domain that has more than one RT task in the runable state. The problem is, after the IPIs are initiated, the rq->lock is released. This means that the root domain that is associated to the run queue could be freed while the IPIs are going around. Add a sched_get_rd() and a sched_put_rd() that will increment and decrement the root domain's ref count respectively. This way when initiating the IPIs, the scheduler will up the root domain's ref count before releasing the rq->lock, ensuring that the root domain does not go away until the IPI round is complete. Reported-by: Pavan Kondeti <pkondeti@codeaurora.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 4bdced5c9a292 ("sched/rt: Simplify the IPI based RT balancing logic") Link: http://lkml.kernel.org/r/CAEU1=PkiHO35Dzna8EQqNSKW1fr1y1zRQ5y66X117MG06sQtNA@mail.gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> 2018-01-23 20:45:38 -05:00			`extern void sched_get_rd(struct root_domain *rd);`
			`extern void sched_put_rd(struct root_domain *rd);`
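The get/put pairing described in the commit message for `sched_get_rd()` and `sched_put_rd()` follows the usual reference-count pattern. Below is a minimal user-space sketch using C11 atomics; `struct rd_sketch`, `rd_get()` and `rd_put()` are hypothetical names, and the `freed` flag merely stands in for whatever teardown the kernel performs once the last reference is dropped.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for a reference-counted root domain. */
struct rd_sketch {
	atomic_int refcount;
	int freed;		/* demonstration only: set when the last ref is dropped */
};

/* Take an extra reference, e.g. before releasing a lock and starting IPIs. */
static void rd_get(struct rd_sketch *rd)
{
	assert(rd);
	atomic_fetch_add(&rd->refcount, 1);
}

/* Drop a reference; returns 1 if this was the last one. */
static int rd_put(struct rd_sketch *rd)
{
	assert(rd);
	/* atomic_fetch_sub returns the value held before the decrement. */
	if (atomic_fetch_sub(&rd->refcount, 1) != 1)
		return 0;
	rd->freed = 1;
	return 1;
}
```

The point of the pattern is that whoever starts the IPI round holds its own reference, so the object cannot disappear until the round's final put.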

			`#ifdef HAVE_RT_PUSH_IPI`
			`extern void rto_push_irq_work_func(struct irq_work *work);`
			`#endif`
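The "walk `rto_mask`, then compare `rto_loop` with `rto_loop_next`" step from the "Simplify the IPI based RT balancing logic" commit message can be sketched as a single-threaded user-space model. Everything here is an assumption made for illustration: the mask is a plain 64-bit word, `-1` is used as the "no IPI in flight" value of `rto_cpu`, `rto_lock` is omitted, and `rto_next_cpu_sketch()` is a hypothetical name rather than the kernel's function.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define NR_CPUS_SKETCH 64

struct rto_sketch {
	uint64_t rto_mask;		/* bit n set: CPU n has overloaded RT tasks */
	int rto_cpu;			/* last CPU handed the IPI, -1 when idle */
	int rto_loop;			/* loop generation being serviced */
	atomic_int rto_loop_next;	/* bumped by each CPU lowering its priority */
};

/* Next set bit strictly after prev, or NR_CPUS_SKETCH if there is none. */
static int mask_next(uint64_t mask, int prev)
{
	for (int cpu = prev + 1; cpu < NR_CPUS_SKETCH; cpu++)
		if (mask & (UINT64_C(1) << cpu))
			return cpu;
	return NR_CPUS_SKETCH;
}

/*
 * Pick the next CPU to receive the IPI, or -1 when the round is over.
 * At the end of the mask, a changed rto_loop_next means some CPU lowered
 * its priority during the pass, so the walk restarts from the first CPU.
 */
static int rto_next_cpu_sketch(struct rto_sketch *rd)
{
	assert(rd);

	for (;;) {
		int cpu = mask_next(rd->rto_mask, rd->rto_cpu);

		rd->rto_cpu = cpu;
		if (cpu < NR_CPUS_SKETCH)
			return cpu;

		rd->rto_cpu = -1;
		int next = atomic_load_explicit(&rd->rto_loop_next,
						memory_order_acquire);
		if (rd->rto_loop == next)
			return -1;
		rd->rto_loop = next;
	}
}
```

With CPUs 2 and 5 set in the mask, a bump of `rto_loop_next` in the middle of a pass makes the walk visit 2 and 5 once more before it reports that the round is finished.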
			`#endif /* CONFIG_SMP */`
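The comment on `struct rq` that follows states a locking rule: code that locks several runqueues must acquire them in ascending address order. A minimal user-space sketch of that rule is shown here, assuming a toy spinlock built on `atomic_flag`; `struct rq_sketch` and the `*_sketch` helpers are hypothetical names, not the kernel's implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical runqueue stand-in holding only a toy spinlock. */
struct rq_sketch {
	atomic_flag lock;
};

static void rq_lock(struct rq_sketch *rq)
{
	while (atomic_flag_test_and_set_explicit(&rq->lock, memory_order_acquire))
		;	/* spin until the holder clears the flag */
}

static void rq_unlock(struct rq_sketch *rq)
{
	atomic_flag_clear_explicit(&rq->lock, memory_order_release);
}

/* Lock two runqueues, always taking the lower address first. */
static void double_rq_lock_sketch(struct rq_sketch *a, struct rq_sketch *b)
{
	assert(a && b);

	if (a == b) {
		rq_lock(a);	/* same runqueue: take the lock only once */
		return;
	}
	if ((uintptr_t)a > (uintptr_t)b) {
		struct rq_sketch *tmp = a;

		a = b;
		b = tmp;
	}
	rq_lock(a);
	rq_lock(b);
}

static void double_rq_unlock_sketch(struct rq_sketch *a, struct rq_sketch *b)
{
	rq_unlock(a);
	if (a != b)
		rq_unlock(b);
}
```

Because every caller agrees on the same order, two CPUs that each want both locks cannot end up holding one lock apiece while waiting for the other.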

			`/*`
			`* This is the main, per-CPU runqueue data structure.`
			`*`
			`* Locking rule: those places that want to lock multiple runqueues`
			`* (such as the load balancing or the thread migration code), lock`
			`* acquire operations must be ordered by ascending &runqueue.`
			`*/`
			`struct rq {`
			`/* runqueue lock: */`
			`raw_spinlock_t lock;`

			`/*`
			`* nr_running and cpu_load should be in the same cacheline because`
			`* remote CPUs use both these fields when doing load calculation.`
			`*/`
sched: Change rq->nr_running to unsigned int Since there's a PID space limit of 30bits (see futex.h:FUTEX_TID_MASK) and allocating that many tasks (assuming a lower bound of 2 pages per task) would still take 8T of memory it seems reasonable to say that unsigned int is sufficient for rq->nr_running. When we do get anywhere near that amount of tasks I suspect other things would go funny, load-balancer load computations would really need to be hoisted to 128bit etc. So save a few bytes and convert rq->nr_running and friends to unsigned int. Suggested-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-y3tvyszjdmbibade5bw8zl81@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> 2012-04-26 13:12:27 +02:00			`unsigned int nr_running;`
sched/numa: Avoid migrating tasks that are placed on their preferred node This patch classifies scheduler domains and runqueues into types depending the number of tasks that are about their NUMA placement and the number that are currently running on their preferred node. The types are regular: There are tasks running that do not care about their NUMA placement. remote: There are tasks running that care about their placement but are currently running on a node remote to their ideal placement all: No distinction To implement this the patch tracks the number of tasks that are optimally NUMA placed (rq->nr_preferred_running) and the number of tasks running that care about their placement (nr_numa_running). The load balancer uses this information to avoid migrating idea placed NUMA tasks as long as better options for load balancing exists. For example, it will not consider balancing between a group whose tasks are all perfectly placed and a group with remote tasks. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/1381141781-10992-56-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org> 2013-10-07 11:29:33 +01:00			`#ifdef CONFIG_NUMA_BALANCING`
			`unsigned int nr_numa_running;`
			`unsigned int nr_preferred_running;`
			`#endif`
			`#define CPU_LOAD_IDX_MAX 5`
			`unsigned long cpu_load[CPU_LOAD_IDX_MAX];`
			`unsigned long last_load_update_tick;`
sched: Add group_misfit_task load-balance type To maximize throughput in systems with reduced capacity cpus (e.g. high RT/IRQ load and/or ARM big.LITTLE) load-balancing has to consider task and cpu utilization as well as per-cpu compute capacity when load-balancing in addition to the current average load based load-balancing policy. Tasks that are scheduled on a reduced capacity cpu need to be identified and migrated to a higher capacity cpu if possible. To implement this additional policy an additional group_type (load-balance scenario) is added: group_misfit_task. This represents scenarios where a sched_group has tasks that are not suitable for its per-cpu capacity. group_misfit_task is only considered if the system is not overloaded in any other way (group_imbalanced or group_overloaded). Identifying misfit tasks requires the rq lock to be held. To avoid taking remote rq locks to examine source sched_groups for misfit tasks, each cpu is responsible for tracking misfit tasks themselves and update the rq->misfit_task flag. This means checking task utilization when tasks are scheduled and on sched_tick. Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> 2016-02-25 12:47:54 +00:00			`unsigned int misfit_task;`
nohz: Rename CONFIG_NO_HZ to CONFIG_NO_HZ_COMMON We are planning to convert the dynticks Kconfig options layout into a choice menu. The user must be able to easily pick any of the following implementations: constant periodic tick, idle dynticks, full dynticks. As this implies a mutual exclusion, the two dynticks implementions need to converge on the selection of a common Kconfig option in order to ease the sharing of a common infrastructure. It would thus seem pretty natural to reuse CONFIG_NO_HZ to that end. It already implements all the idle dynticks code and the full dynticks depends on all that code for now. So ideally the choice menu would propose CONFIG_NO_HZ_IDLE and CONFIG_NO_HZ_EXTENDED then both would select CONFIG_NO_HZ. On the other hand we want to stay backward compatible: if CONFIG_NO_HZ is set in an older config file, we want to enable CONFIG_NO_HZ_IDLE by default. But we can't afford both at the same time or we run into a circular dependency: 1) CONFIG_NO_HZ_IDLE and CONFIG_NO_HZ_EXTENDED both select CONFIG_NO_HZ 2) If CONFIG_NO_HZ is set, we default to CONFIG_NO_HZ_IDLE We might be able to support that from Kconfig/Kbuild but it may not be wise to introduce such a confusing behaviour. So to solve this, create a new CONFIG_NO_HZ_COMMON option which gathers the common code between idle and full dynticks (that common code for now is simply the idle dynticks code) and select it from their referring Kconfig. Then we'll later create CONFIG_NO_HZ_IDLE and map CONFIG_NO_HZ to it for backward compatibility. 
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Christoph Lameter <cl@linux.com> Cc: Geoff Levand <geoff@infradead.org> Cc: Gilad Ben Yossef <gilad@benyossef.com> Cc: Hakan Akkan <hakanakkan@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Kevin Hilman <khilman@linaro.org> Cc: Li Zhong <zhong@linux.vnet.ibm.com> Cc: Namhyung Kim <namhyung.kim@lge.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> 2011-08-10 23:21:01 +02:00			`#ifdef CONFIG_NO_HZ_COMMON`
			`u64 nohz_stamp;`
sched, nohz: Introduce nohz_flags in 'struct rq' Introduce nohz_flags in the struct rq, which will track these two flags for now. NOHZ_TICK_STOPPED keeps track of the tick stopped status that gets set when the tick is stopped. It will be used to update the nohz idle load balancer data structures during the first busy tick after the tick is restarted. At this first busy tick after tickless idle, NOHZ_TICK_STOPPED flag will be reset. This will minimize the nohz idle load balancer status updates that currently happen for every tickless exit, making it more scalable when there are many logical cpu's that enter and exit idle often. NOHZ_BALANCE_KICK will track the need for nohz idle load balance on this rq. This will replace the nohz_balance_kick in the rq, which was not being updated atomically. Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20111202010832.499438999@sbsiddha-desk.sc.intel.com Signed-off-by: Ingo Molnar <mingo@elte.hu> 2011-12-01 17:07:32 -08:00			`unsigned long nohz_flags;`
sched: Keep at least 1 tick per second for active dynticks tasks The scheduler doesn't yet fully support environments with a single task running without a periodic tick. In order to ensure we still maintain the duties of scheduler_tick(), keep at least 1 tick per second. This makes sure that we keep the progression of various scheduler accounting and background maintainance even with a very low granularity. Examples include cpu load, sched average, CFS entity vruntime, avenrun and events such as load balancing, amongst other details handled in sched_class::task_tick(). This limitation will be removed in the future once we get these individual items to work in full dynticks CPUs. Suggested-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Hakan Akkan <hakanakkan@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Kevin Hilman <khilman@linaro.org> Cc: Li Zhong <zhong@linux.vnet.ibm.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> 2013-05-03 03:39:05 +02:00			`#endif`
			`#ifdef CONFIG_NO_HZ_FULL`
			`unsigned long last_sched_tick;`
			`#endif`
CHROMIUM: sched: update the average of nr_running Doing a Exponential moving average per nr_running++/-- does not guarantee a fixed sample rate which induces errors if there are lots of threads being enqueued/dequeued from the rq (Linpack mt). Instead of keeping track of the avg, the scheduler now keeps track of the integral of nr_running and allows the readers to perform filtering on top. Original-author: Sai Charan Gurrappadi <sgurrappadi@nvidia.com> Change-Id: Id946654f32fa8be0eaf9d8fa7c9a8039b5ef9fab Signed-off-by: Joseph Lo <josephl@nvidia.com> Signed-off-by: Andrew Bresticker <abrestic@chromium.org> Reviewed-on: https://chromium-review.googlesource.com/174694 Reviewed-on: https://chromium-review.googlesource.com/272853 [jstultz: fwdported to 4.4] Signed-off-by: John Stultz <john.stultz@linaro.org> 2013-04-22 14:39:18 +08:00
			`#ifdef CONFIG_CPU_QUIET`
			`/* time-based average load */`
			`u64 nr_last_stamp;`
			`u64 nr_running_integral;`
			`seqcount_t ave_seqcnt;`
			`#endif`

			`/* capture load from all tasks on this cpu: */`
			`struct load_weight load;`
			`unsigned long nr_load_updates;`
			`u64 nr_switches;`

			`struct cfs_rq cfs;`
			`struct rt_rq rt;`
			`struct dl_rq dl;`

			`#ifdef CONFIG_FAIR_GROUP_SCHED`
			`/* list of leaf cfs_rq on this cpu: */`
			`struct list_head leaf_cfs_rq_list;`
UPSTREAM: sched/fair: Fix hierarchical order in rq->leaf_cfs_rq_list Fix the insertion of cfs_rq in rq->leaf_cfs_rq_list to ensure that a child will always be called before its parent. The hierarchical order in shares update list has been introduced by commit: 67e86250f8ea ("sched: Introduce hierarchal order on shares update list") With the current implementation a child can be still put after its parent. Lets take the example of: root \ b /\ c d* \| e* with root -> b -> c already enqueued but not d -> e so the leaf_cfs_rq_list looks like: head -> c -> b -> root -> tail The branch d -> e will be added the first time that they are enqueued, starting with e then d. When e is added, its parents is not already on the list so e is put at the tail : head -> c -> b -> root -> e -> tail Then, d is added at the head because its parent is already on the list: head -> d -> c -> b -> root -> e -> tail e is not placed at the right position and will be called the last whereas it should be called at the beginning. Because it follows the bottom-up enqueue sequence, we are sure that we will finished to add either a cfs_rq without parent or a cfs_rq with a parent that is already on the list. We can use this event to detect when we have finished to add a new branch. For the others, whose parents are not already added, we have to ensure that they will be added after their children that have just been inserted the steps before, and after any potential parents that are already in the list. The easiest way is to put the cfs_rq just after the last inserted one and to keep track of it untl the branch is fully added. 
Change-Id: I4fe0b8502ea628c13d14e8e5c5279bce67fb8845 Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten.Rasmussen@arm.com Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: bsegall@google.com Cc: kernellwp@gmail.com Cc: pjt@google.com Cc: yuyang.du@intel.com Link: http://lkml.kernel.org/r/1478598827-32372-3-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org> (cherry picked from commit 9c2791f936ef5fd04a118b5c284f2c9a95f4a647) Signed-off-by: Chris Redpath <chris.redpath@arm.com> 2016-11-08 10:53:43 +01:00			`struct list_head *tmp_alone_branch;`
sched, cgroup: Reduce rq->lock hold times for large cgroup hierarchies Peter Portante reported that for large cgroup hierarchies (and or on large CPU counts) we get immense lock contention on rq->lock and stuff stops working properly. His workload was a ton of processes, each in their own cgroup, everybody idling except for a sporadic wakeup once every so often. It was found that: schedule() idle_balance() load_balance() local_irq_save() double_rq_lock() update_h_load() walk_tg_tree(tg_load_down) tg_load_down() Results in an entire cgroup hierarchy walk under rq->lock for every new-idle balance and since new-idle balance isn't throttled this results in a lot of work while holding the rq->lock. This patch does two things, it removes the work from under rq->lock based on the good principle of race and pray which is widely employed in the load-balancer as a whole. And secondly it throttles the update_h_load() calculation to max once per jiffy. I considered excluding update_h_load() for new-idle balance all-together, but purely relying on regular balance passes to update this data might not work out under some rare circumstances where the new-idle busiest isn't the regular busiest for a while (unlikely, but a nightmare to debug if someone hits it and suffers). Cc: pjt@google.com Cc: Larry Woodman <lwoodman@redhat.com> Cc: Mike Galbraith <efault@gmx.de> Reported-by: Peter Portante <pportant@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-aaarrzfpnaam7pqrekofu8a6@git.kernel.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de> 2012-08-08 21:46:40 +02:00			`#endif /* CONFIG_FAIR_GROUP_SCHED */`

sched: Make separate sched.c translation units Since once needs to do something at conferences and fixing compile warnings doesn't actually require much if any attention I decided to break up the sched.c #include ".c" fest. This further modularizes the scheduler code. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-x0fcd3mnp8f9c99grcpewmhi@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu> 2011-10-25 10:00:11 +02:00			`/*`
			`* This is part of a global counter where only the total sum`
			`* over all CPUs matters. A task can increase this counter on`
			`* one CPU and if it got migrated afterwards it may decrease`
			`* it on another CPU. Always updated under the runqueue lock:`
			`*/`
			`unsigned long nr_uninterruptible;`
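The comment above says only the sum over all CPUs is meaningful. A small sketch of why, assuming a hypothetical four-CPU system and hypothetical helper names: a task that goes to sleep on one CPU and wakes on another leaves one per-CPU counter incremented and another wrapped below zero, yet the unsigned sum is still correct.

```c
#define NR_CPUS_DEMO 4	/* hypothetical CPU count for this sketch */

/* One counter per CPU, mirroring the per-runqueue field. */
unsigned long nr_uninterruptible_demo[NR_CPUS_DEMO];

/* A task enters uninterruptible sleep while on 'cpu'. */
void task_sleeps_uninterruptibly(int cpu)
{
	nr_uninterruptible_demo[cpu]++;
}

/* A task leaves uninterruptible sleep; 'cpu' may differ from where it slept. */
void task_wakes(int cpu)
{
	nr_uninterruptible_demo[cpu]--;
}

/* Only this total is meaningful; unsigned wrap-around cancels out in the sum. */
unsigned long nr_uninterruptible_total(void)
{
	unsigned long sum = 0;

	for (int i = 0; i < NR_CPUS_DEMO; i++)
		sum += nr_uninterruptible_demo[i];
	return sum;
}
```

After a sleep on CPU 0 and a wake-up on CPU 2, the counter for CPU 2 holds the largest `unsigned long` value, while the total is 0.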

			`struct task_struct *curr, *idle, *stop;`
			`unsigned long next_balance;`
			`struct mm_struct *prev_mm;`

sched/core: Rework rq->clock update skips The original purpose of rq::skip_clock_update was to avoid 'costly' clock updates for back-to-back wakeup-preempt pairs. The big problem with it has always been that the rq variable is unaware of the context and causes indiscriminate clock skips. Rework the entire thing and create a sense of context by only allowing schedule() to skip clock updates. (XXX can we measure the cost of the added store?) By ensuring only schedule can ever skip an update, we guarantee we're never more than 1 tick behind on the update. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: umgwanakikbuti@gmail.com Link: http://lkml.kernel.org/r/20150105103554.432381549@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org> 2015-01-05 11:18:11 +01:00			`unsigned int clock_skip_update;`
sched: Make separate sched.c translation units Since once needs to do something at conferences and fixing compile warnings doesn't actually require much if any attention I decided to break up the sched.c #include ".c" fest. This further modularizes the scheduler code. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-x0fcd3mnp8f9c99grcpewmhi@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu> 2011-10-25 10:00:11 +02:00			`u64 clock;`
			`u64 clock_task;`

			`atomic_t nr_iowait;`

			`#ifdef CONFIG_SMP`
			`struct root_domain *rd;`
			`struct sched_domain *sd;`

sched: Remove remaining dubious usage of "power" It is better not to think about compute capacity as being equivalent to "CPU power". The upcoming "power aware" scheduler work may create confusion with the notion of energy consumption if "power" is used too liberally. This is the remaining "power" -> "capacity" rename for local symbols. Those symbols visible to the rest of the kernel are not included yet. Signed-off-by: Nicolas Pitre <nico@linaro.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: linaro-kernel@lists.linaro.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/n/tip-yyyhohzhkwnaotr3lx8zd5aa@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> 2014-05-26 18:19:38 -04:00			`unsigned long cpu_capacity;`
sched: Add struct rq::cpu_capacity_orig This new field 'cpu_capacity_orig' reflects the original capacity of a CPU before being altered by rt tasks and/or IRQ. The cpu_capacity_orig will be used: - to detect when the capacity of a CPU has been noticeably reduced so we can trigger load balance to look for a CPU with better capacity. As an example, we can detect when a CPU handles a significant amount of irq (with CONFIG_IRQ_TIME_ACCOUNTING) but this CPU is seen as an idle CPU by scheduler whereas CPUs, which are really idle, are available. - evaluate the available capacity for CFS tasks Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com> Acked-by: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Morten.Rasmussen@arm.com Cc: dietmar.eggemann@arm.com Cc: efault@gmx.de Cc: linaro-kernel@lists.linaro.org Cc: nicolas.pitre@linaro.org Cc: preeti@linux.vnet.ibm.com Cc: riel@redhat.com Link: http://lkml.kernel.org/r/1425052454-25797-7-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org> 2015-02-27 16:54:09 +01:00			`unsigned long cpu_capacity_orig;`
sched: Make separate sched.c translation units Since once needs to do something at conferences and fixing compile warnings doesn't actually require much if any attention I decided to break up the sched.c #include ".c" fest. This further modularizes the scheduler code. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-x0fcd3mnp8f9c99grcpewmhi@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu> 2011-10-25 10:00:11 +02:00
sched: Replace post_schedule with a balance callback list Generalize the post_schedule() stuff into a balance callback list. This allows us to more easily use it outside of schedule() and cross sched_class. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: ktkhai@parallels.com Cc: rostedt@goodmis.org Cc: juri.lelli@gmail.com Cc: pang.xunlei@linaro.org Cc: oleg@redhat.com Cc: wanpeng.li@linux.intel.com Cc: umgwanakikbuti@gmail.com Link: http://lkml.kernel.org/r/20150611124742.424032725@infradead.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de> 2015-06-11 14:46:37 +02:00			`struct callback_head *balance_callback;`

sched: Make separate sched.c translation units Since once needs to do something at conferences and fixing compile warnings doesn't actually require much if any attention I decided to break up the sched.c #include ".c" fest. This further modularizes the scheduler code. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-x0fcd3mnp8f9c99grcpewmhi@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu> 2011-10-25 10:00:11 +02:00			`unsigned char idle_balance;`
			`/* For active balancing */`
			`int active_balance;`
			`int push_cpu;`
sched: Extend active balance to accept 'push_task' argument Active balance currently picks one task to migrate from busy cpu to a chosen cpu (push_cpu). This patch extends active load balance to recognize a particular task ('push_task') that needs to be migrated to 'push_cpu'. This capability will be leveraged by HMP-aware task placement in a subsequent patch. Change-Id: If31320111e6cc7044e617b5c3fd6d8e0c0e16952 Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org> [rameezmustafa@codeaurora.org]: Port to msm-3.18] Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> 2014-03-31 10:34:41 -07:00			`struct task_struct *push_task;`
sched: Make separate sched.c translation units Since once needs to do something at conferences and fixing compile warnings doesn't actually require much if any attention I decided to break up the sched.c #include ".c" fest. This further modularizes the scheduler code. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-x0fcd3mnp8f9c99grcpewmhi@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu> 2011-10-25 10:00:11 +02:00			`struct cpu_stop_work active_balance_work;`
			`/* cpu of this runqueue: */`
			`int cpu;`
			`int online;`

sched: Ditch per cgroup task lists for load-balancing Per cgroup load-balance has numerous problems, chief amongst them that there is no real sane order in them. So stop pretending it makes sense and enqueue all tasks on a single list. This also allows us to more easily fix the fwd progress issue uncovered by the lock-break stuff. Rotate the list on failure to migrate and limit the total iterations to nr_running (which with releasing the lock isn't strictly accurate but close enough). Also add a filter that skips very light tasks on the first attempt around the list, this attempts to avoid shooting whole cgroups around without affecting over balance. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: pjt@google.com Link: http://lkml.kernel.org/n/tip-tx8yqydc7eimgq7i4rkc3a4g@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu> 2012-02-20 21:49:09 +01:00			`struct list_head cfs_tasks;`

sched: Make separate sched.c translation units Since once needs to do something at conferences and fixing compile warnings doesn't actually require much if any attention I decided to break up the sched.c #include ".c" fest. This further modularizes the scheduler code. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-x0fcd3mnp8f9c99grcpewmhi@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu> 2011-10-25 10:00:11 +02:00			`u64 rt_avg;`
			`u64 age_stamp;`
			`u64 idle_stamp;`
			`u64 avg_idle;`
sched/balancing: Consider max cost of idle balance per sched domain In this patch, we keep track of the max cost we spend doing idle load balancing for each sched domain. If the avg time the CPU remains idle is less than the time we have already spent on idle balancing + the max cost of idle balancing in the sched domain, then we don't continue to attempt the balance. We also keep a per rq variable, max_idle_balance_cost, which keeps track of the max time spent on newidle load balances throughout all its domains so that we can determine the avg_idle's max value. By using the max, we avoid overrunning the average. This further reduces the chance we attempt balancing when the CPU is not idle for longer than the cost to balance. Signed-off-by: Jason Low <jason.low2@hp.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1379096813-3032-3-git-send-email-jason.low2@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org> 2013-09-13 11:26:52 -07:00
			`/* This is used to determine avg_idle's max value */`
			`u64 max_idle_balance_cost;`
sched: Make separate sched.c translation units Since once needs to do something at conferences and fixing compile warnings doesn't actually require much if any attention I decided to break up the sched.c #include ".c" fest. This further modularizes the scheduler code. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-x0fcd3mnp8f9c99grcpewmhi@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu> 2011-10-25 10:00:11 +02:00			`#endif`
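The cut-off rule from the idle-balance commit message above (stop when the expected idle time is shorter than the cost already spent plus the domain's worst-case cost) can be sketched like this. It is a simplified model under stated assumptions: `struct demo_domain`, `this_cost` (a pretend measured cost per domain) and `newidle_balance_demo()` are hypothetical, and times are plain integers in arbitrary units.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified scheduling domain for this sketch. */
struct demo_domain {
	uint64_t max_newidle_lb_cost;	/* worst balance cost seen so far */
	uint64_t this_cost;		/* pretend cost of balancing it now */
};

/*
 * Walk the domains from smallest to largest and stop as soon as the
 * expected idle time no longer covers the cost spent so far plus the
 * domain's worst-case cost. Returns the number of domains balanced and
 * raises *max_idle_balance_cost if this pass was the most expensive yet.
 */
size_t newidle_balance_demo(uint64_t avg_idle, struct demo_domain *sd,
			    size_t nr_domains, uint64_t *max_idle_balance_cost)
{
	uint64_t curr_cost = 0;
	size_t balanced = 0;

	for (size_t i = 0; i < nr_domains; i++) {
		if (avg_idle < curr_cost + sd[i].max_newidle_lb_cost)
			break;

		/* Stand-in for the real balance attempt and its timing. */
		if (sd[i].this_cost > sd[i].max_newidle_lb_cost)
			sd[i].max_newidle_lb_cost = sd[i].this_cost;
		curr_cost += sd[i].this_cost;
		balanced++;
	}

	if (curr_cost > *max_idle_balance_cost)
		*max_idle_balance_cost = curr_cost;
	return balanced;
}
```

With an average idle time of 500 and three domains whose worst-case costs are 100, 300 and 1000, the first two are balanced and the third is skipped, because the accumulated cost plus 1000 exceeds the idle budget.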

sched: window-stats: Enhance cpu busy time accounting rq->curr/prev_runnable_sum counters represent cpu demand from various tasks that have run on a cpu. Any task that runs on a cpu will have a representation in rq->curr_runnable_sum. Their partial_demand value will be included in rq->curr_runnable_sum. Since partial_demand is derived from historical load samples for a task, rq->curr_runnable_sum could represent "inflated/un-realistic" cpu usage. As an example, lets say that task with partial_demand of 10ms runs for only 1ms on a cpu. What is included in rq->curr_runnable_sum is 10ms (and not the actual execution time of 1ms). This leads to cpu busy time being reported on the upside causing frequency to stay higher than necessary. This patch fixes cpu busy accounting scheme to strictly represent actual usage. It also provides for conditional fixup of busy time upon migration and upon heavy-task wakeup. CRs-Fixed: 691443 Change-Id: Ic4092627668053934049af4dfef65d9b6b901e6b Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org> Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> [joonwoop@codeaurora.org: fixed conflict in init_task_load(), se.avg.decay_count has deprecated.] Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> 2014-09-01 13:26:53 +05:30			`#ifdef CONFIG_SCHED_HMP`
sched: Introduce the concept CPU clusters in the scheduler A cluster is a set of CPUs sharing some power controls and an L2 cache. This patch builds a list of clusters at bootup which are sorted by their max_power_cost. Many cluster-shared attributes like cur_freq, max_freq etc are needlessly maintained in per-cpu 'struct rq' currently. Consolidate them in a cluster structure. Change-Id: I0567672ad5fb67d211d9336181ceb53b9f6023af Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org> Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> [joonwoop@codeaurora.org: fixed minor conflict in arch/arm64/kernel/topology.c. fixed conflict due to omitted changes for CONFIG_SCHED_QHMP.] Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> 2015-04-20 12:35:48 +05:30			`struct sched_cluster *cluster;`
sched: add migration load change notifier for frequency guidance When a task moves between CPUs in two different frequency domains the cpufreq governor may wish to immediately modify the frequency of both the source and destination CPUs of the migrating task. A tunable is provided to establish what size task is considered "significant" enough to warrant notifying cpufreq. Also fix a bug that would cause load to not be accounted properly during wakeup migrations. Change-Id: Ie8f6b1cc4d43a602840dac18590b42a81327c95a Signed-off-by: Steve Muckle <smuckle@codeaurora.org> [rameezmustafa@codeaurora.org: Add double rq locking for set_task_cpu()] Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> 2014-05-06 18:05:50 -07:00			`struct cpumask freq_domain_cpumask;`
sched: Consolidate hmp stats into their own struct Key hmp stats (nr_big_tasks, nr_small_tasks and cumulative_runnable_average) are currently maintained per-cpu in 'struct rq'. Merge those stats in their own structure (struct hmp_sched_stats) and modify impacted functions to deal with the newly introduced structure. This cleanup is required for a subsequent patch which fixes various issues with use of CFS_BANDWIDTH feature in HMP scheduler. Change-Id: Ieffc10a3b82a102f561331bc385d042c15a33998 Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org> [rameezmustafa@codeaurora.org: Port to msm-3.18] Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> [joonwoop@codeaurora.org: fixed conflict in __update_load_avg().] Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> 2015-01-16 11:27:31 +05:30			`struct hmp_sched_stats hmp_stats;`

sched: Move CPU cstate tracking under CONFIG_SCHED_HMP While tracking C-states makes sense under CONFIG_SMP as well, cstate information is currently unused under CONFIG_SMP. Move it under CONFIG_SCHED_HMP for now since that is the only place it is relevant at the moment. Change-Id: Ifc5812cfe14ebf2b4d447100dcd87f02ab29ff7a Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> 2016-07-28 11:22:08 -07:00			`int cstate, wakeup_latency, wakeup_energy;`
sched: window-stats: synchronize windows across cpus Synchronizing windows across cpus for task load measurements simplifies cpu busy time accounting during migrations. For task migrations, its usage in current window can be carried over to its new cpu. This lets cpufreq governor see a correct picture of cpu busy time that is not affected by migrations. This patch lines up windows across cpus. One of the cpu, sync_cpu, serves as a reference for all others. During bootup sync_cpu would initialize its window_start (from its sched_clock()). Other cpus will synchronize their window_start in reference to sync_cpu. This patch assumes synchronous sched_clock() across cpus and may need some change to address architectures which do not provide such synchronized sched_clock(). Change-Id: I13381389a72f5f9f85cc2446401d493a55c78ab7 Signed-off-by: Steve Muckle <smuckle@codeaurora.org> Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org> 2014-04-29 12:44:43 -07:00			`u64 window_start;`
sched: Consolidate hmp stats into their own struct Key hmp stats (nr_big_tasks, nr_small_tasks and cumulative_runnable_average) are currently maintained per-cpu in 'struct rq'. Merge those stats in their own structure (struct hmp_sched_stats) and modify impacted functions to deal with the newly introduced structure. This cleanup is required for a subsequent patch which fixes various issues with use of CFS_BANDWIDTH feature in HMP scheduler. Change-Id: Ieffc10a3b82a102f561331bc385d042c15a33998 Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org> [rameezmustafa@codeaurora.org: Port to msm-3.18] Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> [joonwoop@codeaurora.org: fixed conflict in __update_load_avg().] Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> 2015-01-16 11:27:31 +05:30			`unsigned long hmp_flags;`
sched: window-stats: Add aggregated runqueue windowed stats Add counters per-cpu to track its busy time in the latest window and one previous to that. This would be needed to track accurate busy time per-cpu that accounts for migrations. Basically once a task migrates, its execution time in current window is migrated as well to new cpu. The idle task's runtime is not accounted since it should not count towards runqueue busy time. Change-Id: I4014dd686f95dbbfaa4274269bc36ed716573421 Signed-off-by: Steve Muckle <smuckle@codeaurora.org> Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org> 2014-04-29 14:01:50 -07:00
sched: track soft/hard irqload per-RQ with decaying avg The scheduler currently ignores irq activity when deciding which CPUs to place tasks on. If a CPU is getting hammered with IRQ activity but has no tasks it will look attractive to the scheduler as it will not be in a low power mode. Track irqload with a decaying average. This quantity can be used in the task placement logic to avoid CPUs which are under high irqload. The decay factor is 3/4. Note that with this algorithm the tracked irqload quantity will be higher than the actual irq time observed in any single window. Some sample outcomes with steady irqloads per 10ms window and the 3/4 decay factor (irqload of 10 is used as a threshold in a subsequent patch): irqload per window load value asymptote # windows to > 10 2ms 8 n/a 3ms 12 7 4ms 16 4 5ms 20 3 Of course irqload will not be constant in each window, these are just given as simple examples. Change-Id: I9dba049f5dfdcecc04339f727c8dd4ff554e01a5 Signed-off-by: Steve Muckle <smuckle@codeaurora.org> 2014-11-13 13:01:31 -08:00			`u64 cur_irqload;`
			`u64 avg_irqload;`
			`u64 irqload_ts;`
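The sample table in the irqload commit message above is consistent with a simple recurrence in which the previous average is scaled by 3/4 and the current window's irq time is added. The sketch below assumes that recurrence and works in microseconds to avoid integer truncation at small values; the function names and the 1000-window cap are hypothetical, and the real accounting code may differ in detail.

```c
#include <stdint.h>

#define IRQLOAD_DECAY_NUM 3	/* decay factor 3/4, as stated in the commit */
#define IRQLOAD_DECAY_DEN 4

/* One window step: decay the running value, then add this window's irq time. */
uint64_t irqload_next(uint64_t avg_us, uint64_t cur_us)
{
	return avg_us * IRQLOAD_DECAY_NUM / IRQLOAD_DECAY_DEN + cur_us;
}

/*
 * Number of windows of steady irq time needed for the tracked value to
 * exceed 'threshold_us', or -1 if it never does within 1000 windows
 * (the value converges towards 4 * per_window_us).
 */
int windows_to_exceed(uint64_t per_window_us, uint64_t threshold_us)
{
	uint64_t avg = 0;

	for (int n = 1; n <= 1000; n++) {
		avg = irqload_next(avg, per_window_us);
		if (avg > threshold_us)
			return n;
	}
	return -1;
}
```

With a 10 ms threshold, steady loads of 3 ms, 4 ms and 5 ms per window cross it after 7, 4 and 3 windows respectively, and 2 ms never does, matching the table in the commit message.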
sched: Add tunables for static cpu and cluster cost Add per-cpu tunable to set the extra cost to use a CPU that is idle. Add the same for a cluster. Change-Id: I4aa53f3c42c963df7abc7480980f747f0413d389 Signed-off-by: Olav Haugan <ohaugan@codeaurora.org> [joonwoop@codeaurora.org: omitted changes for qhmp*.[c,h] stripped out CONFIG_SCHED_QHMP in drivers/base/cpu.c and include/linux/sched.h] Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> 2015-08-10 16:41:44 -07:00			`unsigned int static_cpu_pwr_cost;`
sched: Notify cpufreq governor early about potential big tasks Tasks that are on the runqueue continuously for a certain amount of time have the potential to be big tasks at the end of the window in which they are runnable. In such scenarios ramping the CPU frequency early can boost performance rather than waiting till the end of a window for the governor to query load. Notify the governor early at every tick when a task has been observed to execute beyond some percentage of the tick period. The threshold beyond which a task is eligible for early detection can be changed via the tunable sched_early_detection_duration. The feature itself is enabled only when scheduler boost is in effect. Change-Id: I528b72bbc79a55b4593d1b8ab45450411c6d70f3 Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> [joonwoop@codeaurora.org: fixed conflict in scheduler_tick() in kernel/sched/core.c. fixed minor conflicts in include/linux/sched.h, include/linux/sched/sysctl.h and kernel/sysctl.c due to CONFIG_SCHED_QHMP.] Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> 2015-09-15 12:17:51 -07:00			`struct task_struct *ed_task;`
sched: preserve CPU cycle counter in rq Preserve cycle counter in rq in preparation for wait time accounting while CPU idle fix. Change-Id: I469263c90e12f39bb36bde5ed26298b7c1c77597 Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> 2016-04-28 15:22:12 -07:00			`struct cpu_cycle cc;`
sched: Aggregate for frequency Related threads in a group could execute on different CPUs and hence present a split-demand picture to cpufreq governor. IOW the governor fails to see the net cpu demand of all related threads in a given window if the threads's execution were to be split across CPUs. That could result in sub-optimal frequency chosen in comparison to the ideal frequency at which the aggregate work (taken up by related threads) needs to be run. This patch aggregates cpu execution stats in a window for all related threads in a group. This helps present cpu busy time to governor as if all related threads were part of the same thread and thus help select the right frequency required by related threads. This aggregation is done per-cluster. Change-Id: I71e6047620066323721c6d542034ddd4b2950e7f Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org> Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> [joonwoop@codeaurora.org: Fixed notify_migration() to hold rcu read lock as this version of Linux doesn't hold p->pi_lock when the function gets called while keeping use of rcu_access_pointer() since we never dereference return value.] Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> 2015-05-12 15:01:15 +05:30			`u64 old_busy_time, old_busy_time_group;`
sched: Add separate load tracking histogram to predict loads Current window based load tracking only saves history for five windows. A historically heavy task's heavy load will be completely forgotten after five windows of light load. Even before the five window expires, a heavy task wakes up on same CPU it used to run won't trigger any frequency change until end of the window. It would starve for the entire window. It also adds one "small" load window to history because it's accumulating load at a low frequency, further reducing the tracked load for this heavy task. Ideally, scheduler should be able to identify such tasks and notify governor to increase frequency immediately after it wakes up. Add a histogram for each task to track a much longer load history. A prediction will be made based on runtime of previous or current window, histogram data and load tracked in recent windows. Prediction of all tasks that is currently running or runnable on a CPU is aggregated and reported to CPUFreq governor in sched_get_cpus_busy(). sched_get_cpus_busy() now returns predicted busy time in addition to previous window busy time and new task busy time, scaled to the CPU maximum possible frequency. Tunables: - /proc/sys/kernel/sched_gov_alert_freq (KHz) This tunable can be used to further filter the notifications. Frequency alert notification is sent only when the predicted load exceeds previous window load by sched_gov_alert_freq converted to load. Change-Id: If29098cd2c5499163ceaff18668639db76ee8504 Suggested-by: Saravana Kannan <skannan@codeaurora.org> Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org> Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> Signed-off-by: Junjie Wu <junjiew@codeaurora.org> [joonwoop@codeaurora.org: fixed merge conflicts around __migrate_task() and removed changes for CONFIG_SCHED_QHMP.] 2015-06-08 09:08:47 +05:30			`u64 old_estimated_time;`
sched: window-stats: 64-bit type for curr/prev_runnable_sum Expand rq->curr_runnable_sum and rq->prev_runnable_sum to be 64-bit counters as otherwise they can easily overflow when a cpu has many tasks. Change-Id: I68ab2658ac6a3174ddb395888ecd6bf70ca70473 Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org> 2014-08-06 15:29:58 +05:30			`u64 curr_runnable_sum;`
			`u64 prev_runnable_sum;`
sched: account new task load so that governor can apply different policy Account amount of load contributed by new tasks within CPU load so that governor can apply different policy when CPU is loaded by new tasks. To be able to distinguish new task load, a new tunable sched_new_task_windows is also introduced. The tunable defines tasks as new when they have been active for fewer than the configured number of windows. Change-Id: I2e2e62e4103882f7362154b792ab978b181b9f59 Suggested-by: Saravana Kannan <skannan@codeaurora.org> [joonwoop@codeaurora.org: omitted changes for drivers/cpufreq/cpufreq_interactive.c. cpufreq changes need to be applied separately later. fixed conflict in include/linux/sched.h and include/linux/sched/sysctl.h. omitted changes for qhmp_core.c] Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> 2015-09-15 09:35:53 -07:00			`u64 nt_curr_runnable_sum;`
			`u64 nt_prev_runnable_sum;`
sched: maintain group busy time counters in runqueue There is no advantage of tracking busy time counters per related thread group. We need busy time across all groups for either a CPU or a frequency domain. Hence maintain group busy time counters in the runqueue itself. When CPU window is rolled over, the group busy counters are also rolled over. This eliminates the overhead of individual group's window_start maintenance. As we are preallocating related thread group now, this patch saves 40 * nr_cpu_ids * (nr_grp - 1) bytes memory. Change-Id: Ieaaccea483b377f54ea1761e6939ee23a78a5e9c Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org> 2017-01-09 13:56:33 +05:30			`struct group_cpu_time grp_time;`
sched: Add the mechanics of top task tracking for frequency guidance The previous patches in this rewrite of scheduler guided frequency selection reintroduces the part-picture problem that we addressed in our initial implementation. In that, when tasks migrate across CPUs within a cluster, we end up losing the complete picture of the sequential nature of the workload. This patch aims to solve that problem slightly differently. We track the top task on every CPU within a window. Top task is defined as the task that runs the most in a given window. This enhances our ability to detect the sequential nature of workloads. A single migrating task executing for an entire window will cause 100% load to be reported for frequency guidance instead of the maximum footprint left on any individual CPU in the task's trail. There are cases, that this new approach does not address. Namely, cases where the sum of two or more tasks accurately reflects the true sequential nature of the workload. Future optimizations might aim to tackle that problem. To track top tasks, we first realize that there is no strict need to maintain the task struct itself as long as we know the load exerted by the top task. We also realize that to maintain top tasks on every CPU we have to track the execution of every single task that runs during the window. The load associated with a task needs to be migrated when the task migrates from one CPU to another. When the top task migrates away, we need to locate the second top task and so on. Given the above realizations, we use hashmaps to track top task load both for the current and the previous window. This hashmap is implemented as an array of fixed size. The key of the hashmap is given by task_execution_time_in_a_window / array_size. The size of the array (number of buckets in the hashmap) dictate the load granularity of each bucket. The value stored in each bucket is a refcount of all the tasks that executed long enough to be in that bucket. 
This approach has a few benefits. Firstly, any top task stats update now take O(1) time. While task migration is also O(1), it does still involve going through up to the size of the array to find the second top task. Further patches will aim to optimize this behavior. Secondly, and more importantly, not having to store the task struct itself saves a lot of memory usage in that 1) there is no need to retrieve task structs later causing cache misses and 2) we don't have to unnecessarily hold up task memory for up to 2 full windows by calling get_task_struct() after a task exits. Change-Id: I004dba474f41590db7d3f40d9deafe86e71359ac Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> 2016-05-31 16:40:45 -07:00			`struct load_subtractions load_subs[NUM_TRACKED_WINDOWS];`
sched: Optimize the next top task search logic upon task migration find_next_top_index() is responsible for finding the second top task on a CPU when the top task migrates away from that CPU. This operation is expensive as we need to iterate the entire array of top tasks to find the second top task. Optimize this by introducing bitmaps for tracking top task indices. There are two bitmaps; one for the previous window and one for the current window. Each bit in a bitmap tracks whether the corresponding bucket in the top task hashmap has a non zero refcount. The bit is set when the refcount becomes non zero and is cleared when it becomes zero. Finding the second top task upon migration is then simply a matter of finding the highest set bit in the bitmap. Change-Id: Ibafaf66eed756b0328704dfaa89c17ab0d84e359 Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> 2016-06-07 15:18:37 -07:00			`DECLARE_BITMAP_ARRAY(top_tasks_bitmap,`
			`NUM_TRACKED_WINDOWS, NUM_LOAD_INDICES);`
sched: Add the mechanics of top task tracking for frequency guidance The previous patches in this rewrite of scheduler guided frequency selection reintroduces the part-picture problem that we addressed in our initial implementation. In that, when tasks migrate across CPUs within a cluster, we end up losing the complete picture of the sequential nature of the workload. This patch aims to solve that problem slightly differently. We track the top task on every CPU within a window. Top task is defined as the task that runs the most in a given window. This enhances our ability to detect the sequential nature of workloads. A single migrating task executing for an entire window will cause 100% load to be reported for frequency guidance instead of the maximum footprint left on any individual CPU in the task's trail. There are cases, that this new approach does not address. Namely, cases where the sum of two or more tasks accurately reflects the true sequential nature of the workload. Future optimizations might aim to tackle that problem. To track top tasks, we first realize that there is no strict need to maintain the task struct itself as long as we know the load exerted by the top task. We also realize that to maintain top tasks on every CPU we have to track the execution of every single task that runs during the window. The load associated with a task needs to be migrated when the task migrates from one CPU to another. When the top task migrates away, we need to locate the second top task and so on. Given the above realizations, we use hashmaps to track top task load both for the current and the previous window. This hashmap is implemented as an array of fixed size. The key of the hashmap is given by task_execution_time_in_a_window / array_size. The size of the array (number of buckets in the hashmap) dictate the load granularity of each bucket. The value stored in each bucket is a refcount of all the tasks that executed long enough to be in that bucket. 
This approach has a few benefits. Firstly, any top task stats update now take O(1) time. While task migration is also O(1), it does still involve going through up to the size of the array to find the second top task. Further patches will aim to optimize this behavior. Secondly, and more importantly, not having to store the task struct itself saves a lot of memory usage in that 1) there is no need to retrieve task structs later causing cache misses and 2) we don't have to unnecessarily hold up task memory for up to 2 full windows by calling get_task_struct() after a task exits. Change-Id: I004dba474f41590db7d3f40d9deafe86e71359ac Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> 2016-05-31 16:40:45 -07:00			`u8 *top_tasks[NUM_TRACKED_WINDOWS];`
			`u8 curr_table;`
			`int prev_top;`
			`int curr_top;`
sched: Introduce CONFIG_SCHED_FREQ_INPUT Introduce a compile time flag to enable scheduler guidance of frequency selection. This flag is also used to turn on or off window-based load stats feature. Having a compile time flag will let some platforms avoid any overhead that may be present with this scheduler feature. Change-Id: Id8dec9839f90dcac82f58ef7e2bd0ccd0b6bd16c Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org> [rameezmustafa@codeaurora.org]: Port to msm-3.18] Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> [joonwoop@codeaurora.org: fixed minor conflict around sysctl_timer_migration.] Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> 2014-03-29 16:56:45 -07:00			`#endif`
sched: Make scheduler aware of cpu frequency state Capacity of a cpu (how much performance it can deliver) is partly determined by its frequency (P) state, both current frequency as well as max frequency it can reach. Knowing frequency state of cpus will help scheduler optimize various functions such as tracking every task's cpu demand and placing tasks on various cpus. This patch has scheduler registering for cpufreq notifications to become aware of cpu's frequency state. Subsequent patches will make use of derived information for various purposes, such as task's scaled load (cpu demand) accounting and task placement. Change-Id: I376dffa1e7f3f47d0496cd7e6ef8b5642ab79016 Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org> [joonwoop@codeaurora.org: fixed minor conflict in kernel/sched/core.c.] Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> 2013-12-12 17:06:11 -08:00
sched: restore discarded ifdef CONFIG_SCHED_WALT code Code closed in ifdef CONFIG_SCHED_WALT blocks is not used in msm-4.4 builds, hence in order to be as much as closer to upstream and subsequently to have less merge conflicts in the future, let's restore this code. Restore below CONFIG_SCHED_WALT changes in file [1]: be832f6 sched: walt: Leverage existing ^^^^^^^ Discarded in dbad9b8. efb86bd sched: Introduce Window Assisted Load Tracking (WALT) ^^^^^^^ Restore only the block, which is modified by be832f6. Discarded in efbe378. dbad9b8 Merge android-4.4@89074de (v4.4.94) into msm-4.4 efbe378 Merge branch 'v4.4-16.09-android-tmp' into lsk-v4.4-16.09-android [1] kernel/sched/sched.h Change-Id: Ifd7e230b3b47dde61abf2472f092ff78d80b7427 Signed-off-by: Blagovest Kolenichev <bkolenichev@codeaurora.org> 2017-11-06 15:07:22 -08:00			`#ifdef CONFIG_SCHED_WALT`
			`u64 cumulative_runnable_avg;`
			`u64 window_start;`
			`u64 curr_runnable_sum;`
			`u64 prev_runnable_sum;`
			`u64 nt_curr_runnable_sum;`
			`u64 nt_prev_runnable_sum;`
			`u64 cur_irqload;`
			`u64 avg_irqload;`
			`u64 irqload_ts;`
sched: WALT: account cumulative window demand Energy cost estimation has been a long lasting challenge for WALT because WALT guides CPU frequency based on the CPU utilization of previous window. Consequently it's not possible to know newly waking-up task's energy cost until WALT's end of the current window. The WALT already tracks 'Previous Runnable Sum' (prev_runnable_sum) and 'Cumulative Runnable Average' (cr_avg). They are designed for CPU frequency guidance and task placement but unfortunately both are not suitable for the energy cost estimation. It's because using prev_runnable_sum for energy cost calculation would make us to account CPU and task's energy solely based on activity in the previous window so for example, any task didn't have an activity in the previous window will be accounted as a 'zero energy cost' task. Energy estimation with cr_avg is what energy_diff() relies on at present. However cr_avg can only represent instantaneous picture of energy cost thus for example, if a CPU was fully occupied for an entire WALT window and became idle just before window boundary, and if there is a wake-up, energy_diff() accounts that CPU is a 'zero energy cost' CPU. As a result, introduce a new accounting unit 'Cumulative Window Demand'. The cumulative window demand tracks all the tasks' demands have seen in current window which is neither instantaneous nor actual execution time. Because task demand represents estimated scaled execution time when the task runs a full window, accumulation of all the demands represents predicted CPU load at the end of window. Thus we can estimate CPU's frequency at the end of current WALT window with the cumulative window demand. The use of prev_runnable_sum for the CPU frequency guidance and cr_avg for the task placement have not changed and these are going to be used for both purpose while this patch aims to add an additional statistics. 
Change-Id: I9908c77ead9973a26dea2b36c001c2baf944d4f5 Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> 2017-02-03 11:15:31 -08:00			`u64 cum_window_demand;`
			`#endif /* CONFIG_SCHED_WALT */`

			`#ifdef CONFIG_IRQ_TIME_ACCOUNTING`
			`u64 prev_irq_time;`
			`#endif`
			`#ifdef CONFIG_PARAVIRT`
			`u64 prev_steal_time;`
			`#endif`
			`#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING`
			`u64 prev_steal_time_rq;`
			`#endif`

			`/* calc_load related fields */`
			`unsigned long calc_load_update;`
			`long calc_load_active;`

			`#ifdef CONFIG_SCHED_HRTICK`
			`#ifdef CONFIG_SMP`
			`int hrtick_csd_pending;`
			`struct call_single_data hrtick_csd;`
			`#endif`
			`struct hrtimer hrtick_timer;`
			`#endif`

			`#ifdef CONFIG_SCHEDSTATS`
			`/* latency stats */`
			`struct sched_info rq_sched_info;`
			`unsigned long long rq_cpu_time;`
			`/* could above be rq->cfs_rq.exec_clock + rq->rt_rq.rt_runtime ? */`

			`/* sys_sched_yield() stats */`
			`unsigned int yld_count;`

			`/* schedule() stats */`
			`unsigned int sched_count;`
			`unsigned int sched_goidle;`

			`/* try_to_wake_up() stats */`
			`unsigned int ttwu_count;`
			`unsigned int ttwu_local;`
schedstats/eas: guard properly to avoid breaking non-smp schedstats users Add appropriate #ifdef guards to ensure the smp-only easstats structs are not used when smp is not enabled. Arnd got a report from buildbot, analysed it, and pointed out exactly what the issue was. Reported-by: "Arnd Bergmann" <arnd@arndb.de> Suggested-by: "Arnd Bergmann" <arnd@arndb.de> Fixes: 4b85765a3dd9 ("sched/fair: Add eas (& cas) specific rq, sd and task stats") Signed-off-by: Chris Redpath <chris.redpath@arm.com> Change-Id: I60554dea20137f6774db3f59b4afd40a06554cfc 2017-06-03 15:03:03 +01:00			`#ifdef CONFIG_SMP`
sched/fair: Add eas (& cas) specific rq, sd and task stats The statistic counter are placed in the eas (& cas) wakeup path. Each of them has one representation for the runqueue (rq), the sched_domain (sd) and the task. A task counter is always incremented. A rq counter is always incremented for the rq the scheduler is currently running on. A sd counter is only incremented if a relation to a sd exists. The counters are exposed: (1) In /proc/schedstat for rq's and sd's: $ cat /proc/schedstat ... cpu0 71422 0 2321254 ... eas 44144 0 0 19446 0 24698 568435 51621 156932 133 222011 17459 120279 516814 83 0 156962 359235 176439 139981 <- runqueue for cpu0 ... domain0 3 42430 42331 ... eas 0 0 0 14200 0 0 0 0 0 0 0 0 0 0 0 0 0 0 66355 0 <- MC sched domain for cpu0 ... The per-cpu eas vector has the following elements: sis_attempts sis_idle sis_cache_affine sis_suff_cap sis_idle_cpu sis_count \|\| secb_attempts secb_sync secb_idle_bt secb_insuff_cap secb_no_nrg_sav secb_nrg_sav secb_count \|\| fbt_attempts fbt_no_cpu fbt_no_sd fbt_pref_idle fbt_count \|\| cas_attempts cas_count The following relations exist between these counters (from cpu0 eas vector above): sis_attempts = sis_idle + sis_cache_affine + sis_suff_cap + sis_idle_cpu + sis_count 44144 = 0 + 0 + 19446 + 0 + 24698 secb_attempts = secb_sync + secb_idle_bt + secb_insuff_cap + secb_no_nrg_sav + secb_nrg_sav + secb_count 568435 = 51621 + 156932 + 133 + 222011 + 17459 + 120279 fbt_attempts = fbt_no_cpu + fbt_no_sd + fbt_pref_idle + fbt_count + (return -1) 516814 = 83 + 0 + 156962 + 359235 + (534) cas_attempts = cas_count + (return -1 or smp_processor_id()) 176439 = 139981 + (36458) (2) In /proc/$PROCESS_PID/task/$TASK_PID/sched for a task. example: main thread of system_server $ cat /proc/1083/task/1083/sched ... 
se.statistics.nr_wakeups_sis_attempts : 945 se.statistics.nr_wakeups_sis_idle : 0 se.statistics.nr_wakeups_sis_cache_affine : 0 se.statistics.nr_wakeups_sis_suff_cap : 219 se.statistics.nr_wakeups_sis_idle_cpu : 0 se.statistics.nr_wakeups_sis_count : 726 se.statistics.nr_wakeups_secb_attempts : 10376 se.statistics.nr_wakeups_secb_sync : 1462 se.statistics.nr_wakeups_secb_idle_bt : 6984 se.statistics.nr_wakeups_secb_insuff_cap : 3 se.statistics.nr_wakeups_secb_no_nrg_sav : 927 se.statistics.nr_wakeups_secb_nrg_sav : 206 se.statistics.nr_wakeups_secb_count : 794 se.statistics.nr_wakeups_fbt_attempts : 8914 se.statistics.nr_wakeups_fbt_no_cpu : 0 se.statistics.nr_wakeups_fbt_no_sd : 0 se.statistics.nr_wakeups_fbt_pref_idle : 6987 se.statistics.nr_wakeups_fbt_count : 1554 se.statistics.nr_wakeups_cas_attempts : 3107 se.statistics.nr_wakeups_cas_count : 1195 ... The same relation between the counters as in the per-cpu case apply. Change-Id: Ie7d01267c78a3f41f60a3ef52917d5a5d463f195 Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Chris Redpath <chris.redpath@arm.com> 2017-03-22 18:23:13 +00:00			`struct eas_stats eas_stats;`
			`#endif`
			`#endif`

			`#ifdef CONFIG_SMP`
			`struct llist_head wake_list;`
			`#endif`
sched: Let the scheduler see CPU idle states When the cpu enters idle, it stores the cpuidle state pointer in its struct rq instance which in turn could be used to make a better decision when balancing tasks. As soon as the cpu exits its idle state, the struct rq reference is cleared. There are a couple of situations where the idle state pointer could be changed while it is being consulted: 1. For x86/acpi with dynamic c-states, when a laptop switches from battery to AC that could result on removing the deeper idle state. The acpi driver triggers: 'acpi_processor_cst_has_changed' 'cpuidle_pause_and_lock' 'cpuidle_uninstall_idle_handler' 'kick_all_cpus_sync'. All cpus will exit their idle state and the pointed object will be set to NULL. 2. The cpuidle driver is unloaded. Logically that could happen but not in practice because the drivers are always compiled in and 95% of them are not coded to unregister themselves. In any case, the unloading code must call 'cpuidle_unregister_device', that calls 'cpuidle_pause_and_lock' leading to 'kick_all_cpus_sync' as mentioned above. A race can happen if we use the pointer and then one of these two scenarios occurs at the same moment. In order to be safe, the idle state pointer stored in the rq must be used inside a rcu_read_lock section where we are protected with the 'rcu_barrier' in the 'cpuidle_uninstall_idle_handler' function. The idle_get_state() and idle_put_state() accessors should be used to that effect. Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Signed-off-by: Nicolas Pitre <nico@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: linux-pm@vger.kernel.org Cc: linaro-kernel@lists.linaro.org Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/n/tip-@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> 2014-09-04 11:32:09 -04:00
			`#ifdef CONFIG_CPU_IDLE`
			`/* Must be inspected within a rcu lock section */`
			`struct cpuidle_state *idle_state;`
sched, cpuidle: Track cpuidle state index in the scheduler The idle-state of each cpu is currently pointed to by rq->idle_state but there isn't any information in the struct cpuidle_state that can used to look up the idle-state energy model data stored in struct sched_group_energy. For this purpose is necessary to store the idle state index as well. Ideally, the idle-state data should be unified. cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> 2015-01-27 13:48:07 +00:00			`int idle_state_idx;`
			`#endif`
			`};`
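The top-task commit message above describes a fixed-size array used as a hashmap: each bucket holds a refcount of the tasks whose execution time in the window falls into that bucket's range, and the top task's load is read off the highest non-empty bucket. A minimal user-space sketch of that idea follows. It is not the kernel implementation: `NUM_BUCKETS`, `WINDOW_NS`, `struct top_task_table` and the helper names are hypothetical, and the bucket index is computed here as load divided by the per-bucket granularity, which is an assumed reading of the commit text.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sizes chosen for illustration only. */
#define NUM_BUCKETS 8
#define WINDOW_NS   8000u
#define BUCKET_NS   (WINDOW_NS / NUM_BUCKETS)	/* load granularity per bucket */

/* One table per tracked window; loosely models rq->top_tasks[i] and rq->curr_top. */
struct top_task_table {
	uint8_t buckets[NUM_BUCKETS];	/* refcount of tasks per load bucket */
	int top;			/* highest non-empty bucket, -1 if none */
};

static void table_init(struct top_task_table *t)
{
	memset(t->buckets, 0, sizeof(t->buckets));
	t->top = -1;
}

/* Map a task's execution time within the window to a bucket index. */
static int load_to_bucket(uint32_t load_ns)
{
	uint32_t idx = load_ns / BUCKET_NS;

	return idx >= NUM_BUCKETS ? NUM_BUCKETS - 1 : (int)idx;
}

/* O(1): account a task with the given window load. */
static void table_add(struct top_task_table *t, uint32_t load_ns)
{
	int b = load_to_bucket(load_ns);

	t->buckets[b]++;
	if (b > t->top)
		t->top = b;
}

/* Drop a task (e.g. on migration); may scan downward for the next top task. */
static void table_remove(struct top_task_table *t, uint32_t load_ns)
{
	int b = load_to_bucket(load_ns);

	assert(t->buckets[b] > 0);
	t->buckets[b]--;
	if (b == t->top) {
		while (t->top >= 0 && t->buckets[t->top] == 0)
			t->top--;
	}
}
```

Adding a task is constant time, while removing the current top task walks down the array until it finds the next non-empty bucket. That walk is the "going through up to the size of the array" cost the commit message mentions for migration.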

			`static inline int cpu_of(struct rq *rq)`
			`{`
			`#ifdef CONFIG_SMP`
			`return rq->cpu;`
			`#else`
			`return 0;`
			`#endif`
			`}`

sched: Match declaration with definition Match the declaration of runqueues with the definition. Signed-off-by: Pranith Kumar <bobby.prani@gmail.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1407950893-32731-1-git-send-email-bobby.prani@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> 2014-08-13 13:28:12 -04:00			`DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);`
sched: Only queue remote wakeups when crossing cache boundaries Mike reported a 13% drop in netperf TCP_RR performance due to the new remote wakeup code. Suresh too noticed some performance issues with it. Reducing the IPIs to only cross cache domains solves the observed performance issues. Reported-by: Suresh Siddha <suresh.b.siddha@intel.com> Reported-by: Mike Galbraith <efault@gmx.de> Acked-by: Suresh Siddha <suresh.b.siddha@intel.com> Acked-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Chris Mason <chris.mason@oracle.com> Cc: Dave Kleikamp <dave.kleikamp@oracle.com> Link: http://lkml.kernel.org/r/1323338531.17673.7.camel@twins Signed-off-by: Ingo Molnar <mingo@elte.hu> 2011-12-07 15:07:31 +01:00			`#define cpu_rq(cpu) (&per_cpu(runqueues, (cpu)))`
scheduler: Replace __get_cpu_var with this_cpu_ptr Convert all uses of __get_cpu_var for address calculation to use this_cpu_ptr instead. [Uses of __get_cpu_var with cpumask_var_t are no longer handled by this patch] Cc: Peter Zijlstra <peterz@infradead.org> Acked-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Tejun Heo <tj@kernel.org> 2014-08-17 12:30:27 -05:00			`#define this_rq() this_cpu_ptr(&runqueues)`
			`#define task_rq(p) cpu_rq(task_cpu(p))`
			`#define cpu_curr(cpu) (cpu_rq(cpu)->curr)`
			`#define raw_rq() raw_cpu_ptr(&runqueues)`
sched/core: Validate rq_clock*() serialization rq->clock{,_task} are serialized by rq->lock, verify this. One immediate fail is the usage in scale_rt_capability, so 'annotate' that for now, there's more 'funny' there. Maybe change rq->lock into a raw_seqlock_t? (Only 32-bit is affected) Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/20150105103554.361872747@infradead.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: umgwanakikbuti@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> 2015-01-05 11:18:10 +01:00			`static inline u64 __rq_clock_broken(struct rq *rq)`
			`{`
sched, timer: Convert usages of ACCESS_ONCE() in the scheduler to READ_ONCE()/WRITE_ONCE() ACCESS_ONCE doesn't work reliably on non-scalar types. This patch removes the rest of the existing usages of ACCESS_ONCE() in the scheduler, and use the new READ_ONCE() and WRITE_ONCE() APIs as appropriate. Signed-off-by: Jason Low <jason.low2@hp.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Waiman Long <Waiman.Long@hp.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Aswin Chandramouleeswaran <aswin@hp.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com> Cc: Scott J Norton <scott.norton@hp.com> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/1430251224-5764-2-git-send-email-jason.low2@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org> 2015-04-28 13:00:20 -07:00			`return READ_ONCE(rq->clock);`
			`}`

sched: Use an accessor to read the rq clock Read the runqueue clock through an accessor. This prepares for adding a debugging infrastructure to detect missing or redundant calls to update_rq_clock() between a scheduler's entry and exit point. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Li Zhong <zhong@linux.vnet.ibm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Paul Turner <pjt@google.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1365724262-20142-6-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> 2013-04-12 01:51:02 +02:00			`static inline u64 rq_clock(struct rq *rq)`
			`{`
			`lockdep_assert_held(&rq->lock);`
			`return rq->clock;`
			`}`

			`static inline u64 rq_clock_task(struct rq *rq)`
			`{`
			`lockdep_assert_held(&rq->lock);`
			`return rq->clock_task;`
			`}`

sched/core: Rework rq->clock update skips The original purpose of rq::skip_clock_update was to avoid 'costly' clock updates for back to back wakeup-preempt pairs. The big problem with it has always been that the rq variable is unaware of the context and causes indiscrimiate clock skips. Rework the entire thing and create a sense of context by only allowing schedule() to skip clock updates. (XXX can we measure the cost of the added store?) By ensuring only schedule can ever skip an update, we guarantee we're never more than 1 tick behind on the update. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: umgwanakikbuti@gmail.com Link: http://lkml.kernel.org/r/20150105103554.432381549@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org> 2015-01-05 11:18:11 +01:00			`#define RQCF_REQ_SKIP 0x01`
			`#define RQCF_ACT_SKIP 0x02`

			`static inline void rq_clock_skip_update(struct rq *rq, bool skip)`
			`{`
			`lockdep_assert_held(&rq->lock);`
			`if (skip)`
			`rq->clock_skip_update |= RQCF_REQ_SKIP;`
			`else`
			`rq->clock_skip_update &= ~RQCF_REQ_SKIP;`
			`}`
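The two `RQCF_*` flags above form a small two-stage protocol: callers may only request a skip, and, per the commit message, only schedule() is allowed to act on it. The user-space model below sketches one way this can work. Promoting the request with a left shift and clearing the field at the end mirror what the scheduler core of this era appears to do, but that is an assumption here; `struct rq_model` and the `model_*` helpers are hypothetical names, and no locking is modeled.

```c
#include <assert.h>
#include <stdbool.h>

#define RQCF_REQ_SKIP 0x01	/* a skip has been requested */
#define RQCF_ACT_SKIP 0x02	/* the skip is active for this schedule() pass */

/* Hypothetical stand-in for the two struct rq fields involved. */
struct rq_model {
	unsigned int clock_skip_update;
	unsigned long long clock;
};

/* Same logic as rq_clock_skip_update(), minus the lockdep assertion. */
static void model_request_skip(struct rq_model *rq, bool skip)
{
	if (skip)
		rq->clock_skip_update |= RQCF_REQ_SKIP;
	else
		rq->clock_skip_update &= ~RQCF_REQ_SKIP;
}

/* Update the clock unless a skip is active; returns true if it was updated. */
static bool model_update_clock(struct rq_model *rq, unsigned long long now)
{
	if (rq->clock_skip_update & RQCF_ACT_SKIP)
		return false;
	rq->clock = now;
	return true;
}

/* Assumed behaviour at schedule() entry: promote REQ (0x01) to ACT (0x02). */
static void model_schedule_begin(struct rq_model *rq)
{
	assert(!(rq->clock_skip_update & RQCF_ACT_SKIP));
	rq->clock_skip_update <<= 1;
}

/* Assumed behaviour at schedule() exit: drop any pending or active skip. */
static void model_schedule_end(struct rq_model *rq)
{
	rq->clock_skip_update = 0;
}
```

In this model a request on its own never suppresses an update. The skip only takes effect between `model_schedule_begin()` and `model_schedule_end()`, which is how the rework keeps arbitrary contexts from skipping clock updates.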

sched/numa: Export info needed for NUMA balancing on complex topologies Export some information that is necessary to do placement of tasks on systems with multi-level NUMA topologies. Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: mgorman@suse.de Cc: chegu_vinod@hp.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1413530994-9732-2-git-send-email-riel@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org> 2014-10-17 03:29:49 -04:00			`#ifdef CONFIG_NUMA`
sched/numa: Classify the NUMA topology of a system Smaller NUMA systems tend to have all NUMA nodes directly connected to each other. This includes the degenerate case of a system with just one node, ie. a non-NUMA system. Larger systems can have two kinds of NUMA topology, which affects how tasks and memory should be placed on the system. On glueless mesh systems, nodes that are not directly connected to each other will bounce traffic through intermediary nodes. Task groups can be run closer to each other by moving tasks from a node to an intermediary node between it and the task's preferred node. On NUMA systems with backplane controllers, the intermediary hops are incapable of running programs. This creates "islands" of nodes that are at an equal distance to anywhere else in the system. Each kind of topology requires a slightly different placement algorithm; this patch provides the mechanism to detect the kind of NUMA topology of a system. Signed-off-by: Rik van Riel <riel@redhat.com> Tested-by: Chegu Vinod <chegu_vinod@hp.com> [ Changed to use kernel/sched/sched.h ] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: mgorman@suse.de Cc: chegu_vinod@hp.com Link: http://lkml.kernel.org/r/1413530994-9732-3-git-send-email-riel@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org> 2014-10-17 03:29:50 -04:00			`enum numa_topology_type {`
			`NUMA_DIRECT,`
			`NUMA_GLUELESS_MESH,`
			`NUMA_BACKPLANE,`
			`};`
			`extern enum numa_topology_type sched_numa_topology_type;`
			`extern int sched_max_numa_distance;`
			`extern bool find_numa_distance(int distance);`
			`#endif`

sched/numa: Track NUMA hinting faults on per-node basis This patch tracks what nodes numa hinting faults were incurred on. This information is later used to schedule a task on the node storing the pages most frequently faulted by the task. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-20-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org> 2013-10-07 11:28:57 +01:00			`#ifdef CONFIG_NUMA_BALANCING`
sched: Refactor task_struct to use numa_faults instead of numa_* pointers This patch simplifies task_struct by removing the four numa_* pointers in the same array and replacing them with the array pointer. By doing this, on x86_64, the size of task_struct is reduced by 3 ulong pointers (24 bytes on x86_64). A new parameter is added to the task_faults_idx function so that it can return an index to the correct offset, corresponding with the old precalculated pointers. All of the code in sched/ that depended on task_faults_idx and numa_* was changed in order to match the new logic. Signed-off-by: Iulia Manda <iulia.manda21@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: mgorman@suse.de Cc: dave@stgolabs.net Cc: riel@redhat.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20141031001331.GA30662@winterfell Signed-off-by: Ingo Molnar <mingo@kernel.org> 2014-10-31 02:13:31 +02:00			`/* The regions in numa_faults array from task_struct */`
			`enum numa_faults_stats {`
			`NUMA_MEM = 0,`
			`NUMA_CPU,`
			`NUMA_MEMBUF,`
			`NUMA_CPUBUF`
			`};`
sched/numa: Avoid migrating tasks that are placed on their preferred node This patch classifies scheduler domains and runqueues into types depending the number of tasks that are about their NUMA placement and the number that are currently running on their preferred node. The types are regular: There are tasks running that do not care about their NUMA placement. remote: There are tasks running that care about their placement but are currently running on a node remote to their ideal placement all: No distinction To implement this the patch tracks the number of tasks that are optimally NUMA placed (rq->nr_preferred_running) and the number of tasks running that care about their placement (nr_numa_running). The load balancer uses this information to avoid migrating idea placed NUMA tasks as long as better options for load balancing exists. For example, it will not consider balancing between a group whose tasks are all perfectly placed and a group with remote tasks. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/1381141781-10992-56-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org> 2013-10-07 11:29:33 +01:00			`extern void sched_setnuma(struct task_struct *p, int node);`
sched/numa: Reschedule task on preferred NUMA node once selected A preferred node is selected based on the node the most NUMA hinting faults was incurred on. There is no guarantee that the task is running on that node at the time so this patch rescheules the task to run on the most idle CPU of the selected node when selected. This avoids waiting for the balancer to make a decision. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-25-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org> 2013-10-07 11:29:02 +01:00			`extern int migrate_task_to(struct task_struct *p, int cpu);`
sched/numa: Introduce migrate_swap() Use the new stop_two_cpus() to implement migrate_swap(), a function that flips two tasks between their respective cpus. I'm fairly sure there's a less crude way than employing the stop_two_cpus() method, but everything I tried either got horribly fragile and/or complex. So keep it simple for now. The notable detail is how we 'migrate' tasks that aren't runnable anymore. We'll make it appear like we migrated them before they went to sleep. The sole difference is the previous cpu in the wakeup path, so we override this. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Link: http://lkml.kernel.org/r/1381141781-10992-39-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org> 2013-10-07 11:29:16 +01:00			`extern int migrate_swap(struct task_struct *, struct task_struct *);`
sched/numa: Track NUMA hinting faults on per-node basis This patch tracks what nodes numa hinting faults were incurred on. This information is later used to schedule a task on the node storing the pages most frequently faulted by the task. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-20-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org> 2013-10-07 11:28:57 +01:00			`#endif /* CONFIG_NUMA_BALANCING */`
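The enum above names four regions inside a single `numa_faults` array. The following is a minimal user-space sketch of how a (region, node, private/shared) triple could map to one flat index. The layout (two counters per node, shared first and private second) and the names `faults_idx`, `faults_array_len` and `NR_FAULT_TYPES` are assumptions made for illustration and are not taken from this tree.

```c
#include <assert.h>
#include <stddef.h>

/* Mirrors the enum above: four regions in one flat array. */
enum numa_faults_stats { NUMA_MEM = 0, NUMA_CPU, NUMA_MEMBUF, NUMA_CPUBUF };

/* Assumption: two counters per node (shared = 0, private = 1). */
#define NR_FAULT_TYPES 2

/* Total number of slots for nr_nodes nodes across all four regions. */
static inline size_t faults_array_len(int nr_nodes)
{
	return (size_t)4 * nr_nodes * NR_FAULT_TYPES;
}

/* Hypothetical helper: flat index of (region s, node nid, priv). */
static inline size_t faults_idx(enum numa_faults_stats s, int nr_nodes,
				int nid, int priv)
{
	assert(nid >= 0 && nid < nr_nodes);
	assert(priv == 0 || priv == 1);
	return (size_t)NR_FAULT_TYPES * ((size_t)s * nr_nodes + nid) + priv;
}
```

With this layout each region is a contiguous slice, so a single allocation of `faults_array_len(nr_nodes)` counters can replace four separately tracked pointers.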

sched: Only queue remote wakeups when crossing cache boundaries Mike reported a 13% drop in netperf TCP_RR performance due to the new remote wakeup code. Suresh too noticed some performance issues with it. Reducing the IPIs to only cross cache domains solves the observed performance issues. Reported-by: Suresh Siddha <suresh.b.siddha@intel.com> Reported-by: Mike Galbraith <efault@gmx.de> Acked-by: Suresh Siddha <suresh.b.siddha@intel.com> Acked-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Chris Mason <chris.mason@oracle.com> Cc: Dave Kleikamp <dave.kleikamp@oracle.com> Link: http://lkml.kernel.org/r/1323338531.17673.7.camel@twins Signed-off-by: Ingo Molnar <mingo@elte.hu> 2011-12-07 15:07:31 +01:00			`#ifdef CONFIG_SMP`

sched: Replace post_schedule with a balance callback list Generalize the post_schedule() stuff into a balance callback list. This allows us to more easily use it outside of schedule() and cross sched_class. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: ktkhai@parallels.com Cc: rostedt@goodmis.org Cc: juri.lelli@gmail.com Cc: pang.xunlei@linaro.org Cc: oleg@redhat.com Cc: wanpeng.li@linux.intel.com Cc: umgwanakikbuti@gmail.com Link: http://lkml.kernel.org/r/20150611124742.424032725@infradead.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de> 2015-06-11 14:46:37 +02:00			`static inline void`
			`queue_balance_callback(struct rq *rq,`
			`struct callback_head *head,`
			`void (*func)(struct rq *rq))`
			`{`
			`lockdep_assert_held(&rq->lock);`

			`if (unlikely(head->next))`
			`return;`

			`head->func = (void (*)(struct callback_head *))func;`
			`head->next = rq->balance_callback;`
			`rq->balance_callback = head;`
			`}`
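The queueing pattern used by `queue_balance_callback()` can be tried outside the kernel. The sketch below is a user-space approximation under stated assumptions: `struct toy_rq`, `toy_queue_callback()`, `toy_run_callbacks()` and `toy_count()` are hypothetical names, locking is omitted, and the drain loop is one plausible way to consume the list rather than a copy of the kernel's code.

```c
#include <assert.h>
#include <stddef.h>

struct callback_head {
	struct callback_head *next;
	void (*func)(struct callback_head *head);
};

/* Hypothetical stand-in for the runqueue: only the list head and a counter. */
struct toy_rq {
	struct callback_head *balance_callback;
	int runs;	/* number of callbacks executed, for demonstration */
};

/* Push head onto the list unless its next pointer shows it is already queued. */
static inline void toy_queue_callback(struct toy_rq *rq,
				      struct callback_head *head,
				      void (*func)(struct toy_rq *rq))
{
	assert(rq && head && func);

	if (head->next)
		return;

	/* Stored under a generic type; cast back before calling. */
	head->func = (void (*)(struct callback_head *))func;
	head->next = rq->balance_callback;
	rq->balance_callback = head;
}

/* Detach the whole list, then run each callback exactly once. */
static inline void toy_run_callbacks(struct toy_rq *rq)
{
	struct callback_head *head = rq->balance_callback;

	rq->balance_callback = NULL;
	while (head) {
		struct callback_head *next = head->next;
		void (*func)(struct toy_rq *) =
			(void (*)(struct toy_rq *))head->func;

		head->next = NULL;	/* the head may be queued again later */
		func(rq);
		head = next;
	}
}

/* Example callback: just counts invocations. */
static inline void toy_count(struct toy_rq *rq)
{
	rq->runs++;
}
```

The function pointer is converted back to its original type before the call, which keeps the round trip through `struct callback_head` well defined in standard C.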

sched/idle: Optimize try-to-wake-up IPI [ This series reduces the number of IPIs on Andy's workload by something like 99%. It's down from many hundreds per second to very few. The basic idea behind this series is to make TIF_POLLING_NRFLAG be a reliable indication that the idle task is polling. Once that's done, the rest is reasonably straightforward. ] When enqueueing tasks on remote LLC domains, we send an IPI to do the work 'locally' and avoid bouncing all the cachelines over. However, when the remote CPU is idle (and polling, say x86 mwait), we don't need to send an IPI, we can simply kick the TIF word to wake it up and have the 'idle' loop do the work. So when _TIF_POLLING_NRFLAG is set, but _TIF_NEED_RESCHED is not (yet) set, set _TIF_NEED_RESCHED and avoid sending the IPI. Much-requested-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Peter Zijlstra <peterz@infradead.org> [Edited by Andy Lutomirski, but this is mostly Peter Zijlstra's code.] Signed-off-by: Andy Lutomirski <luto@amacapital.net> Cc: nicolas.pitre@linaro.org Cc: daniel.lezcano@linaro.org Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: umgwanakikbuti@gmail.com Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/ce06f8b02e7e337be63e97597fc4b248d3aa6f9b.1401902905.git.luto@amacapital.net Signed-off-by: Ingo Molnar <mingo@kernel.org> 2014-06-04 10:31:18 -07:00			`extern void sched_ttwu_pending(void);`

sched: Make separate sched.c translation units Since once needs to do something at conferences and fixing compile warnings doesn't actually require much if any attention I decided to break up the sched.c #include ".c" fest. This further modularizes the scheduler code. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-x0fcd3mnp8f9c99grcpewmhi@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu> 2011-10-25 10:00:11 +02:00			`#define rcu_dereference_check_sched_domain(p) \`
			`rcu_dereference_check((p), \`
			`lockdep_is_held(&sched_domains_mutex))`

			`/*`
			`* The domain tree (rq->sd) is protected by RCU's quiescent state transition.`
			`* See detach_destroy_domains: synchronize_sched for details.`
			`*`
			`* The domain tree of any CPU may only be accessed from within`
			`* preempt-disabled sections.`
			`*/`
			`#define for_each_domain(cpu, __sd) \`
sched: Only queue remote wakeups when crossing cache boundaries Mike reported a 13% drop in netperf TCP_RR performance due to the new remote wakeup code. Suresh too noticed some performance issues with it. Reducing the IPIs to only cross cache domains solves the observed performance issues. Reported-by: Suresh Siddha <suresh.b.siddha@intel.com> Reported-by: Mike Galbraith <efault@gmx.de> Acked-by: Suresh Siddha <suresh.b.siddha@intel.com> Acked-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Chris Mason <chris.mason@oracle.com> Cc: Dave Kleikamp <dave.kleikamp@oracle.com> Link: http://lkml.kernel.org/r/1323338531.17673.7.camel@twins Signed-off-by: Ingo Molnar <mingo@elte.hu> 2011-12-07 15:07:31 +01:00			`for (__sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd); \`
			`__sd; __sd = __sd->parent)`
sched: Make separate sched.c translation units Since once needs to do something at conferences and fixing compile warnings doesn't actually require much if any attention I decided to break up the sched.c #include ".c" fest. This further modularizes the scheduler code. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-x0fcd3mnp8f9c99grcpewmhi@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu> 2011-10-25 10:00:11 +02:00
sched: Clean up domain traversal in select_idle_sibling() Instead of going through the scheduler domain hierarchy multiple times (for giving priority to an idle core over an idle SMT sibling in a busy core), start with the highest scheduler domain with the SD_SHARE_PKG_RESOURCES flag and traverse the domain hierarchy down till we find an idle group. This cleanup also addresses an issue reported by Mike where the recent changes returned the busy thread even in the presence of an idle SMT sibling in single socket platforms. Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Tested-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1321556904.15339.25.camel@sbsiddha-desk.sc.intel.com Signed-off-by: Ingo Molnar <mingo@elte.hu> 2011-11-17 11:08:23 -08:00			`#define for_each_lower_domain(sd) for (; sd; sd = sd->child)`

sched: Only queue remote wakeups when crossing cache boundaries Mike reported a 13% drop in netperf TCP_RR performance due to the new remote wakeup code. Suresh too noticed some performance issues with it. Reducing the IPIs to only cross cache domains solves the observed performance issues. Reported-by: Suresh Siddha <suresh.b.siddha@intel.com> Reported-by: Mike Galbraith <efault@gmx.de> Acked-by: Suresh Siddha <suresh.b.siddha@intel.com> Acked-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Chris Mason <chris.mason@oracle.com> Cc: Dave Kleikamp <dave.kleikamp@oracle.com> Link: http://lkml.kernel.org/r/1323338531.17673.7.camel@twins Signed-off-by: Ingo Molnar <mingo@elte.hu> 2011-12-07 15:07:31 +01:00			`/**`
			`* highest_flag_domain - Return highest sched_domain containing flag.`
			`* @cpu: The cpu whose highest level of sched domain is to`
			`* be returned.`
			`* @flag: The flag to check for the highest sched_domain`
			`* for the given cpu.`
			`*`
			`* Returns the highest sched_domain of @cpu that has @flag set, walking up from`
			`* the base domain and stopping at the first level without it (NULL if none).`
			`*/`
			`static inline struct sched_domain *highest_flag_domain(int cpu, int flag)`
			`{`
			`struct sched_domain *sd, *hsd = NULL;`

			`for_each_domain(cpu, sd) {`
			`if (!(sd->flags & flag))`
			`break;`
			`hsd = sd;`
			`}`

			`return hsd;`
			`}`

sched/numa: Use a system-wide search to find swap/migration candidates This patch implements a system-wide search for swap/migration candidates based on total NUMA hinting faults. It has a balance limit, however it doesn't properly consider total node balance. In the old scheme a task selected a preferred node based on the highest number of private faults recorded on the node. In this scheme, the preferred node is based on the total number of faults. If the preferred node for a task changes then task_numa_migrate will search the whole system looking for tasks to swap with that would improve both the overall compute balance and minimise the expected number of remote NUMA hinting faults. Not there is no guarantee that the node the source task is placed on by task_numa_migrate() has any relationship to the newly selected task->numa_preferred_nid due to compute overloading. Signed-off-by: Mel Gorman <mgorman@suse.de> [ Do not swap with tasks that cannot run on source cpu] Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> [ Fixed compiler warning on UP. ] Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-40-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org> 2013-10-07 11:29:17 +01:00			`static inline struct sched_domain *lowest_flag_domain(int cpu, int flag)`
			`{`
			`struct sched_domain *sd;`

			`for_each_domain(cpu, sd) {`
			`if (sd->flags & flag)`
			`break;`
			`}`

			`return sd;`
			`}`
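The two helpers above walk the same parent chain but stop on opposite conditions. A small user-space sketch makes the difference visible; `struct toy_domain` and the `toy_*` functions are hypothetical names, and the walk starts from an explicit base pointer instead of `cpu_rq(cpu)->sd`, with no RCU involved.

```c
#include <assert.h>
#include <stddef.h>

struct toy_domain {
	struct toy_domain *parent;	/* next wider level, NULL at the top */
	int flags;
};

/* Last level of the unbroken run, starting at base, that has flag set. */
static inline struct toy_domain *
toy_highest_flag_domain(struct toy_domain *base, int flag)
{
	struct toy_domain *sd, *hsd = NULL;

	assert(flag != 0);
	for (sd = base; sd; sd = sd->parent) {
		if (!(sd->flags & flag))
			break;
		hsd = sd;
	}
	return hsd;
}

/* First level, walking up from base, that has flag set; NULL if none. */
static inline struct toy_domain *
toy_lowest_flag_domain(struct toy_domain *base, int flag)
{
	struct toy_domain *sd;

	assert(flag != 0);
	for (sd = base; sd; sd = sd->parent) {
		if (sd->flags & flag)
			break;
	}
	return sd;
}
```

Note that the "highest" variant returns NULL when the base level itself lacks the flag, even if a wider level has it, because the walk stops at the first miss.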

sched: Only queue remote wakeups when crossing cache boundaries Mike reported a 13% drop in netperf TCP_RR performance due to the new remote wakeup code. Suresh too noticed some performance issues with it. Reducing the IPIs to only cross cache domains solves the observed performance issues. Reported-by: Suresh Siddha <suresh.b.siddha@intel.com> Reported-by: Mike Galbraith <efault@gmx.de> Acked-by: Suresh Siddha <suresh.b.siddha@intel.com> Acked-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Chris Mason <chris.mason@oracle.com> Cc: Dave Kleikamp <dave.kleikamp@oracle.com> Link: http://lkml.kernel.org/r/1323338531.17673.7.camel@twins Signed-off-by: Ingo Molnar <mingo@elte.hu> 2011-12-07 15:07:31 +01:00			`DECLARE_PER_CPU(struct sched_domain *, sd_llc);`
sched: Micro-optimize the smart wake-affine logic Smart wake-affine is using node-size as the factor currently, but the overhead of the mask operation is high. Thus, this patch introduce the 'sd_llc_size' percpu variable, which will record the highest cache-share domain size, and make it to be the new factor, in order to reduce the overhead and make it more reasonable. Tested-by: Davidlohr Bueso <davidlohr.bueso@hp.com> Tested-by: Michael Wang <wangyun@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Michael Wang <wangyun@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Link: http://lkml.kernel.org/r/51D5008E.6030102@linux.vnet.ibm.com [ Tidied up the changelog. ] Signed-off-by: Ingo Molnar <mingo@kernel.org> 2013-07-04 12:56:46 +08:00			`DECLARE_PER_CPU(int, sd_llc_size);`
sched: Only queue remote wakeups when crossing cache boundaries Mike reported a 13% drop in netperf TCP_RR performance due to the new remote wakeup code. Suresh too noticed some performance issues with it. Reducing the IPIs to only cross cache domains solves the observed performance issues. Reported-by: Suresh Siddha <suresh.b.siddha@intel.com> Reported-by: Mike Galbraith <efault@gmx.de> Acked-by: Suresh Siddha <suresh.b.siddha@intel.com> Acked-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Chris Mason <chris.mason@oracle.com> Cc: Dave Kleikamp <dave.kleikamp@oracle.com> Link: http://lkml.kernel.org/r/1323338531.17673.7.camel@twins Signed-off-by: Ingo Molnar <mingo@elte.hu> 2011-12-07 15:07:31 +01:00			`DECLARE_PER_CPU(int, sd_llc_id);`
sched/numa: Use a system-wide search to find swap/migration candidates This patch implements a system-wide search for swap/migration candidates based on total NUMA hinting faults. It has a balance limit, however it doesn't properly consider total node balance. In the old scheme a task selected a preferred node based on the highest number of private faults recorded on the node. In this scheme, the preferred node is based on the total number of faults. If the preferred node for a task changes then task_numa_migrate will search the whole system looking for tasks to swap with that would improve both the overall compute balance and minimise the expected number of remote NUMA hinting faults. Not there is no guarantee that the node the source task is placed on by task_numa_migrate() has any relationship to the newly selected task->numa_preferred_nid due to compute overloading. Signed-off-by: Mel Gorman <mgorman@suse.de> [ Do not swap with tasks that cannot run on source cpu] Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> [ Fixed compiler warning on UP. ] Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-40-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org> 2013-10-07 11:29:17 +01:00			`DECLARE_PER_CPU(struct sched_domain *, sd_numa);`
sched: Remove unnecessary iteration over sched domains to update nr_busy_cpus nr_busy_cpus parameter is used by nohz_kick_needed() to find out the number of busy cpus in a sched domain which has SD_SHARE_PKG_RESOURCES flag set. Therefore instead of updating nr_busy_cpus at every level of sched domain, since it is irrelevant, we can update this parameter only at the parent domain of the sd which has this flag set. Introduce a per-cpu parameter sd_busy which represents this parent domain. In nohz_kick_needed() we directly query the nr_busy_cpus parameter associated with the groups of sd_busy. By associating sd_busy with the highest domain which has SD_SHARE_PKG_RESOURCES flag set, we cover all lower level domains which could have this flag set and trigger nohz_idle_balancing if any of the levels have more than one busy cpu. sd_busy is irrelevant for asymmetric load balancing. However sd_asym has been introduced to represent the highest sched domain which has SD_ASYM_PACKING flag set so that it can be queried directly when required. While we are at it, we might as well change the nohz_idle parameter to be updated at the sd_busy domain level alone and not the base domain level of a CPU. This will unify the concept of busy cpus at just one level of sched domain where it is currently used. Signed-off-by: Preeti U Murthy<preeti@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: svaidy@linux.vnet.ibm.com Cc: vincent.guittot@linaro.org Cc: bitbucket@online.de Cc: benh@kernel.crashing.org Cc: anton@samba.org Cc: Morten.Rasmussen@arm.com Cc: pjt@google.com Cc: peterz@infradead.org Cc: mikey@neuling.org Link: http://lkml.kernel.org/r/20131030031252.23426.4417.stgit@preeti.in.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> 2013-10-30 08:42:52 +05:30			`DECLARE_PER_CPU(struct sched_domain *, sd_busy);`
			`DECLARE_PER_CPU(struct sched_domain *, sd_asym);`
sched: Highest energy aware balancing sched_domain level pointer Add another member to the family of per-cpu sched_domain shortcut pointers. This one, sd_ea, points to the highest level at which energy model is provided. At this level and all levels below all sched_groups have energy model data attached. Partial energy model information is possible but restricted to providing energy model data for lower level sched_domains (sd_ea and below) and leaving load-balancing on levels above to non-energy-aware load-balancing. For example, it is possible to apply energy-aware scheduling within each socket on a multi-socket system and let normal scheduling handle load-balancing between sockets. cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> 2015-01-02 17:08:52 +00:00			`DECLARE_PER_CPU(struct sched_domain *, sd_ea);`
sched: Calculate energy consumption of sched_group For energy-aware load-balancing decisions it is necessary to know the energy consumption estimates of groups of cpus. This patch introduces a basic function, sched_group_energy(), which estimates the energy consumption of the cpus in the group and any resources shared by the members of the group. NOTE: The function has five levels of identation and breaks the 80 character limit. Refactoring is necessary. cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> 2014-12-18 14:47:18 +00:00			`DECLARE_PER_CPU(struct sched_domain *, sd_scs);`
sched: Only queue remote wakeups when crossing cache boundaries Mike reported a 13% drop in netperf TCP_RR performance due to the new remote wakeup code. Suresh too noticed some performance issues with it. Reducing the IPIs to only cross cache domains solves the observed performance issues. Reported-by: Suresh Siddha <suresh.b.siddha@intel.com> Reported-by: Mike Galbraith <efault@gmx.de> Acked-by: Suresh Siddha <suresh.b.siddha@intel.com> Acked-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Chris Mason <chris.mason@oracle.com> Cc: Dave Kleikamp <dave.kleikamp@oracle.com> Link: http://lkml.kernel.org/r/1323338531.17673.7.camel@twins Signed-off-by: Ingo Molnar <mingo@elte.hu> 2011-12-07 15:07:31 +01:00
sched: Let 'struct sched_group_power' care about CPU capacity It is better not to think about compute capacity as being equivalent to "CPU power". The upcoming "power aware" scheduler work may create confusion with the notion of energy consumption if "power" is used too liberally. Since struct sched_group_power is really about compute capacity of sched groups, let's rename it to struct sched_group_capacity. Similarly sgp becomes sgc. Related variables and functions dealing with groups are also adjusted accordingly. Signed-off-by: Nicolas Pitre <nico@linaro.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: linaro-kernel@lists.linaro.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/n/tip-5yeix833vvgf2uyj5o36hpu9@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> 2014-05-26 18:19:37 -04:00			`struct sched_group_capacity {`
sched: Move struct sched_group to kernel/sched/sched.h Move struct sched_group_power and sched_group and related inline functions to kernel/sched/sched.h, as they are used internally only. Signed-off-by: Li Zefan <lizefan@huawei.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/5135A77F.2010705@huawei.com Signed-off-by: Ingo Molnar <mingo@kernel.org> 2013-03-05 16:06:23 +08:00			`atomic_t ref;`
			`/*`
sched: Let 'struct sched_group_power' care about CPU capacity It is better not to think about compute capacity as being equivalent to "CPU power". The upcoming "power aware" scheduler work may create confusion with the notion of energy consumption if "power" is used too liberally. Since struct sched_group_power is really about compute capacity of sched groups, let's rename it to struct sched_group_capacity. Similarly sgp becomes sgc. Related variables and functions dealing with groups are also adjusted accordingly. Signed-off-by: Nicolas Pitre <nico@linaro.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: linaro-kernel@lists.linaro.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/n/tip-5yeix833vvgf2uyj5o36hpu9@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> 2014-05-26 18:19:37 -04:00			`* CPU capacity of this group, SCHED_LOAD_SCALE being max capacity`
			`* for a single CPU.`
sched: Move struct sched_group to kernel/sched/sched.h Move struct sched_group_power and sched_group and related inline functions to kernel/sched/sched.h, as they are used internally only. Signed-off-by: Li Zefan <lizefan@huawei.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/5135A77F.2010705@huawei.com Signed-off-by: Ingo Molnar <mingo@kernel.org> 2013-03-05 16:06:23 +08:00			`*/`
sched: Add per-cpu max capacity to sched_group_capacity struct sched_group_capacity currently represents the compute capacity sum of all cpus in the sched_group. Unless it is divided by the group_weight to get the average capacity per cpu it hides differences in cpu capacity for mixed capacity systems (e.g. high RT/IRQ utilization or ARM big.LITTLE). But even the average may not be sufficient if the group covers cpus of different capacities. Instead, by extending struct sched_group_capacity to indicate max per-cpu capacity in the group a suitable group for a given task utilization can easily be found such that cpus with reduced capacity can be avoided for tasks with high utilization (not implemented by this patch). Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> 2016-02-25 12:43:49 +00:00			`unsigned long capacity;`
			`unsigned long max_capacity; /* Max per-cpu capacity in group */`
UPSTREAM: sched/fair: Add per-CPU min capacity to sched_group_capacity struct sched_group_capacity currently represents the compute capacity sum of all CPUs in the sched_group. Unless it is divided by the group_weight to get the average capacity per CPU, it hides differences in CPU capacity for mixed capacity systems (e.g. high RT/IRQ utilization or ARM big.LITTLE). But even the average may not be sufficient if the group covers CPUs of different capacities. Instead, by extending struct sched_group_capacity to indicate min per-CPU capacity in the group a suitable group for a given task utilization can more easily be found such that CPUs with reduced capacity can be avoided for tasks with high utilization (not implemented by this patch). Change-Id: If3cae1be62d01a199e752bca5abb45357d5d0fbd Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dietmar.eggemann@arm.com Cc: freedom.tan@mediatek.com Cc: keita.kobayashi.ym@renesas.com Cc: mgalbraith@suse.de Cc: sgurrappadi@nvidia.com Cc: vincent.guittot@linaro.org Cc: yuyang.du@intel.com Link: http://lkml.kernel.org/r/1476452472-24740-4-git-send-email-morten.rasmussen@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> (cherry picked from commit bf475ce0a3dd75b5d1df6c6c14ae25168caa15ac) Signed-off-by: Chris Redpath <chris.redpath@arm.com> 2016-10-14 14:41:09 +01:00			`unsigned long min_capacity; /* Min per-CPU capacity in group */`
sched: Move struct sched_group to kernel/sched/sched.h Move struct sched_group_power and sched_group and related inline functions to kernel/sched/sched.h, as they are used internally only. Signed-off-by: Li Zefan <lizefan@huawei.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/5135A77F.2010705@huawei.com Signed-off-by: Ingo Molnar <mingo@kernel.org> 2013-03-05 16:06:23 +08:00			`unsigned long next_update;`
sched: Let 'struct sched_group_power' care about CPU capacity It is better not to think about compute capacity as being equivalent to "CPU power". The upcoming "power aware" scheduler work may create confusion with the notion of energy consumption if "power" is used too liberally. Since struct sched_group_power is really about compute capacity of sched groups, let's rename it to struct sched_group_capacity. Similarly sgp becomes sgc. Related variables and functions dealing with groups are also adjusted accordingly. Signed-off-by: Nicolas Pitre <nico@linaro.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: linaro-kernel@lists.linaro.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/n/tip-5yeix833vvgf2uyj5o36hpu9@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> 2014-05-26 18:19:37 -04:00			`int imbalance; /* XXX unrelated to capacity but shared group state */`
sched: Move struct sched_group to kernel/sched/sched.h Move struct sched_group_power and sched_group and related inline functions to kernel/sched/sched.h, as they are used internally only. Signed-off-by: Li Zefan <lizefan@huawei.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/5135A77F.2010705@huawei.com Signed-off-by: Ingo Molnar <mingo@kernel.org> 2013-03-05 16:06:23 +08:00			`/*`
			`* Number of busy cpus in this group.`
			`*/`
			`atomic_t nr_busy_cpus;`

			`unsigned long cpumask[0]; /* iteration mask */`
			`};`
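The `capacity`, `max_capacity` and `min_capacity` fields summarize a group in three numbers. The sketch below shows, under assumptions, how such a summary could be computed from per-CPU values and why the average alone can mislead on a mixed-capacity system. `struct cap_summary` and `cap_summarize()` are hypothetical names, and the capacity values in the usage note are illustrative only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical summary type; field names follow the struct above. */
struct cap_summary {
	unsigned long capacity;		/* sum over all CPUs in the group */
	unsigned long max_capacity;	/* largest single-CPU capacity */
	unsigned long min_capacity;	/* smallest single-CPU capacity */
};

/* Fold n per-CPU capacities into sum, max and min. */
static inline struct cap_summary cap_summarize(const unsigned long *cpu_cap,
					       size_t n)
{
	struct cap_summary s = { 0, 0, (unsigned long)-1 };
	size_t i;

	assert(cpu_cap != NULL && n > 0);
	for (i = 0; i < n; i++) {
		s.capacity += cpu_cap[i];
		if (cpu_cap[i] > s.max_capacity)
			s.max_capacity = cpu_cap[i];
		if (cpu_cap[i] < s.min_capacity)
			s.min_capacity = cpu_cap[i];
	}
	return s;
}
```

For a group with two CPUs at 1024 and two at 430, the sum is 2908 and the per-CPU average is 727, a value no CPU in the group actually has; the max and min fields keep that spread visible.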

			`struct sched_group {`
			`struct sched_group *next; /* Must be a circular list */`
			`atomic_t ref;`

			`unsigned int group_weight;`
sched: Let 'struct sched_group_power' care about CPU capacity It is better not to think about compute capacity as being equivalent to "CPU power". The upcoming "power aware" scheduler work may create confusion with the notion of energy consumption if "power" is used too liberally. Since struct sched_group_power is really about compute capacity of sched groups, let's rename it to struct sched_group_capacity. Similarly sgp becomes sgc. Related variables and functions dealing with groups are also adjusted accordingly. Signed-off-by: Nicolas Pitre <nico@linaro.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: linaro-kernel@lists.linaro.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/n/tip-5yeix833vvgf2uyj5o36hpu9@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> 2014-05-26 18:19:37 -04:00			`struct sched_group_capacity *sgc;`
ANDROID: sched: fix duplicate sched_group_energy const specifiers EAS uses "const struct sched_group_energy * const" fairly consistently. But a couple of places swap the "*" and second "const", making the pointer mutable. In the case of struct sched_group, "* const" would have been an error, since init_sched_energy() writes to sd->groups->sge. Change-Id: Ic6a8fcf99e65c0f25d9cc55c32625ef3ca5c9aca Signed-off-by: Greg Hackmann <ghackmann@google.com> 2017-03-07 10:37:56 -08:00			`const struct sched_group_energy *sge;`
sched: Move struct sched_group to kernel/sched/sched.h Move struct sched_group_power and sched_group and related inline functions to kernel/sched/sched.h, as they are used internally only. Signed-off-by: Li Zefan <lizefan@huawei.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/5135A77F.2010705@huawei.com Signed-off-by: Ingo Molnar <mingo@kernel.org> 2013-03-05 16:06:23 +08:00
			`/*`
			`* The CPUs this group covers.`
			`*`
			`* NOTE: this field is variable length. (Allocated dynamically`
			`* by attaching extra space to the end of the structure,`
			`* depending on how many CPUs the kernel has booted up with)`
			`*/`
			`unsigned long cpumask[0];`
			`};`
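Because `next` forms a circular list, a traversal has no NULL terminator and must stop when it returns to its starting group. A minimal user-space sketch of that loop shape follows; `struct ring_group` and `ring_total_weight()` are hypothetical names that keep only the two fields the example needs.

```c
#include <assert.h>
#include <stddef.h>

struct ring_group {
	struct ring_group *next;	/* circular: the last group points at the first */
	unsigned int group_weight;
};

/* Visit every group exactly once, starting anywhere on the ring. */
static inline unsigned int ring_total_weight(struct ring_group *first)
{
	struct ring_group *sg = first;
	unsigned int total = 0;

	assert(first != NULL);
	do {
		total += sg->group_weight;
		sg = sg->next;
	} while (sg != first);

	return total;
}
```

A `do`/`while` loop is used so that a ring containing a single group, whose `next` points at itself, is still visited once.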

			`static inline struct cpumask *sched_group_cpus(struct sched_group *sg)`
			`{`
			`return to_cpumask(sg->cpumask);`
			`}`

			`/*`
			`* cpumask masking which cpus in the group are allowed to iterate up the domain`
			`* tree.`
			`*/`
			`static inline struct cpumask *sched_group_mask(struct sched_group *sg)`
			`{`
sched: Let 'struct sched_group_power' care about CPU capacity It is better not to think about compute capacity as being equivalent to "CPU power". The upcoming "power aware" scheduler work may create confusion with the notion of energy consumption if "power" is used too liberally. Since struct sched_group_power is really about compute capacity of sched groups, let's rename it to struct sched_group_capacity. Similarly sgp becomes sgc. Related variables and functions dealing with groups are also adjusted accordingly. Signed-off-by: Nicolas Pitre <nico@linaro.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: linaro-kernel@lists.linaro.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/n/tip-5yeix833vvgf2uyj5o36hpu9@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> 2014-05-26 18:19:37 -04:00			`return to_cpumask(sg->sgc->cpumask);`
sched: Move struct sched_group to kernel/sched/sched.h Move struct sched_group_power and sched_group and related inline functions to kernel/sched/sched.h, as they are used internally only. Signed-off-by: Li Zefan <lizefan@huawei.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/5135A77F.2010705@huawei.com Signed-off-by: Ingo Molnar <mingo@kernel.org> 2013-03-05 16:06:23 +08:00			`}`

			`/**`
			`* group_first_cpu - Returns the first cpu in the cpumask of a sched_group.`
			`* @group: The group whose first cpu is to be returned.`
			`*/`
			`static inline unsigned int group_first_cpu(struct sched_group *group)`
			`{`
			`return cpumask_first(sched_group_cpus(group));`
			`}`

sched: Fix domain iteration Weird topologies can lead to asymmetric domain setups. This needs further consideration since these setups are typically non-minimal too. For now, make it work by adding an extra mask selecting which CPUs are allowed to iterate up. The topology that triggered it is the one from David Rientjes: 10 20 20 30 20 10 20 20 20 20 10 20 30 20 20 10 resulting in boxes that wouldn't even boot. Reported-by: David Rientjes <rientjes@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-3p86l9cuaqnxz7uxsojmz5rm@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> 2012-05-31 14:47:33 +02:00			`extern int group_balance_cpu(struct sched_group *sg);`

sched/idle: Optimize try-to-wake-up IPI [ This series reduces the number of IPIs on Andy's workload by something like 99%. It's down from many hundreds per second to very few. The basic idea behind this series is to make TIF_POLLING_NRFLAG be a reliable indication that the idle task is polling. Once that's done, the rest is reasonably straightforward. ] When enqueueing tasks on remote LLC domains, we send an IPI to do the work 'locally' and avoid bouncing all the cachelines over. However, when the remote CPU is idle (and polling, say x86 mwait), we don't need to send an IPI, we can simply kick the TIF word to wake it up and have the 'idle' loop do the work. So when _TIF_POLLING_NRFLAG is set, but _TIF_NEED_RESCHED is not (yet) set, set _TIF_NEED_RESCHED and avoid sending the IPI. Much-requested-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Peter Zijlstra <peterz@infradead.org> [Edited by Andy Lutomirski, but this is mostly Peter Zijlstra's code.] Signed-off-by: Andy Lutomirski <luto@amacapital.net> Cc: nicolas.pitre@linaro.org Cc: daniel.lezcano@linaro.org Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: umgwanakikbuti@gmail.com Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/ce06f8b02e7e337be63e97597fc4b248d3aa6f9b.1401902905.git.luto@amacapital.net Signed-off-by: Ingo Molnar <mingo@kernel.org> 2014-06-04 10:31:18 -07:00			`#else`

			`static inline void sched_ttwu_pending(void) { }`

sched: Only queue remote wakeups when crossing cache boundaries Mike reported a 13% drop in netperf TCP_RR performance due to the new remote wakeup code. Suresh too noticed some performance issues with it. Reducing the IPIs to only cross cache domains solves the observed performance issues. Reported-by: Suresh Siddha <suresh.b.siddha@intel.com> Reported-by: Mike Galbraith <efault@gmx.de> Acked-by: Suresh Siddha <suresh.b.siddha@intel.com> Acked-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Chris Mason <chris.mason@oracle.com> Cc: Dave Kleikamp <dave.kleikamp@oracle.com> Link: http://lkml.kernel.org/r/1323338531.17673.7.camel@twins Signed-off-by: Ingo Molnar <mingo@elte.hu> 2011-12-07 15:07:31 +01:00			`#endif /* CONFIG_SMP */`
sched: Make separate sched.c translation units Since once needs to do something at conferences and fixing compile warnings doesn't actually require much if any attention I decided to break up the sched.c #include ".c" fest. This further modularizes the scheduler code. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-x0fcd3mnp8f9c99grcpewmhi@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu> 2011-10-25 10:00:11 +02:00
sched: Move all scheduler bits into kernel/sched/ There's too many sched*.[ch] files in kernel/, give them their own directory. (No code changed, other than Makefile glue added.) Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu> 2011-11-15 17:14:39 +01:00			`#include "stats.h"`
			`#include "auto_group.h"`
sched: Make separate sched.c translation units Since once needs to do something at conferences and fixing compile warnings doesn't actually require much if any attention I decided to break up the sched.c #include ".c" fest. This further modularizes the scheduler code. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-x0fcd3mnp8f9c99grcpewmhi@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu> 2011-10-25 10:00:11 +02:00
sched: fix compiler errors with !SCHED_HMP HMP scheduler boost feature related functions are referred in SMP load balancer. Add the nop functions for the same to fix the compiler errors with !SCHED_HMP. Change-Id: I1cbcf67f728c2cbc7c0f47e8eaf1f4165649dce8 Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org> 2017-01-11 15:11:23 +05:30			`enum sched_boost_policy {`
			`SCHED_BOOST_NONE,`
			`SCHED_BOOST_ON_BIG,`
			`SCHED_BOOST_ON_ALL,`
			`};`

sched: window-stats: Enhance cpu busy time accounting rq->curr/prev_runnable_sum counters represent cpu demand from various tasks that have run on a cpu. Any task that runs on a cpu will have a representation in rq->curr_runnable_sum. Their partial_demand value will be included in rq->curr_runnable_sum. Since partial_demand is derived from historical load samples for a task, rq->curr_runnable_sum could represent "inflated/un-realistic" cpu usage. As an example, lets say that task with partial_demand of 10ms runs for only 1ms on a cpu. What is included in rq->curr_runnable_sum is 10ms (and not the actual execution time of 1ms). This leads to cpu busy time being reported on the upside causing frequency to stay higher than necessary. This patch fixes cpu busy accounting scheme to strictly represent actual usage. It also provides for conditional fixup of busy time upon migration and upon heavy-task wakeup. CRs-Fixed: 691443 Change-Id: Ic4092627668053934049af4dfef65d9b6b901e6b Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org> Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> [joonwoop@codeaurora.org: fixed conflict in init_task_load(), se.avg.decay_count has deprecated.] Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> 2014-09-01 13:26:53 +05:30			`#ifdef CONFIG_SCHED_HMP`
sched: Introduce CONFIG_SCHED_FREQ_INPUT Introduce a compile time flag to enable scheduler guidance of frequency selection. This flag is also used to turn on or off window-based load stats feature. Having a compile time flag will let some platforms avoid any overhead that may be present with this scheduler feature. Change-Id: Id8dec9839f90dcac82f58ef7e2bd0ccd0b6bd16c Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org> [rameezmustafa@codeaurora.org]: Port to msm-3.18] Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> [joonwoop@codeaurora.org: fixed minor conflict around sysctl_timer_migration.] Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> 2014-03-29 16:56:45 -07:00
sched: window-stats: add a new AVG policy The current WINDOW_STATS_AVG policy is actually a misnomer since it uses the maximum value of the runtime in the recent window and the average of the past ravg_hist_size windows. Add a policy that only uses the average and call it WINDOW_STATS_AVG policy. Rename all the other polices to make them shorter and unambiguous. Change-Id: I080a4ea072a84a88858ca9da59a4151dfbdbe62c Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> 2014-09-04 16:24:42 -07:00			`#define WINDOW_STATS_RECENT 0`
			`#define WINDOW_STATS_MAX 1`
			`#define WINDOW_STATS_MAX_RECENT_AVG 2`
			`#define WINDOW_STATS_AVG 3`
			`#define WINDOW_STATS_INVALID_POLICY 4`
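The commit messages around these defines describe the policies in words: pick the most recent sample, the maximum, the average, or the larger of the most recent sample and the average. A minimal sketch of that selection follows, under stated assumptions. The function name `window_demand_sketch` and the convention that `hist[0]` holds the most recent window are hypothetical; this is not the HMP scheduler's own demand code.

```c
#include <stddef.h>
#include <stdint.h>

#define WINDOW_STATS_RECENT		0
#define WINDOW_STATS_MAX		1
#define WINDOW_STATS_MAX_RECENT_AVG	2
#define WINDOW_STATS_AVG		3
#define WINDOW_STATS_INVALID_POLICY	4

/*
 * Pick a task's demand from its per-window history.
 * Assumption: hist[0] is the most recent window's sample.
 */
static uint32_t window_demand_sketch(const uint32_t *hist, size_t n, int policy)
{
	uint64_t sum = 0;	/* 64-bit accumulator so the sum cannot overflow */
	uint32_t max = 0, avg;

	if (n == 0)
		return 0;

	for (size_t i = 0; i < n; i++) {
		sum += hist[i];
		if (hist[i] > max)
			max = hist[i];
	}
	avg = (uint32_t)(sum / n);

	switch (policy) {
	case WINDOW_STATS_RECENT:
		return hist[0];
	case WINDOW_STATS_MAX:
		return max;
	case WINDOW_STATS_MAX_RECENT_AVG:
		return hist[0] > avg ? hist[0] : avg;
	case WINDOW_STATS_AVG:
		return avg;
	default:
		return 0;	/* unknown policy: the sketch reports no demand */
	}
}
```

With a history of `{90, 10, 10, 10, 10}`, the plain average is 26 while `WINDOW_STATS_MAX_RECENT_AVG` yields 90, which is the sharp-increase case the "Window-based load stat improvements" message is concerned with.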
sched: window-stats: Code cleanup Remove code duplication associated with update of various window-stats related sysctl tunables Change-Id: I64e29ac065172464ba371a03758937999c42a71f Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org> 2014-08-11 09:22:24 +05:30
sched: Move most HMP specific code to a separate file. Most code pertaining to CONFIG_SCHED_HMP has been moved to a separate file "hmp.c" in order to facilitate kernel upgrades. Fewer changes in the original scheduler files means fewer conflicts. Some parts of code, however, could not be moved to the separate file either because of dependencies with other non-HMP code or because the changes are specific only to the scheduling classes where the code resides. Change-Id: Ib067ac75e5a494008dcb3c67586b622c1b3962ce Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> 2016-08-01 17:48:21 -07:00			`#define SCHED_UPMIGRATE_MIN_NICE 15`
			`#define EXITING_TASK_MARKER 0xdeaddead`

			`#define UP_MIGRATION 1`
			`#define DOWN_MIGRATION 2`
			`#define IRQLOAD_MIGRATION 3`
sched: Remove all existence of CONFIG_SCHED_FREQ_INPUT CONFIG_SCHED_FREQ_INPUT was created to keep parts of the scheduler dealing with frequency separate from other parts of the scheduler that deal with task placement. However, overtime the two features have become intricately linked whereby SCHED_FREQ_INPUT cannot be turned on without having SCHED_HMP turned on as well. Given this complex inter-dependency and the fact that all old, existing and future targets use both config options, remove this unnecessary feature separation. It will aid in making kernel upgrades a lot simpler and faster. Change-Id: Ia20e40d8a088d50909cc28f5be758fa3e9a4af6f Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> 2016-07-28 19:18:08 -07:00
sched: window-stats: use policy_mutex in sched_set_window() Several configuration variable change will result in reset_all_window_stats() being called. All of them, except sched_set_window(), are serialized via policy_mutex. Take policy_mutex in sched_set_window() as well to serialize use of reset_all_window_stats() function Change-Id: Iada7ff8ac85caa1517e2adcf6394c5b050e3968a Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org> 2014-08-20 15:39:05 +05:30			`extern struct mutex policy_mutex;`
sched: Window-based load stat improvements Some tasks can have a sporadic load pattern such that they can suddenly start running for longer intervals of time after running for shorter durations. To recognize such sharp increase in tasks' demands, max between the average of 5 window load samples and the most recent sample is chosen as the task demand. Make the window size (sched_ravg_window) configurable at boot up time. To prevent users from setting inappropriate values for window size, min and max limits are defined. As 'ravg' struct tracks load for both real-time and non real-time tasks it is moved out of sched_entity struct. In order to prevent changing function signatures for move_tasks() and move_one_task() per-cpu variables are defined to track the total load moved. In case multiple tasks are selected to migrate in one load balance operation, loads > 100 could be sent through migration notifiers. Prevent this scenario by setting mnd.load to 100 in such cases. Define wrapper functions to compute cpu demands for tasks and to change rq->cumulative_runnable_avg. Change-Id: I9abfbf3b5fe23ae615a6acd3db9580cfdeb515b4 Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org> Signed-off-by: Rohit Gupta <rohgup@codeaurora.org> [rameezmustafa@codeaurora.org: Port to msm-3.18 and squash "dcf7256 sched: window-stats: Fix overflow bug" into this patch.] Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> [joonwoop@codeaurora.org: fixed conflict in __migrate_task().] Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> 2014-03-29 11:40:16 -07:00			`extern unsigned int sched_ravg_window;`
sched: window_stats: Add "disable" mode support "disabled" mode (sched_disble_window_stats = 1) disables all window-stats related activity. This is useful when changing key configuration variables associated with window-stats feature (like policy or window size). Change-Id: I9e55c9eb7f7e3b1b646079c3aa338db6259a9cfe Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org> 2014-08-19 12:31:54 +05:30			`extern unsigned int sched_disable_window_stats;`
sched: window-based load stats improvements Following cleanups and improvements are made to window-based load stats feature: * Add sysctl to pick max, avg or most recent samples as task's demand. * Fix overflow possibility in calculation of sum for average policy. * Use unscaled statistics when a task is running on a CPU which is thermally throttled. Change-Id: I8293565ca0c2a785dadf8adb6c67f579a445ed29 Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org> 2014-03-29 11:40:16 -07:00			`extern unsigned int max_possible_freq;`
			`extern unsigned int min_max_freq;`
sched: Window-based load stat improvements Some tasks can have a sporadic load pattern such that they can suddenly start running for longer intervals of time after running for shorter durations. To recognize such sharp increase in tasks' demands, max between the average of 5 window load samples and the most recent sample is chosen as the task demand. Make the window size (sched_ravg_window) configurable at boot up time. To prevent users from setting inappropriate values for window size, min and max limits are defined. As 'ravg' struct tracks load for both real-time and non real-time tasks it is moved out of sched_entity struct. In order to prevent changing function signatures for move_tasks() and move_one_task() per-cpu variables are defined to track the total load moved. In case multiple tasks are selected to migrate in one load balance operation, loads > 100 could be sent through migration notifiers. Prevent this scenario by setting mnd.load to 100 in such cases. Define wrapper functions to compute cpu demands for tasks and to change rq->cumulative_runnable_avg. Change-Id: I9abfbf3b5fe23ae615a6acd3db9580cfdeb515b4 Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org> Signed-off-by: Rohit Gupta <rohgup@codeaurora.org> [rameezmustafa@codeaurora.org: Port to msm-3.18 and squash "dcf7256 sched: window-stats: Fix overflow bug" into this patch.] Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> [joonwoop@codeaurora.org: fixed conflict in __migrate_task().] Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> 2014-03-29 11:40:16 -07:00			`extern unsigned int pct_task_load(struct task_struct *p);`
sched: Introduce efficiency, load_scale_factor and capacity Efficiency reflects instructions per cycle capability of a cpu. load_scale_factor reflects magnification factor that is applied for task load when estimating bandwidth it will consume on a cpu. It accounts for the fact that task load is scaled in reference to "best" cpu that has best efficiency factor and also best possible max_freq. Note that there may be no single CPU in the system that has both the best efficiency and best possible max_freq, but that is still the combination that all task load in the system is scaled against. capacity reflects max_freq and efficiency metric of a cpu. It is defined such that the "least" performing cpu (one with lowest efficiency factor and max_freq) gets capacity of 1024. Again, there may not be a CPU in the system that has both the lowest efficiency and lowest max_freq. This is still the combination that is assigned a capacity of 1024 however, other CPU capacities are relative to this. Change-Id: I4a853f1f0f90020721d2a4ee8b10db3d226b287c Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org> [rameezmustafa@codeaurora.org]: Port to msm-3.18] Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> 2014-03-29 19:07:28 -07:00			`extern unsigned int max_possible_efficiency;`
			`extern unsigned int min_possible_efficiency;`
			`extern unsigned int max_capacity;`
			`extern unsigned int min_capacity;`
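The commit message for these variables says capacity reflects a CPU's efficiency and max_freq, normalized so that the combination of lowest efficiency and lowest max_freq maps to 1024. One way to read that is sketched below. The function name, the parameter names and the exact order of the arithmetic are assumptions for illustration; the scheduler's own computation may differ.

```c
#include <stdint.h>

/*
 * Relative capacity of a CPU, scaled so that the (lowest efficiency,
 * lowest max_freq) combination yields 1024. All parameters are
 * hypothetical inputs; frequencies only need to share one unit.
 */
static unsigned int capacity_sketch(unsigned int efficiency,
				    unsigned int max_freq,
				    unsigned int min_efficiency,
				    unsigned int min_max_freq)
{
	uint64_t capacity = 1024;

	if (min_efficiency == 0 || min_max_freq == 0)
		return 0;	/* avoid dividing by zero on bad input */

	capacity = capacity * efficiency / min_efficiency;
	capacity = capacity * max_freq / min_max_freq;
	return (unsigned int)capacity;
}
```

Under this reading, a CPU with twice the efficiency at the same max_freq gets 2048, and the reference combination need not correspond to any single CPU in the system, as the commit message points out.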
sched/fair: Introduce C-state aware task placement for small tasks Small tasks execute for small durations. This means that the power cost of taking CPUs out of a low power mode outweigh any performance advantage of using an idle core or power advantage of using the most power efficient CPU. Introduce C-state aware task placement for small tasks. This requires a two pass approach where we first determine the most power effecient CPU and establish a band of CPUs offering a similar power cost for the task. The order of preference then is as follows: 1) Any mostly idle CPU in active C-state in the same power band. 2) A CPU with the shallowest C-state in the same power band. 3) A CPU with the least load in the same power band. 4) Lowest power CPU in a higher power band. The patch also modifies the definition of a small task. Small tasks are now determined relative to minimum capacity CPUs in the system and not the task CPU. Change-Id: Ia09840a5972881cad7ba7bea8fe34c45f909725e Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> 2014-06-04 13:18:02 -07:00			`extern unsigned int max_load_scale_factor;`
sched: Update max_capacity when an entire cluster is hotplugged When an entire cluster is hotplugged, the scheduler's notion of max_capacity can get outdated. This introduces the following inefficiencies in behavior: * task_will_fit() does not return true on all tasks. Consequently all big tasks go through fallback CPU selection logic skipping C-state and power checks in select_best_cpu(). * During boost, migration_needed() return true unnecessarily causing an avoidable rerun of select_best_cpu(). * An unnecessary kick is sent to all little CPUs when boost is set. * An opportunity for early bailout from nohz_kick_needed() is lost. Start handling CPUFREQ_REMOVE_POLICY in the policy notifier callback which indicates the last CPU in a cluster being hotplugged out. Also modify update_min_max_capacity() to only iterate through online CPUs instead of possible CPUs. While we can't guarantee the integrity of the cpu_online_mask in the notifier callback, the scheduler will fix up all state soon after any changes to the online mask. The change does have one side effect; early termination from the notifier callback when min_max_freq or max_possible_freq remain unchanged is no longer possible. This is because when the last CPU in a cluster is hot removed, only max_capacity is updated without affecting min_max_freq or max_possible_freq. Therefore, when the first CPU in the same cluster gets hot added at a later point max_capacity must once again be recomputed despite there being no change in min_max_freq or max_possible_freq. Change-Id: I9a1256b5c2cd6fcddd85b069faf5e2ace177e122 Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> 2015-02-20 17:09:41 -08:00			`extern unsigned int max_possible_capacity;`
sched: Revise the inter cluster load balance restrictions The frequency based inter cluster load balance restrictions are not reliable as frequency does not provide a good estimate of the CPU's current load. Replace them with the spill_load and spill_nr_run based checks. The higher capacity cluster is restricted from pulling the tasks from the lower capacity cluster unless all of the lower capacity CPUs are above spill. This behavior can be controlled by a sysctl tunable and it is disabled by default (i.e. no load balance restrictions). Change-Id: I45c09c8adcb61a8a7d4e08beadf2f97f1805fb42 Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org> [joonwoop@codeaurora.org: fixed merge conflicts due to omitted changes for CONFIG_SCHED_QHMP.] 2015-12-04 06:34:03 +05:30			`extern unsigned int min_max_possible_capacity;`
sched: don't assume higher capacity means higher power in lb The load balancer restrictions are in place to control the tasks migration from the lower capacity cluster to higher capacity cluster to save power. The assumption here is that higher capacity cluster will have higher power cost which may not be necessarily true for all platforms. Use power cost based checks instead of capacity based checks while applying the inter cluster migration restrictions. Change-Id: Id9519eb8f7b183a2e9fca87a23cf95e951aa4005 Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org> 2016-04-13 15:13:56 +05:30			`extern unsigned int max_power_cost;`
sched: Basic task placement support for HMP systems HMP systems have cpus with different power and performance characteristics. Some cpus could offer better power at cost of lower performance while other cpus could offer better performance at cost of higher power. As a result, bandwidth consumed by a task to do some "fixed" amount of work could vary across cpus. Optimal task placement on HMP would involve placing a task on a cpu where it can meet its performance goals at lowest power cost. Since kernel has little to no awareness of performance goals of applications, we guestimate whether task is meeting its performance goals or not by looking at its cpu bandwidth consumption. High bandwidth consumption could imply that task's performance can improve by running on cpus with better capacity/performance-characterisitcs. This patch makes the basic changes to support HMP. It provides a configurable threshold and any task consuming bandwidth in excess of threshold will be placed on a cpu with better capacity. Change-Id: I3fd98edd430f73342fbef06411e8b2d1cf2f56fa Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org> [rameezmustafa@codeaurora.org]: Port to msm-3.18] Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> [joonwoop@codeaurora.org: fixed conflict about members of p->se which are not available anymore.] Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> 2014-03-29 20:04:42 -07:00			`extern unsigned int sched_init_task_load_windows;`
sched: auto adjust the upmigrate and downmigrate thresholds The load scale factor of a CPU gets boosted when its max freq is restricted. A task load at the same frequency is scaled higher than normal under this scenario. This results in tasks migrating early to the better capacity CPUs and their residency over there also gets increased as their inflated load would be relatively higher than than the downmigrate threshold. Auto adjust the upmigrate and downmigrate thresholds by a factor equal to rq->max_possible_freq/rq->max_freq of a lower capacity CPU. If the adjusted upmigrate threshold exceeds the window size, it is clipped to the window size. If the adjusted downmigrate threshold decreases the difference between the upmigrate and downmigrate, it is clipped to a value such that the difference between the modified and the original thresholds is same. Change-Id: Ifa70ee5d4ca5fe02789093c7f070c77629907f04 Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org> 2015-04-10 15:10:56 +05:30			`extern unsigned int up_down_migrate_scale_factor;`
sched: Provide a facility to restrict RT tasks to lower power cluster The current CPU selection algorithm for RT tasks looks for the least loaded CPU in all clusters. Stop the search at the lowest possible power cluster based on "sched_restrict_cluster_spill" sysctl tunable. Change-Id: I34fdaefea56e0d1b7e7178d800f1bb86aa0ec01c Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org> 2015-12-14 14:50:12 +05:30			`extern unsigned int sysctl_sched_restrict_cluster_spill;`
sched: Add separate load tracking histogram to predict loads Current window based load tracking only saves history for five windows. A historically heavy task's heavy load will be completely forgotten after five windows of light load. Even before the five window expires, a heavy task wakes up on same CPU it used to run won't trigger any frequency change until end of the window. It would starve for the entire window. It also adds one "small" load window to history because it's accumulating load at a low frequency, further reducing the tracked load for this heavy task. Ideally, scheduler should be able to identify such tasks and notify governor to increase frequency immediately after it wakes up. Add a histogram for each task to track a much longer load history. A prediction will be made based on runtime of previous or current window, histogram data and load tracked in recent windows. Prediction of all tasks that is currently running or runnable on a CPU is aggregated and reported to CPUFreq governor in sched_get_cpus_busy(). sched_get_cpus_busy() now returns predicted busy time in addition to previous window busy time and new task busy time, scaled to the CPU maximum possible frequency. Tunables: - /proc/sys/kernel/sched_gov_alert_freq (KHz) This tunable can be used to further filter the notifications. Frequency alert notification is sent only when the predicted load exceeds previous window load by sched_gov_alert_freq converted to load. Change-Id: If29098cd2c5499163ceaff18668639db76ee8504 Suggested-by: Saravana Kannan <skannan@codeaurora.org> Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org> Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> Signed-off-by: Junjie Wu <junjiew@codeaurora.org> [joonwoop@codeaurora.org: fixed merge conflicts around __migrate_task() and removed changes for CONFIG_SCHED_QHMP.] 2015-06-08 09:08:47 +05:30			`extern unsigned int sched_pred_alert_load;`
sched: Move most HMP specific code to a separate file. Most code pertaining to CONFIG_SCHED_HMP has been moved to a separate file "hmp.c" in order to facilitate kernel upgrades. Fewer changes in the original scheduler files means fewer conflicts. Some parts of code, however, could not be moved to the separate file either because of dependencies with other non-HMP code or because the changes are specific only to the scheduling classes where the code resides. Change-Id: Ib067ac75e5a494008dcb3c67586b622c1b3962ce Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> 2016-08-01 17:48:21 -07:00			`extern struct sched_cluster init_cluster;`
			`extern unsigned int __read_mostly sched_short_sleep_task_threshold;`
			`extern unsigned int __read_mostly sched_long_cpu_selection_threshold;`
			`extern unsigned int __read_mostly sched_big_waker_task_load;`
			`extern unsigned int __read_mostly sched_small_wakee_task_load;`
			`extern unsigned int __read_mostly sched_spill_load;`
			`extern unsigned int __read_mostly sched_upmigrate;`
			`extern unsigned int __read_mostly sched_downmigrate;`
			`extern unsigned int __read_mostly sysctl_sched_spill_nr_run;`
sched: Add the mechanics of top task tracking for frequency guidance The previous patches in this rewrite of scheduler guided frequency selection reintroduces the part-picture problem that we addressed in our initial implementation. In that, when tasks migrate across CPUs within a cluster, we end up losing the complete picture of the sequential nature of the workload. This patch aims to solve that problem slightly differently. We track the top task on every CPU within a window. Top task is defined as the task that runs the most in a given window. This enhances our ability to detect the sequential nature of workloads. A single migrating task executing for an entire window will cause 100% load to be reported for frequency guidance instead of the maximum footprint left on any individual CPU in the task's trail. There are cases, that this new approach does not address. Namely, cases where the sum of two or more tasks accurately reflects the true sequential nature of the workload. Future optimizations might aim to tackle that problem. To track top tasks, we first realize that there is no strict need to maintain the task struct itself as long as we know the load exerted by the top task. We also realize that to maintain top tasks on every CPU we have to track the execution of every single task that runs during the window. The load associated with a task needs to be migrated when the task migrates from one CPU to another. When the top task migrates away, we need to locate the second top task and so on. Given the above realizations, we use hashmaps to track top task load both for the current and the previous window. This hashmap is implemented as an array of fixed size. The key of the hashmap is given by task_execution_time_in_a_window / array_size. The size of the array (number of buckets in the hashmap) dictate the load granularity of each bucket. The value stored in each bucket is a refcount of all the tasks that executed long enough to be in that bucket. 
This approach has a few benefits. Firstly, any top task stats update now take O(1) time. While task migration is also O(1), it does still involve going through up to the size of the array to find the second top task. Further patches will aim to optimize this behavior. Secondly, and more importantly, not having to store the task struct itself saves a lot of memory usage in that 1) there is no need to retrieve task structs later causing cache misses and 2) we don't have to unnecessarily hold up task memory for up to 2 full windows by calling get_task_struct() after a task exits. Change-Id: I004dba474f41590db7d3f40d9deafe86e71359ac Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> 2016-05-31 16:40:45 -07:00			`extern unsigned int __read_mostly sched_load_granule;`
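The top-task commit message describes a fixed-size array of buckets, each holding a refcount of the tasks whose execution time in the window falls into that bucket. A minimal sketch of that idea follows. The window length, the bucket count, the `top_tasks_*` names, and the choice to index by load divided by a per-bucket granule are all assumptions for illustration, not the scheduler's own code.

```c
#include <stdint.h>

#define SKETCH_WINDOW_NS	20000000u	/* hypothetical 20 ms window */
#define SKETCH_NUM_BUCKETS	1000u		/* hypothetical array size */
#define SKETCH_GRANULE_NS	(SKETCH_WINDOW_NS / SKETCH_NUM_BUCKETS)

/* One refcount per load bucket; no task pointers are stored. */
struct top_tasks_sketch {
	unsigned int refcount[SKETCH_NUM_BUCKETS];
};

/* Map a task's execution time in the window to its bucket, clamped to the last one. */
static unsigned int load_to_index(uint32_t load_ns)
{
	unsigned int idx = load_ns / SKETCH_GRANULE_NS;

	return idx < SKETCH_NUM_BUCKETS ? idx : SKETCH_NUM_BUCKETS - 1;
}

/*
 * O(1) update: move one task from its old bucket to its new one.
 * A load of 0 means "not tracked", so it covers arrival and departure.
 */
static void top_tasks_update(struct top_tasks_sketch *t,
			     uint32_t old_ns, uint32_t new_ns)
{
	if (old_ns && t->refcount[load_to_index(old_ns)])
		t->refcount[load_to_index(old_ns)]--;
	if (new_ns)
		t->refcount[load_to_index(new_ns)]++;
}

/* Scan down from the highest bucket; returns -1 when no task is tracked. */
static int top_tasks_find(const struct top_tasks_sketch *t)
{
	for (int i = SKETCH_NUM_BUCKETS - 1; i >= 0; i--)
		if (t->refcount[i])
			return i;
	return -1;
}
```

When the top task migrates away, `top_tasks_update(t, load, 0)` drops its refcount and `top_tasks_find()` walks the array to locate the next highest bucket, which is the bounded scan the commit message mentions.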
sched: Add separate load tracking histogram to predict loads Current window based load tracking only saves history for five windows. A historically heavy task's heavy load will be completely forgotten after five windows of light load. Even before the five window expires, a heavy task wakes up on same CPU it used to run won't trigger any frequency change until end of the window. It would starve for the entire window. It also adds one "small" load window to history because it's accumulating load at a low frequency, further reducing the tracked load for this heavy task. Ideally, scheduler should be able to identify such tasks and notify governor to increase frequency immediately after it wakes up. Add a histogram for each task to track a much longer load history. A prediction will be made based on runtime of previous or current window, histogram data and load tracked in recent windows. Prediction of all tasks that is currently running or runnable on a CPU is aggregated and reported to CPUFreq governor in sched_get_cpus_busy(). sched_get_cpus_busy() now returns predicted busy time in addition to previous window busy time and new task busy time, scaled to the CPU maximum possible frequency. Tunables: - /proc/sys/kernel/sched_gov_alert_freq (KHz) This tunable can be used to further filter the notifications. Frequency alert notification is sent only when the predicted load exceeds previous window load by sched_gov_alert_freq converted to load. Change-Id: If29098cd2c5499163ceaff18668639db76ee8504 Suggested-by: Saravana Kannan <skannan@codeaurora.org> Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org> Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> Signed-off-by: Junjie Wu <junjiew@codeaurora.org> [joonwoop@codeaurora.org: fixed merge conflicts around __migrate_task() and removed changes for CONFIG_SCHED_QHMP.] 2015-06-08 09:08:47 +05:30
sched/walt: Fix the memory leak of idle task load pointers The memory for task load pointers are allocated twice for each idle thread except for the boot CPU. This happens during boot from idle_threads_init()->idle_init() in the following 2 paths. 1. idle_init()->fork_idle()->copy_process()-> sched_fork()->init_new_task_load() 2. idle_init()->fork_idle()-> init_idle()->init_new_task_load() The memory allocation for all tasks happens through the 1st path, so use the same for idle tasks and kill the 2nd path. Since the idle thread of boot CPU does not go through fork_idle(), allocate the memory for it separately. Change-Id: I4696a414ffe07d4114b56d326463026019e278f1 Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org> [schikk@codeaurora.org: resolved merge conflicts] Signed-off-by: Swetha Chikkaboraiah <schikk@codeaurora.org> 2018-09-20 15:31:36 +05:30			`extern void init_new_task_load(struct task_struct *p);`
sched: Move most HMP specific code to a separate file. Most code pertaining to CONFIG_SCHED_HMP has been moved to a separate file "hmp.c" in order to facilitate kernel upgrades. Fewer changes in the original scheduler files means fewer conflicts. Some parts of code, however, could not be moved to the separate file either because of dependencies with other non-HMP code or because the changes are specific only to the scheduling classes where the code resides. Change-Id: Ib067ac75e5a494008dcb3c67586b622c1b3962ce Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> 2016-08-01 17:48:21 -07:00			`extern u64 sched_ktime_clock(void);`
			`extern int got_boost_kick(void);`
			`extern int register_cpu_cycle_counter_cb(struct cpu_cycle_counter_cb *cb);`
			`extern void update_task_ravg(struct task_struct *p, struct rq *rq, int event,`
			`u64 wallclock, u64 irqtime);`
			`extern bool early_detection_notify(struct rq *rq, u64 wallclock);`
			`extern void clear_ed_task(struct task_struct *p, struct rq *rq);`
			`extern void fixup_busy_time(struct task_struct *p, int new_cpu);`
			`extern void clear_boost_kick(int cpu);`
			`extern void clear_hmp_request(int cpu);`
			`extern void mark_task_starting(struct task_struct *p);`
			`extern void set_window_start(struct rq *rq);`
			`extern void update_cluster_topology(void);`
sched: Track average sleep time Similar to tracking average burst length for tasks, average sleep time indicates how much a task sleeps on an average before waking up to run. Very low sleep and burst lengths indicates tasks that could be sensitive to task-wake latencies and hence should not be packed. Change-Id: Ife68a9a9a9e596246aab5029f60e41c5bad781e4 Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org> Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org> 2016-09-09 19:50:27 +05:30			`extern void note_task_waking(struct task_struct *p, u64 wallclock);`
sched: Move most HMP specific code to a separate file. Most code pertaining to CONFIG_SCHED_HMP has been moved to a separate file "hmp.c" in order to facilitate kernel upgrades. Fewer changes in the original scheduler files means fewer conflicts. Some parts of code, however, could not be moved to the separate file either because of dependencies with other non-HMP code or because the changes are specific only to the scheduling classes where the code resides. Change-Id: Ib067ac75e5a494008dcb3c67586b622c1b3962ce Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> 2016-08-01 17:48:21 -07:00			`extern void set_task_last_switch_out(struct task_struct *p, u64 wallclock);`
			`extern void init_clusters(void);`
sched: Support CFS_BANDWIDTH feature in HMP scheduler CFS_BANDWIDTH feature is not currently well-supported by HMP scheduler. Issues encountered include a kernel panic when rq->nr_big_tasks count becomes negative. This patch fixes HMP scheduler code to better handle CFS_BANDWIDTH feature. The most prominent change introduced is maintenance of HMP stats (nr_big_tasks, nr_small_tasks, cumulative_runnable_avg) per 'struct cfs_rq' in addition to being maintained in each 'struct rq'. This allows HMP stats to be updated easily when a group is throttled on a cpu. Change-Id: Iad9f378b79ab5d9d76f86d1775913cc1941e266a Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org> [rameezmustafa@codeaurora.org: Port to msm-3.18] Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> [joonwoop@codeaurora.org: fixed minor conflict in dequeue_task_fair().] 2015-01-16 13:57:02 +05:30			`extern void reset_cpu_hmp_stats(int cpu, int reset_cra);`
			`extern unsigned int max_task_load(void);`
sched: window-stats: Account interrupt handling time as busy time Account cycles spent by idle cpu handling interrupts (irq or softirq) towards its busy time. Change-Id: I84cc084ced67502e1cfa7037594f29ed2305b2b1 Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org> [joonwoop@codeaurora.org: fixed minor conflict in core.c] Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> 2014-07-30 01:24:34 -07:00			`extern void sched_account_irqtime(int cpu, struct task_struct *curr,`
			`u64 delta, u64 wallclock);`
sched: fix potential deflated frequency estimation during IRQ handling Time between mark_start of idle task and IRQ handler entry time is CPU cycle counter stall period. Therefore it's inappropriate to include such duration as part of sample period when we do frequency estimation. Fix such suboptimality by replenishing idle task's CPU cycle counter upon IRQ entry and using irqtime as time delta. Change-Id: I274d5047a50565cfaaa2fb821ece21c8cf4c991d Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> 2016-04-29 10:58:21 -07:00			`extern void sched_account_irqstart(int cpu, struct task_struct *curr,`
			`u64 wallclock);`
sched: Move most HMP specific code to a separate file. Most code pertaining to CONFIG_SCHED_HMP has been moved to a separate file "hmp.c" in order to facilitate kernel upgrades. Fewer changes in the original scheduler files means fewer conflicts. Some parts of code, however, could not be moved to the separate file either because of dependencies with other non-HMP code or because the changes are specific only to the scheduling classes where the code resides. Change-Id: Ib067ac75e5a494008dcb3c67586b622c1b3962ce Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> 2016-08-01 17:48:21 -07:00			`extern unsigned int cpu_temp(int cpu);`
sched: Keep track of average nr_big_tasks Extend sched_get_nr_running_avg() API to return average nr_big_tasks, in addition to average nr_running and average nr_io_wait tasks. Also add a new trace point to record values returned by sched_get_nr_running_avg() API. Change-Id: Id3591e6d04da8db484b4d1cb9d95dba075f5ab9a Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org> [rameezmustafa@codeaurora.org: Resolve trivial merge conflicts] Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> 2015-01-30 11:52:37 +05:30			`extern unsigned int nr_eligible_big_tasks(int cpu);`
sched: Move most HMP specific code to a separate file. Most code pertaining to CONFIG_SCHED_HMP has been moved to a separate file "hmp.c" in order to facilitate kernel upgrades. Fewer changes in the original scheduler files means fewer conflicts. Some parts of code, however, could not be moved to the separate file either because of dependencies with other non-HMP code or because the changes are specific only to the scheduling classes where the code resides. Change-Id: Ib067ac75e5a494008dcb3c67586b622c1b3962ce Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> 2016-08-01 17:48:21 -07:00			`extern int update_preferred_cluster(struct related_thread_group *grp,`
			`struct task_struct *p, u32 old_load);`
			`extern void set_preferred_cluster(struct related_thread_group *grp);`
sched: inherit the group id from the group leader When sysctl_sched_enable_thread_grouping is set to 1, any new tasks created are put in the same group as their group leader. Change-Id: If1837dd7c8120c8b097cfffa1dc52eb4781f1641 Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org> 2015-10-21 16:04:46 +05:30			`extern void add_new_task_to_grp(struct task_struct *new);`
sched/hmp: Enhance co-location and scheduler boost features The recent introduction of the schedtune cgroup controller has provided the scheduler with added flexibility in terms of some of its placement features. In particular each cgroup under the schedtune controller can now specify: 1) Whether it needs co-location along with other cgroups 2) Whether it is eligible for scheduler boost (sched_boost_enabled) 3) Whether the kernel can override the boost eligibility when necessary (sched_boost_no_override) The scheduler now creates a reserved co-location group at boot. This group is used to co-locate all tasks that form part of any one of the cgroups that have co-location enabled. This reserved group can neither be destroyed nor reused for other purposes. Furthermore, cgroups are only allowed to indicate their co-location preference once at boot. Further updates are disallowed. Since we are now creating co-location groups for an extended period of time, there are a few other factors to consider when determining the preferred cluster for the group. We first exclude any tasks in the group that have not been observed to be running for a significant amount of time. Secondly we introduce the notion of group up and down migrate tunables to allow different migration policies than individual tasks. Lastly we break co-location if a single task in a group exceeds up-migrate but the total load of the group does not exceed group up-migrate. In terms of sched_boost, the scheduler now supports multiple types of boost. These are: 1) FULL_THROTTLE : Force up-migrate tasks belonging to any cgroup that has the sched_boost_enabled flag turned on. Little CPUs will only be used when big CPUs can no longer accommodate tasks. Also up-migrate all RT tasks. 2) CONSERVATIVE : Override the sched_boost_enabled flag for all cgroups except those that have the sched_boost_no_override flag set. Force up-migrate all tasks belonging to only those cgroups that still remain eligible for boost. 
RT tasks do not get force up migrated. 3) RESTRAINED : Start frequency aggregation for co-located tasks. This type of boost does not force up-migrate any task. Finally the boost API removes ref-counting. This means that there can only be a single entity using boost at any given time. If multiple entities are managing boost, they are required to be well behaved so that they don't interfere with one another. Even for a single client, it is not possible to switch directly from one boost type to another. Boost must be first turned off before switching over to a new type. Change-Id: I8d224a70cbef162f27078b62b73acaa22670861d Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> 2016-08-31 16:54:12 -07:00			`extern unsigned int update_freq_aggregate_threshold(unsigned int threshold);`
sched: Track burst length for tasks Track burst length for tasks as time they ran from wakeup to sleep. This is used to predict average time a task may run when it wakes up and thus avoid waking up idle cpu for "short-burst" tasks. Change-Id: Ie71d3163630fb8aa0db8ee8383768f8748270cf9 Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org> Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org> 2016-05-13 02:05:32 -07:00			`extern void update_avg_burst(struct task_struct *p);`
			`extern void update_avg(u64 *avg, u64 sample);`
sched: Move most HMP specific code to a separate file. Most code pertaining to CONFIG_SCHED_HMP has been moved to a separate file "hmp.c" in order to facilitate kernel upgrades. Fewer changes in the original scheduler files means fewer conflicts. Some parts of code, however, could not be moved to the separate file either because of dependencies with other non-HMP code or because the changes are specific only to the scheduling classes where the code resides. Change-Id: Ib067ac75e5a494008dcb3c67586b622c1b3962ce Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> 2016-08-01 17:48:21 -07:00
sched/hmp: Enhance co-location and scheduler boost features The recent introduction of the schedtune cgroup controller has provided the scheduler with added flexibility in terms of some of its placement features. In particular each cgroup under the schedtune controller can now specify: 1) Whether it needs co-location along with other cgroups 2) Whether it is eligible for scheduler boost (sched_boost_enabled) 3) Whether the kernel can override the boost eligibility when necessary (sched_boost_no_override) The scheduler now creates a reserved co-location group at boot. This group is used to co-locate all tasks that form part of any one of the cgroups that have co-location enabled. This reserved group can neither be destroyed nor reused for other purposes. Furthermore, cgroups are only allowed to indicate their co-location preference once at boot. Further updates are disallowed. Since we are now creating co-location groups for an extended period of time, there are a few other factors to consider when determining the preferred cluster for the group. We first exclude any tasks in the group that have not been observed to be running for a significant amount of time. Secondly we introduce the notion of group up and down migrate tunables to allow different migration policies than individual tasks. Lastly we break co-location if a single task in a group exceeds up-migrate but the total load of the group does not exceed group up-migrate. In terms of sched_boost, the scheduler now supports multiple types of boost. These are: 1) FULL_THROTTLE : Force up-migrate tasks belonging to any cgroup that has the sched_boost_enabled flag turned on. Little CPUs will only be used when big CPUs can no longer accommodate tasks. Also up-migrate all RT tasks. 2) CONSERVATIVE : Override the sched_boost_enabled flag for all cgroups except those that have the sched_boost_no_override flag set. Force up-migrate all tasks belonging to only those cgroups that still remain eligible for boost. 
RT tasks do not get force up migrated. 3) RESTRAINED : Start frequency aggregation for co-located tasks. This type of boost does not force up-migrate any task. Finally the boost API removes ref-counting. This means that there can only be a single entity using boost at any given time. If multiple entities are managing boost, they are required to be well behaved so that they don't interfere with one another. Even for a single client, it is not possible to switch directly from one boost type to another. Boost must be first turned off before switching over to a new type. Change-Id: I8d224a70cbef162f27078b62b73acaa22670861d Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> 2016-08-31 16:54:12 -07:00			`#define NO_BOOST 0`
			`#define FULL_THROTTLE_BOOST 1`
			`#define CONSERVATIVE_BOOST 2`
			`#define RESTRAINED_BOOST 3`
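The commit description for these boost types states that boost must be turned off before a different type can be selected. A minimal sketch of such a check follows. The function name `boost_transition_allowed` is hypothetical, and rejecting a repeated request for the state already in effect is an assumption the description does not spell out.

```c
#include <stdbool.h>

#define NO_BOOST		0
#define FULL_THROTTLE_BOOST	1
#define CONSERVATIVE_BOOST	2
#define RESTRAINED_BOOST	3

/*
 * Hypothetical helper: a transition is accepted only when it turns boost
 * on (NO_BOOST -> some type) or off (some type -> NO_BOOST). Switching
 * directly between two boost types is rejected. Re-requesting the current
 * state is also rejected here, which is an assumption of this sketch.
 */
static bool boost_transition_allowed(int cur, int next)
{
	return (cur == NO_BOOST) != (next == NO_BOOST);
}
```

Under these assumptions, moving from `FULL_THROTTLE_BOOST` to `RESTRAINED_BOOST` takes two steps: first to `NO_BOOST`, then to `RESTRAINED_BOOST`.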

sched: Update fair and rt placement logic to use scheduler clusters Make use of clusters in the fair and rt scheduling classes. This is needed as the freq domain mask can no longer be used to do correct task placement. The freq domain mask was being used to demarcate clusters. Change-Id: I57f74147c7006f22d6760256926c10fd0bf50cbd Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org> Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> [joonwoop@codeaurora.org: fixed merge conflicts due to omitted changes for CONFIG_SCHED_QHMP.] Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> 2015-04-22 17:12:09 +05:30			`static inline struct sched_cluster *cpu_cluster(int cpu)`
			`{`
			`return cpu_rq(cpu)->cluster;`
			`}`

sched: Introduce the concept CPU clusters in the scheduler A cluster is a set of CPUs sharing some power controls and an L2 cache. This patch builds a list of clusters at bootup which are sorted by their max_power_cost. Many cluster-shared attributes like cur_freq, max_freq etc are needlessly maintained in per-cpu 'struct rq' currently. Consolidate them in a cluster structure. Change-Id: I0567672ad5fb67d211d9336181ceb53b9f6023af Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org> Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> [joonwoop@codeaurora.org: fixed minor conflict in arch/arm64/kernel/topology.c. fixed conflict due to omitted changes for CONFIG_SCHED_QHMP.] Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> 2015-04-20 12:35:48 +05:30			`static inline int cpu_capacity(int cpu)`
			`{`
			`return cpu_rq(cpu)->cluster->capacity;`
			`}`

			`static inline int cpu_max_possible_capacity(int cpu)`
			`{`
			`return cpu_rq(cpu)->cluster->max_possible_capacity;`
			`}`

			`static inline int cpu_load_scale_factor(int cpu)`
			`{`
			`return cpu_rq(cpu)->cluster->load_scale_factor;`
			`}`

			`static inline int cpu_efficiency(int cpu)`
			`{`
			`return cpu_rq(cpu)->cluster->efficiency;`
			`}`

			`static inline unsigned int cpu_cur_freq(int cpu)`
			`{`
			`return cpu_rq(cpu)->cluster->cur_freq;`
			`}`

			`static inline unsigned int cpu_min_freq(int cpu)`
			`{`
			`return cpu_rq(cpu)->cluster->min_freq;`
			`}`

sched: take into account of limited CPU min and max frequencies Actual CPU's min and max frequencies can be limited by hardware components while governor's not aware of. Provide an API for them to notify for scheduler to be able to notice accurate currently operating frequency boundaries which helps better task placement decision. CRs-fixed: 1006303 Change-Id: I608f5fa8b0baff8d9e998731dcddec59c9073d20 Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> 2016-03-28 14:22:52 -07:00			`static inline unsigned int cluster_max_freq(struct sched_cluster *cluster)`
			`{`
			`/*`
			`* Governor and thermal driver don't know the other party's mitigation`
			`* voting. So struct cluster saves both and returns min() for current`
			`* cluster fmax.`
			`*/`
			`return min(cluster->max_mitigated_freq, cluster->max_freq);`
			`}`
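The effect of `cluster_max_freq()` can be illustrated outside the kernel with a small stand-in. This is only a sketch: `struct cluster_limits` and `effective_max_freq` are hypothetical names, and which party writes which field (governor for `max_freq`, thermal mitigation for `max_mitigated_freq`) is an assumption inferred from the field names and the comment above.

```c
/* Hypothetical stand-in for the two limits kept in the cluster structure. */
struct cluster_limits {
	unsigned int max_freq;			/* assumed: governor's limit, in kHz */
	unsigned int max_mitigated_freq;	/* assumed: thermal limit, in kHz */
};

/* The tighter of the two limits is the cluster's usable fmax. */
static unsigned int effective_max_freq(const struct cluster_limits *c)
{
	return c->max_mitigated_freq < c->max_freq ?
		c->max_mitigated_freq : c->max_freq;
}
```

Keeping both votes and taking the minimum on read means either party can relax its own limit later without needing to know what the other requested.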

sched: Introduce the concept CPU clusters in the scheduler A cluster is a set of CPUs sharing some power controls and an L2 cache. This patch builds a list of clusters at bootup which are sorted by their max_power_cost. Many cluster-shared attributes like cur_freq, max_freq etc are needlessly maintained in per-cpu 'struct rq' currently. Consolidate them in a cluster structure. Change-Id: I0567672ad5fb67d211d9336181ceb53b9f6023af Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org> Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> [joonwoop@codeaurora.org: fixed minor conflict in arch/arm64/kernel/topology.c. fixed conflict due to omitted changes for CONFIG_SCHED_QHMP.] Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> 2015-04-20 12:35:48 +05:30			`static inline unsigned int cpu_max_freq(int cpu)`
			`{`
sched: take into account of limited CPU min and max frequencies Actual CPU's min and max frequencies can be limited by hardware components while governor's not aware of. Provide an API for them to notify for scheduler to be able to notice accurate currently operating frequency boundaries which helps better task placement decision. CRs-fixed: 1006303 Change-Id: I608f5fa8b0baff8d9e998731dcddec59c9073d20 Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> 2016-03-28 14:22:52 -07:00			`return cluster_max_freq(cpu_rq(cpu)->cluster);`
sched: Introduce the concept CPU clusters in the scheduler A cluster is a set of CPUs sharing some power controls and an L2 cache. This patch builds a list of clusters at bootup which are sorted by their max_power_cost. Many cluster-shared attributes like cur_freq, max_freq etc are needlessly maintained in per-cpu 'struct rq' currently. Consolidate them in a cluster structure. Change-Id: I0567672ad5fb67d211d9336181ceb53b9f6023af Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org> Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> [joonwoop@codeaurora.org: fixed minor conflict in arch/arm64/kernel/topology.c. fixed conflict due to omitted changes for CONFIG_SCHED_QHMP.] Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> 2015-04-20 12:35:48 +05:30			`}`

			`static inline unsigned int cpu_max_possible_freq(int cpu)`
			`{`
			`return cpu_rq(cpu)->cluster->max_possible_freq;`
			`}`

			`static inline int same_cluster(int src_cpu, int dst_cpu)`
			`{`
			`return cpu_rq(src_cpu)->cluster == cpu_rq(dst_cpu)->cluster;`
			`}`

sched: Revise the inter cluster load balance restrictions The frequency based inter cluster load balance restrictions are not reliable as frequency does not provide a good estimate of the CPU's current load. Replace them with the spill_load and spill_nr_run based checks. The higher capacity cluster is restricted from pulling the tasks from the lower capacity cluster unless all of the lower capacity CPUs are above spill. This behavior can be controlled by a sysctl tunable and it is disabled by default (i.e. no load balance restrictions). Change-Id: I45c09c8adcb61a8a7d4e08beadf2f97f1805fb42 Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org> [joonwoop@codeaurora.org: fixed merge conflicts due to omitted changes for CONFIG_SCHED_QHMP.] 2015-12-04 06:34:03 +05:30			`static inline int cpu_max_power_cost(int cpu)`
			`{`
			`return cpu_rq(cpu)->cluster->max_power_cost;`
			`}`

sched: Avoid waking idle cpu for short-burst tasks Introduce sched_short_burst tunable to classify "short-burst" tasks. These tasks are eligible for packing to avoid overhead associated with waking up an idle CPU. select_best_cpu() ignores power-cost and selects the CPU with least wakeup latency which is not loaded with IRQs and can accommodate this task without exceeding spill limits. The ties are broken with load followed by previous CPU. This policy does not affect cluster selection but only CPU selection in the selected cluster. The tasks eligible for "wakeup-up-idle" and "boost" are not considered for packing. This policy is applied for both "fair" and "rt" scheduling class tasks. Change-Id: I2a05493fde93f58636725f18d0ce8dbce4418a30 Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org> Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org> 2016-09-09 19:38:03 +05:30			`static inline int cpu_min_power_cost(int cpu)`
			`{`
			`return cpu_rq(cpu)->cluster->min_power_cost;`
			`}`

sched: Fix possible overflow in cpu_cycles_to_freq() Truncating period to u32 could lead to incorrect results. Make it u64 instead. Change-Id: I5224a943e64bc6d64b6c8e614a01f798a6cdc796 Signed-off-by: Puja Gupta <pujag@codeaurora.org> Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org> 2017-11-16 13:39:33 -08:00			`static inline u32 cpu_cycles_to_freq(u64 cycles, u64 period)`
sched: add support for CPU frequency estimation with cycle counter At present scheduler calculates task's demand with the task's execution time weighted over CPU frequency. The CPU frequency is given by governor's CPU frequency transition notification. Such notification may not be available. Provide an API for CPU clock driver to register callback functions so in order for scheduler to access CPU's cycle counter to estimate CPU's frequency without notification. At time point scheduler assumes the cycle counter increases always even when cluster is idle which might not be true. This will be fixed by subsequent change for more accurate I/O wait time accounting. CRs-fixed: 1006303 Change-Id: I93b187efd7bc225db80da0184683694f5ab99738 Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> 2016-03-08 13:46:04 -08:00			`{`
sched: simplify CPU frequency estimation and cycle counter API Most of CPUs increase cycle counter by one every cycle which makes frequency = cycles / time_delta is correct. Therefore it's reasonable to get rid of current cpu_cycle_max_scale_factor and ask cycle counter read callback function to return scaled counter value when it's needed in such a case that cycle counter doesn't increase every cycle. Thus multiply NSEC_PER_SEC / HZ_PER_KHZ to CPU cycle counter delta as we calculate frequency in khz and remove cpu_cycle_max_scale_factor. This allows us to simplify frequency estimation and cycle counter API. Change-Id: Ie7a628d4bc77c9b6c769f6099ce8d75740262a14 Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> 2016-05-17 20:04:54 -07:00			`return div64_u64(cycles, period);`
sched: add support for CPU frequency estimation with cycle counter At present scheduler calculates task's demand with the task's execution time weighted over CPU frequency. The CPU frequency is given by governor's CPU frequency transition notification. Such notification may not be available. Provide an API for CPU clock driver to register callback functions so in order for scheduler to access CPU's cycle counter to estimate CPU's frequency without notification. At time point scheduler assumes the cycle counter increases always even when cluster is idle which might not be true. This will be fixed by subsequent change for more accurate I/O wait time accounting. CRs-fixed: 1006303 Change-Id: I93b187efd7bc225db80da0184683694f5ab99738 Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> 2016-03-08 13:46:04 -08:00			`}`
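The commit description for `cpu_cycles_to_freq()` says the cycle counter delta is multiplied by NSEC_PER_SEC / HZ_PER_KHZ so that dividing by a period in nanoseconds yields kHz. The arithmetic can be sketched in userspace as follows. The constants are defined locally, `scale_cycles_for_khz` and `cycles_to_freq_khz` are hypothetical names, plain division stands in for `div64_u64()`, and the zero-period guard is an addition for this sketch only.

```c
#include <stdint.h>

/* Assumed values, defined locally so the sketch builds outside the kernel. */
#define NSEC_PER_SEC	1000000000ULL
#define HZ_PER_KHZ	1000ULL

/*
 * Hypothetical helper: pre-scale a raw cycle delta so that a later
 * division by nanoseconds gives kHz. The multiplication overflows for
 * deltas above roughly 1.8e13 cycles.
 */
static uint64_t scale_cycles_for_khz(uint64_t cycle_delta)
{
	return cycle_delta * (NSEC_PER_SEC / HZ_PER_KHZ);
}

/* Userspace stand-in for cpu_cycles_to_freq(); both operands stay 64-bit. */
static uint32_t cycles_to_freq_khz(uint64_t scaled_cycles, uint64_t period_ns)
{
	if (period_ns == 0)
		return 0;	/* guard added for the sketch only */
	return (uint32_t)(scaled_cycles / period_ns);
}
```

For example, 2,000,000 cycles observed over 1,000,000 ns (1 ms) scale to 2 x 10^12, and dividing by the period gives 2,000,000 kHz, i.e. 2 GHz.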

sched: Revise the inter cluster load balance restrictions The frequency based inter cluster load balance restrictions are not reliable as frequency does not provide a good estimate of the CPU's current load. Replace them with the spill_load and spill_nr_run based checks. The higher capacity cluster is restricted from pulling the tasks from the lower capacity cluster unless all of the lower capacity CPUs are above spill. This behavior can be controlled by a sysctl tunable and it is disabled by default (i.e. no load balance restrictions). Change-Id: I45c09c8adcb61a8a7d4e08beadf2f97f1805fb42 Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org> [joonwoop@codeaurora.org: fixed merge conflicts due to omitted changes for CONFIG_SCHED_QHMP.] 2015-12-04 06:34:03 +05:30			`static inline bool hmp_capable(void)`
			`{`
			`return max_possible_capacity != min_max_possible_capacity;`
			`}`

core_ctl: un-isolate BIG CPUs more aggressively The current algorithm to bring additional BIG CPUs is very conservative. It works when BIG tasks alone run on BIG cluster. When co-location and scheduler boost features are activated, small/medium tasks also run on BIG cluster. We don't want these tasks to downmigrate, when BIG CPUs are available but isolated. The following changes are done to un-isolate CPUs more aggressively. (1) Round up the big_avg. When the big_avg indicates that there are 1.5 tasks on an average in the last window, it indicates that we need 2 BIG CPUs not 1 BIG CPU. (2) Track the maximum number of running tasks in the last window on all CPUs. If any of the CPU in a cluster has more than 4 runnable tasks in the last window, bring an additional CPU to help out. Change-Id: Id05d9983af290760cec6d93d1bdc45bc5e924cce Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org> 2017-05-10 15:43:29 +05:30			`static inline bool is_max_capacity_cpu(int cpu)`
			`{`
			`return cpu_max_possible_capacity(cpu) == max_possible_capacity;`
			`}`

sched: add sched_get_cpu_last_busy_time() API sched_get_cpu_last_busy_time() returns the last time stamp when a given CPU is busy with more than 2 runnable tasks or has load greater than 50% of its max capacity. The LPM driver can make use of this API and create a policy to prevent a recently loaded CPU entering deep sleep state. This API is implemented only for the higher capacity CPUs in the system. It returns 0 for other CPUs. Change-Id: I97ef47970a71647f4f55f21165d0cc1351770a53 Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org> 2018-02-09 13:53:04 +05:30			`static inline bool is_min_capacity_cpu(int cpu)`
			`{`
			`return cpu_max_possible_capacity(cpu) == min_max_possible_capacity;`
			`}`
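The three capacity predicates above compare a CPU's `max_possible_capacity` against two system-wide values. A minimal userspace sketch follows, assuming `max_possible_capacity` and `min_max_possible_capacity` are the largest and smallest per-cluster maximum capacities. That derivation is not shown in this header, and every name in the block is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of the two system-wide capacity bounds. */
struct capacity_summary {
	int max_possible_capacity;	/* largest per-cluster maximum */
	int min_max_possible_capacity;	/* smallest per-cluster maximum */
};

/* Assumed derivation: scan each cluster's maximum possible capacity. */
static struct capacity_summary summarize_capacity(const int *cap, size_t n)
{
	struct capacity_summary s = { 0, 0 };
	size_t i;

	if (n == 0)
		return s;
	s.max_possible_capacity = cap[0];
	s.min_max_possible_capacity = cap[0];
	for (i = 1; i < n; i++) {
		if (cap[i] > s.max_possible_capacity)
			s.max_possible_capacity = cap[i];
		if (cap[i] < s.min_max_possible_capacity)
			s.min_max_possible_capacity = cap[i];
	}
	return s;
}

/* Mirrors hmp_capable(): the system is heterogeneous if the bounds differ. */
static bool hmp_capable_sketch(const struct capacity_summary *s)
{
	return s->max_possible_capacity != s->min_max_possible_capacity;
}

/* Mirrors is_max_capacity_cpu() for a given per-CPU capacity value. */
static bool is_max_capacity_sketch(const struct capacity_summary *s, int cap)
{
	return cap == s->max_possible_capacity;
}
```

On a system where every cluster has the same maximum capacity, the two bounds coincide, so `hmp_capable_sketch()` is false and every CPU is both a "max" and a "min" capacity CPU.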

sched: inline function scale_load_to_cpu() Inline relatively small and frequently used function scale_load_to_cpu(). CRs-fixed: 849655 Change-Id: Id5f60595c394959d78e6da4cc4c18c338fec285b Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> 2015-06-10 14:57:52 -07:00			`/*`
			`* 'load' is in reference to "best cpu" at its best frequency.`
			`* Scale that in reference to a given cpu, accounting for how bad it is`
			`* in reference to "best cpu".`
			`*/`
			`static inline u64 scale_load_to_cpu(u64 task_load, int cpu)`
			`{`
sched: Introduce the concept CPU clusters in the scheduler A cluster is a set of CPUs sharing some power controls and an L2 cache. This patch builds a list of clusters at bootup which are sorted by their max_power_cost. Many cluster-shared attributes like cur_freq, max_freq etc are needlessly maintained in per-cpu 'struct rq' currently. Consolidate them in a cluster structure. Change-Id: I0567672ad5fb67d211d9336181ceb53b9f6023af Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org> Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> [joonwoop@codeaurora.org: fixed minor conflict in arch/arm64/kernel/topology.c. fixed conflict due to omitted changes for CONFIG_SCHED_QHMP.] Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> 2015-04-20 12:35:48 +05:30			`u64 lsf = cpu_load_scale_factor(cpu);`
sched: inline function scale_load_to_cpu() Inline relatively small and frequently used function scale_load_to_cpu(). CRs-fixed: 849655 Change-Id: Id5f60595c394959d78e6da4cc4c18c338fec285b Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> 2015-06-10 14:57:52 -07:00
sched: Introduce the concept CPU clusters in the scheduler A cluster is a set of CPUs sharing some power controls and an L2 cache. This patch builds a list of clusters at bootup which are sorted by their max_power_cost. Many cluster-shared attributes like cur_freq, max_freq etc are needlessly maintained in per-cpu 'struct rq' currently. Consolidate them in a cluster structure. Change-Id: I0567672ad5fb67d211d9336181ceb53b9f6023af Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org> Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> [joonwoop@codeaurora.org: fixed minor conflict in arch/arm64/kernel/topology.c. fixed conflict due to omitted changes for CONFIG_SCHED_QHMP.] Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> 2015-04-20 12:35:48 +05:30			`if (lsf != 1024) {`
			`task_load *= lsf;`
sched: avoid unnecessary multiplication and division Avoid unnecessary multiplication and division when load scaling factor is 1024. Change-Id: If3cb63a77feaf49cc69ddec7f41cc3c1cabbfc5a Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> 2015-08-24 15:14:44 -07:00			`task_load /= 1024;`
			`}`
sched: inline function scale_load_to_cpu() Inline relatively small and frequently used function scale_load_to_cpu(). CRs-fixed: 849655 Change-Id: Id5f60595c394959d78e6da4cc4c18c338fec285b Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> 2015-06-10 14:57:52 -07:00
			`return task_load;`
			`}`
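The scaling in `scale_load_to_cpu()` is a fixed-point multiply in which 1024 represents a factor of 1.0. The following userspace sketch takes the load scale factor as a parameter instead of looking it up per CPU. `scale_load` and `LSF_UNITY` are hypothetical names introduced for illustration.

```c
#include <stdint.h>

#define LSF_UNITY 1024	/* hypothetical name: factor at which no scaling occurs */

/*
 * Scale a load expressed relative to the best CPU onto a CPU whose load
 * scale factor is lsf. The multiply and divide are skipped when lsf is
 * exactly 1024, as in the header above.
 */
static uint64_t scale_load(uint64_t task_load, uint64_t lsf)
{
	if (lsf != LSF_UNITY) {
		task_load *= lsf;
		task_load /= LSF_UNITY;
	}
	return task_load;
}
```

With a factor of 2048, a load of 1000 becomes 2000, meaning the same work occupies twice as much of the weaker CPU. The division truncates, so small loads lose their fractional part.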

sched: clean up fixup_hmp_sched_stats() The commit 392edf4969d20 ("sched: avoid stale cumulative_runnable_avg HMP statistics) introduced the callback function fixup_hmp_sched_stats() so update_history() can avoid decrement and increment pair of HMP stat. However the commit also made fixup function to do obscure p->ravg.demand update which isn't the cleanest way. Revise the function fixup_hmp_sched_stats() so the caller can update p->ravg.demand directly. Change-Id: Id54667d306495d2109c26362813f80f08a1385ad [joonwoop@codeaurora.org: stripped out CONFIG_SCHED_QHMP.] Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> 2015-07-30 10:44:13 -07:00			`static inline unsigned int task_load(struct task_struct *p)`
			`{`
			`return p->ravg.demand;`
			`}`
sched: Make RT tasks eligible for boost During sched boost RT tasks currently end up going to the lowest power cluster. This can be a performance bottleneck especially if the frequency and IPC differences between clusters are high. Furthermore, when RT tasks go over to the little cluster during boost, the load balancer keeps attempting to pull work over to the big cluster. This results in pre-emption of the executing RT task causing more delays. Finally, containing more work on a single cluster during boost might help save some power if the little cluster can then enter deeper low power modes. Change-Id: I177b2e81be5657c23e7ac43889472561ce9993a9 Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> 2014-12-03 10:18:12 -08:00
sched: Window-based load stat improvements Some tasks can have a sporadic load pattern such that they can suddenly start running for longer intervals of time after running for shorter durations. To recognize such sharp increase in tasks' demands, max between the average of 5 window load samples and the most recent sample is chosen as the task demand. Make the window size (sched_ravg_window) configurable at boot up time. To prevent users from setting inappropriate values for window size, min and max limits are defined. As 'ravg' struct tracks load for both real-time and non real-time tasks it is moved out of sched_entity struct. In order to prevent changing function signatures for move_tasks() and move_one_task() per-cpu variables are defined to track the total load moved. In case multiple tasks are selected to migrate in one load balance operation, loads > 100 could be sent through migration notifiers. Prevent this scenario by setting mnd.load to 100 in such cases. Define wrapper functions to compute cpu demands for tasks and to change rq->cumulative_runnable_avg. Change-Id: I9abfbf3b5fe23ae615a6acd3db9580cfdeb515b4 Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org> Signed-off-by: Rohit Gupta <rohgup@codeaurora.org> [rameezmustafa@codeaurora.org: Port to msm-3.18 and squash "dcf7256 sched: window-stats: Fix overflow bug" into this patch.] Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> [joonwoop@codeaurora.org: fixed conflict in __migrate_task().] Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> 2014-03-29 11:40:16 -07:00			`static inline void`
sched: Consolidate hmp stats into their own struct Key hmp stats (nr_big_tasks, nr_small_tasks and cumulative_runnable_average) are currently maintained per-cpu in 'struct rq'. Merge those stats in their own structure (struct hmp_sched_stats) and modify impacted functions to deal with the newly introduced structure. This cleanup is required for a subsequent patch which fixes various issues with use of CFS_BANDWIDTH feature in HMP scheduler. Change-Id: Ieffc10a3b82a102f561331bc385d042c15a33998 Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org> [rameezmustafa@codeaurora.org: Port to msm-3.18] Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> [joonwoop@codeaurora.org: fixed conflict in __update_load_avg().] Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> 2015-01-16 11:27:31 +05:30			`inc_cumulative_runnable_avg(struct hmp_sched_stats *stats,`
			`struct task_struct *p)`
			`{`
			`u32 task_load;`

sched: Remove sched_enable_hmp flag Clean up the code and make it more maintainable by removing dependency on the sched_enable_hmp flag. We do not support HMP scheduler without recompiling. Enabling the HMP scheduler is done through enabling the CONFIG_SCHED_HMP config. Change-Id: I246c1b1889f8dcbc8f0a0805077c0ce5d4f083b0 Signed-off-by: Olav Haugan <ohaugan@codeaurora.org> 2017-02-01 17:59:51 -08:00			`if (sched_disable_window_stats)`
			`return;`

sched: Remove unused PELT extensions for HMP scheduling PELT extensions for HMP have never been used since the early days of the HMP scheduler. Furthermore, changes to PELT itself in newer kernel versions render some of the code redundant or incorrect. These extensions have not been tested for a long time and are practically dead code. Remove it so that future upgrades become easier. Change-Id: I029f327406ca00b2370c93134158b61dda3b81e3 Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> 2016-07-28 10:53:01 -07:00			`task_load = sched_disable_window_stats ? 0 : p->ravg.demand;`
			`stats->cumulative_runnable_avg += task_load;`
sched: Remove all existence of CONFIG_SCHED_FREQ_INPUT CONFIG_SCHED_FREQ_INPUT was created to keep parts of the scheduler dealing with frequency separate from other parts of the scheduler that deal with task placement. However, over time the two features have become intricately linked whereby SCHED_FREQ_INPUT cannot be turned on without having SCHED_HMP turned on as well. Given this complex inter-dependency and the fact that all old, existing and future targets use both config options, remove this unnecessary feature separation. It will aid in making kernel upgrades a lot simpler and faster. Change-Id: Ia20e40d8a088d50909cc28f5be758fa3e9a4af6f Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> 2016-07-28 19:18:08 -07:00			`stats->pred_demands_sum += p->ravg.pred_demand;`
			`}`
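The "Window-based load stat improvements" commit message says a task's demand is the maximum of the average of 5 window load samples and the most recent sample. The sketch below illustrates that rule only; the function name, the array layout (most recent sample at index 0) and the sample type are assumptions, not the kernel's actual history code.

```c
#include <assert.h>
#include <stdint.h>

#define HIST_SIZE 5	/* the commit message speaks of 5 window samples */

/*
 * Hypothetical helper: hist[0] is assumed to be the most recent window.
 * Returns max(average of all samples, most recent sample).
 */
static uint32_t demand_from_history(const uint32_t hist[HIST_SIZE])
{
	uint64_t sum = 0;
	uint32_t avg;
	int i;

	for (i = 0; i < HIST_SIZE; i++)
		sum += hist[i];
	avg = (uint32_t)(sum / HIST_SIZE);

	/* A sudden burst in the latest window wins over a low average. */
	return avg > hist[0] ? avg : hist[0];
}
```

Taking the maximum lets a task that suddenly starts running longer be recognized at once, while a task that briefly goes quiet keeps its averaged demand.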

			`static inline void`
			`dec_cumulative_runnable_avg(struct hmp_sched_stats *stats,`
			`struct task_struct *p)`
			`{`
			`u32 task_load;`

			`if (sched_disable_window_stats)`
			`return;`

			`task_load = sched_disable_window_stats ? 0 : p->ravg.demand;`
			`stats->cumulative_runnable_avg -= task_load;`

			`BUG_ON((s64)stats->cumulative_runnable_avg < 0);`
sched: Add separate load tracking histogram to predict loads Current window based load tracking only saves history for five windows. A historically heavy task's heavy load will be completely forgotten after five windows of light load. Even before the five window expires, a heavy task wakes up on same CPU it used to run won't trigger any frequency change until end of the window. It would starve for the entire window. It also adds one "small" load window to history because it's accumulating load at a low frequency, further reducing the tracked load for this heavy task. Ideally, scheduler should be able to identify such tasks and notify governor to increase frequency immediately after it wakes up. Add a histogram for each task to track a much longer load history. A prediction will be made based on runtime of previous or current window, histogram data and load tracked in recent windows. Prediction of all tasks that is currently running or runnable on a CPU is aggregated and reported to CPUFreq governor in sched_get_cpus_busy(). sched_get_cpus_busy() now returns predicted busy time in addition to previous window busy time and new task busy time, scaled to the CPU maximum possible frequency. Tunables: - /proc/sys/kernel/sched_gov_alert_freq (KHz) This tunable can be used to further filter the notifications. Frequency alert notification is sent only when the predicted load exceeds previous window load by sched_gov_alert_freq converted to load. Change-Id: If29098cd2c5499163ceaff18668639db76ee8504 Suggested-by: Saravana Kannan <skannan@codeaurora.org> Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org> Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> Signed-off-by: Junjie Wu <junjiew@codeaurora.org> [joonwoop@codeaurora.org: fixed merge conflicts around __migrate_task() and removed changes for CONFIG_SCHED_QHMP.] 2015-06-08 09:08:47 +05:30
			`stats->pred_demands_sum -= p->ravg.pred_demand;`
			`BUG_ON((s64)stats->pred_demands_sum < 0);`
			`}`
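The `BUG_ON((s64)... < 0)` checks above catch an unsigned sum that was decremented below zero by reinterpreting it as signed. A small sketch of the idea, using a hypothetical helper name and standard C types in place of the kernel's `u64`/`s64`:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: reports whether subtracting 'load' from an unsigned
 * running sum wrapped around. Unsigned subtraction wraps modulo 2^64, so a
 * small negative result shows up as a huge value with the top bit set; the
 * signed cast turns that into a negative number. This assumes a
 * two's-complement target, which holds for mainstream compilers.
 */
static bool underflowed_after_sub(uint64_t sum, uint64_t load)
{
	sum -= load;
	return (int64_t)sum < 0;
}
```

The same test would also fire for a legitimate sum of 2^63 or more, so it relies on real sums staying far below that bound.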

sched: avoid stale cumulative_runnable_avg HMP statistics When a new window starts for a task and the task is on a rq, scheduler decreases rq's cumulative_runnable_avg momentarily, re-account task's demand and increases rq's cumulative_runnable_avg with newly accounted task's demand. Therefore there is short time period that rq's cumulative_runnable_avg is less than what it's supposed to be. Meanwhile, there is chance that other CPU is in search of best CPU to place a task and makes suboptimal decision with momentarily stale cumulative_runnable_avg. Fix such issue by adding or subtracting of delta between task's old and new demand instead of decrementing and incrementing of entire task's load. Change-Id: I3c9329961e6f96e269fa13359e7d1c39c4973ff2 Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> 2015-07-13 21:04:18 -07:00			`static inline void`
			`fixup_cumulative_runnable_avg(struct hmp_sched_stats *stats,`
			`struct task_struct *p, s64 task_load_delta,`
			`s64 pred_demand_delta)`
			`{`
			`if (sched_disable_window_stats)`
			`return;`

			`stats->cumulative_runnable_avg += task_load_delta;`
			`BUG_ON((s64)stats->cumulative_runnable_avg < 0);`
			`stats->pred_demands_sum += pred_demand_delta;`
			`BUG_ON((s64)stats->pred_demands_sum < 0);`
			`}`
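The "avoid stale cumulative_runnable_avg" commit message explains why the sum is adjusted by a delta instead of a decrement followed by an increment: the sum never passes through a state in which the task's load is missing. The sketch below shows the delta form with a hypothetical struct and function name; it is not the kernel's `struct hmp_sched_stats`.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, reduced stand-in for the per-runqueue statistics. */
struct stats_sketch {
	uint64_t cumulative_runnable_avg;
};

/*
 * Apply the change in a task's demand as one signed delta. Adding a
 * negative delta converted to unsigned wraps modulo 2^64 and therefore
 * yields the same result as a subtraction.
 */
static void fixup_sketch(struct stats_sketch *stats,
			 uint32_t old_demand, uint32_t new_demand)
{
	int64_t delta = (int64_t)new_demand - (int64_t)old_demand;

	stats->cumulative_runnable_avg += (uint64_t)delta;
	/* Same sanity check as the BUG_ON() in the code above. */
	assert((int64_t)stats->cumulative_runnable_avg >= 0);
}
```

A decrement and increment pair would reach the same final value, but another CPU reading the sum between the two steps would see it too low by the task's whole demand.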

sched: per-cpu mostly_idle threshold sched_mostly_idle_load and sched_mostly_idle_nr_run knobs help pack tasks on cpus to some extent. In some cases, it may be desirable to have different packing limits for different cpus. For example, pack to a higher limit on high-performance cpus compared to power-efficient cpus. This patch removes the global mostly_idle tunables and makes them per-cpu, thus letting task packing behavior to be controlled in a fine-grained manner. Change-Id: Ifc254cda34b928eae9d6c342ce4c0f64e531e6c2 Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org> 2014-11-04 15:25:50 +05:30			`#define pct_to_real(tunable) \`
			`(div64_u64((u64)tunable * (u64)max_task_load(), 100))`

			`#define real_to_pct(tunable) \`
			`(div64_u64((u64)tunable * (u64)100, (u64)max_task_load()))`
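The two macros above convert a tunable between a percentage and an absolute load value. A minimal userspace sketch of the same arithmetic follows; `demo_max_task_load()` and its 20 ms value are assumptions made for illustration and are not taken from this file, and plain 64-bit division stands in for `div64_u64()`.

```c
#include <stdint.h>

/*
 * Hypothetical stand-in for max_task_load(). The value (20 ms expressed
 * in nanoseconds) is an assumption chosen only to make the arithmetic
 * easy to follow.
 */
static uint64_t demo_max_task_load(void)
{
	return 20000000ULL;
}

/* Percentage of the maximum task load -> absolute load. */
static uint64_t demo_pct_to_real(uint64_t pct)
{
	return pct * demo_max_task_load() / 100;
}

/* Absolute load -> percentage of the maximum task load. */
static uint64_t demo_real_to_pct(uint64_t real)
{
	return real * 100 / demo_max_task_load();
}
```

Both directions truncate toward zero, so a round trip is only exact when the percentage maps to a whole number of load units.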

sched: avoid CPUs with high irq activity CPUs with significant IRQ activity will not be able to serve tasks quickly. Avoid them if possible by disqualifying such CPUs from being recognized as mostly idle. Change-Id: I2c09272a4f259f0283b272455147d288fce11982 Signed-off-by: Steve Muckle <smuckle@codeaurora.org> 2014-11-13 14:58:10 -08:00			`#define SCHED_HIGH_IRQ_TIMEOUT 3`
			`static inline u64 sched_irqload(int cpu)`
			`{`
			`struct rq *rq = cpu_rq(cpu);`
			`s64 delta;`

			`delta = get_jiffies_64() - rq->irqload_ts;`
sched: take account of irq preemption when calculating irqload delta If irq raises while sched_irqload() is calculating irqload delta, sched_account_irqtime() can update rq's irqload_ts which can be greater than the jiffies stored in sched_irqload()'s context so delta can be negative. This negative delta means there was recent irq occurrence. So remove improper BUG_ON(). CRs-fixed: 771894 Change-Id: I5bb01b50ec84c14bf9f26dd9c95de82ec2cd19b5 Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> 2014-12-16 14:44:09 -08:00			`/*`
			`* Current context can be preempted by irq and rq->irqload_ts can be`
			`* updated by irq context so that delta can be negative.`
			`* But this is okay and we can safely return as this means there`
			`* was recent irq occurrence.`
			`*/`
sched: avoid CPUs with high irq activity CPUs with significant IRQ activity will not be able to serve tasks quickly. Avoid them if possible by disqualifying such CPUs from being recognized as mostly idle. Change-Id: I2c09272a4f259f0283b272455147d288fce11982 Signed-off-by: Steve Muckle <smuckle@codeaurora.org> 2014-11-13 14:58:10 -08:00
			`if (delta < SCHED_HIGH_IRQ_TIMEOUT)`
			`return rq->avg_irqload;`
			`else`
			`return 0;`
			`}`
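The signed-delta check in `sched_irqload()` can be tried outside the kernel. The sketch below is an illustration under assumptions: `struct demo_rq`, `demo_irqload()` and the explicit `now` parameter are hypothetical names that replace `struct rq`, `cpu_rq()` and `get_jiffies_64()`.

```c
#include <stdint.h>

#define DEMO_HIGH_IRQ_TIMEOUT 3	/* mirrors SCHED_HIGH_IRQ_TIMEOUT */

/* Hypothetical, reduced stand-in for the two struct rq fields used. */
struct demo_rq {
	uint64_t irqload_ts;	/* tick at which IRQ load was last updated */
	uint64_t avg_irqload;	/* averaged IRQ load at that time */
};

/*
 * Report the averaged IRQ load only while it is recent. The delta is
 * signed on purpose: if irqload_ts was advanced past the caller's
 * snapshot of "now", the delta is negative, which still counts as
 * recent IRQ activity.
 */
static uint64_t demo_irqload(const struct demo_rq *rq, uint64_t now)
{
	int64_t delta = (int64_t)(now - rq->irqload_ts);

	if (delta < DEMO_HIGH_IRQ_TIMEOUT)
		return rq->avg_irqload;
	return 0;
}
```

With `irqload_ts` at tick 100, a query at tick 101 or at tick 99 returns the stored average, while a query at tick 103 returns 0.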

			`static inline int sched_cpu_high_irqload(int cpu)`
			`{`
sched: make sched_cpu_high_irqload a runtime tunable It may be desirable to be able to alter the sched_cpu_high_irqload setting easily, so make it a runtime tunable value. Change-Id: I832030eec2aafa101f0f435a4fd2d401d447880d Signed-off-by: Steve Muckle <smuckle@codeaurora.org> 2014-11-30 16:26:55 -08:00			`return sched_irqload(cpu) >= sysctl_sched_cpu_high_irqload;`
sched: avoid CPUs with high irq activity CPUs with significant IRQ activity will not be able to serve tasks quickly. Avoid them if possible by disqualifying such CPUs from being recognized as mostly idle. Change-Id: I2c09272a4f259f0283b272455147d288fce11982 Signed-off-by: Steve Muckle <smuckle@codeaurora.org> 2014-11-13 14:58:10 -08:00			`}`

sched: Fix compile issues for !CONFIG_SCHED_HMP Fix compile issues observed when CONFIG_SCHED_HMP is not turned on. There are still targets that may want that config option turned off. Change-Id: I29e69356da8d003d13d8cd3927a0b166cc1ef95e Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> 2016-07-29 15:56:29 -07:00			`static inline bool task_in_related_thread_group(struct task_struct *p)`
			`{`
			`return !!(rcu_access_pointer(p->grp) != NULL);`
			`}`

sched: colocate related threads Provide userspace interface for tasks to be grouped together as "related" threads. For example, all threads involved in updating display buffer could be tagged as related. Scheduler will attempt to provide special treatment for group of related threads such as: 1) Colocation of related threads in same "preferred" cluster 2) Aggregation of demand towards determination of cluster frequency This patch extends scheduler to provide best-effort colocation support for a group of related threads. Change-Id: Ic2cd769faf5da4d03a8f3cb0ada6224d0101a5f5 Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org> [joonwoop@codeaurora.org: fixed minor merge conflicts. removed ifdefry for CONFIG_SCHED_QHMP.] Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> 2015-04-24 15:44:31 +05:30			`static inline`
			`struct related_thread_group *task_related_thread_group(struct task_struct *p)`
			`{`
sched/core: Add protection against null-pointer dereference p->grp is being accessed outside of lock which can cause null-pointer dereference. Fix this and also add rcu critical section around access of this data structure. CRs-fixed: 985379 Change-Id: Ic82de6ae2821845d704f0ec18046cc6a24f98e39 Signed-off-by: Olav Haugan <ohaugan@codeaurora.org> [joonwoop@codeaurora.org: fixed conflict in init_new_task_load().] Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> 2016-03-05 13:47:52 -08:00			`return rcu_dereference(p->grp);`
sched: colocate related threads Provide userspace interface for tasks to be grouped together as "related" threads. For example, all threads involved in updating display buffer could be tagged as related. Scheduler will attempt to provide special treatment for group of related threads such as: 1) Colocation of related threads in same "preferred" cluster 2) Aggregation of demand towards determination of cluster frequency This patch extends scheduler to provide best-effort colocation support for a group of related threads. Change-Id: Ic2cd769faf5da4d03a8f3cb0ada6224d0101a5f5 Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org> [joonwoop@codeaurora.org: fixed minor merge conflicts. removed ifdefry for CONFIG_SCHED_QHMP.] Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> 2015-04-24 15:44:31 +05:30			`}`

sched: Add separate load tracking histogram to predict loads Current window based load tracking only saves history for five windows. A historically heavy task's heavy load will be completely forgotten after five windows of light load. Even before the five window expires, a heavy task wakes up on same CPU it used to run won't trigger any frequency change until end of the window. It would starve for the entire window. It also adds one "small" load window to history because it's accumulating load at a low frequency, further reducing the tracked load for this heavy task. Ideally, scheduler should be able to identify such tasks and notify governor to increase frequency immediately after it wakes up. Add a histogram for each task to track a much longer load history. A prediction will be made based on runtime of previous or current window, histogram data and load tracked in recent windows. Prediction of all tasks that is currently running or runnable on a CPU is aggregated and reported to CPUFreq governor in sched_get_cpus_busy(). sched_get_cpus_busy() now returns predicted busy time in addition to previous window busy time and new task busy time, scaled to the CPU maximum possible frequency. Tunables: - /proc/sys/kernel/sched_gov_alert_freq (KHz) This tunable can be used to further filter the notifications. Frequency alert notification is sent only when the predicted load exceeds previous window load by sched_gov_alert_freq converted to load. Change-Id: If29098cd2c5499163ceaff18668639db76ee8504 Suggested-by: Saravana Kannan <skannan@codeaurora.org> Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org> Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> Signed-off-by: Junjie Wu <junjiew@codeaurora.org> [joonwoop@codeaurora.org: fixed merge conflicts around __migrate_task() and removed changes for CONFIG_SCHED_QHMP.] 2015-06-08 09:08:47 +05:30			`#define PRED_DEMAND_DELTA ((s64)new_pred_demand - p->ravg.pred_demand)`
sched: colocate related threads Provide userspace interface for tasks to be grouped together as "related" threads. For example, all threads involved in updating display buffer could be tagged as related. Scheduler will attempt to provide special treatment for group of related threads such as: 1) Colocation of related threads in same "preferred" cluster 2) Aggregation of demand towards determination of cluster frequency This patch extends scheduler to provide best-effort colocation support for a group of related threads. Change-Id: Ic2cd769faf5da4d03a8f3cb0ada6224d0101a5f5 Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org> [joonwoop@codeaurora.org: fixed minor merge conflicts. removed ifdefry for CONFIG_SCHED_QHMP.] Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> 2015-04-24 15:44:31 +05:30
sched: Aggregate for frequency Related threads in a group could execute on different CPUs and hence present a split-demand picture to cpufreq governor. IOW the governor fails to see the net cpu demand of all related threads in a given window if the threads's execution were to be split across CPUs. That could result in sub-optimal frequency chosen in comparison to the ideal frequency at which the aggregate work (taken up by related threads) needs to be run. This patch aggregates cpu execution stats in a window for all related threads in a group. This helps present cpu busy time to governor as if all related threads were part of the same thread and thus help select the right frequency required by related threads. This aggregation is done per-cluster. Change-Id: I71e6047620066323721c6d542034ddd4b2950e7f Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org> Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> [joonwoop@codeaurora.org: Fixed notify_migration() to hold rcu read lock as this version of Linux doesn't hold p->pi_lock when the function gets called while keeping use of rcu_access_pointer() since we never dereference return value.] Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> 2015-05-12 15:01:15 +05:30			`extern void`
			`check_for_freq_change(struct rq *rq, bool check_pred, bool check_groups);`

sched: Move notify_migration() under CONFIG_SCHED_HMP notify_migration() is a HMP specific function that relies on all of its contents to be stubbed out for !CONFIG_SCHED_HMP. However, it still maintains calls to rcu_read_lock/unlock(). In the !HMP case these calls are simply redundant. Move the function under CONFIG_SCHED_HMP and add a stub when the config is not defined so that there is no overhead. Change-Id: Iad914f31b629e81e403b0e89796b2b0f1d081695 Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> 2016-08-02 15:08:13 -07:00			`extern void notify_migration(int src_cpu, int dest_cpu,`
			`bool src_cpu_dead, struct task_struct *p);`

sched: improve logic for alerting governor Currently we send notification to governor not taking note of cpus that are synchronized with regard to their frequency. As a result, scheduler could send pointless notifications (notification spam!). Avoid this by considering synchronized cpus and alerting governor only when the highest demand of any cpu within cluster far exceeds or falls behind current frequency. Change-Id: I74908b5a212404ca56b38eb94548f9b1fbcca33d Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org> 2014-08-14 22:01:57 +05:30			`/* Is frequency of two cpus synchronized with each other? */`
			`static inline int same_freq_domain(int src_cpu, int dst_cpu)`
			`{`
			`struct rq *rq = cpu_rq(src_cpu);`

			`if (src_cpu == dst_cpu)`
			`return 1;`

			`return cpumask_test_cpu(dst_cpu, &rq->freq_domain_cpumask);`
			`}`

sched: trigger immediate migration of tasks upon boost Currently turning on boost does not immediately trigger migration of tasks from lower capacity cpus. Tasks could incur migration latency of up to one timer tick (when check_for_migration() is run). Fix this by triggering a migration check on cpus with lower capacity as soon as boost is turned on for first time. Change-Id: I244649f9cb6608862d87631325967b887b7f4b7e Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org> [rameezmustafa@codeaurora.org: Port to msm-3.18] Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> 2014-07-24 06:40:30 -07:00			`#define BOOST_KICK 0`
sched: Fix herding issue check_for_migration() could run concurrently on multiple cpus, resulting in multiple tasks wanting to migrate to same cpu. This could cause cpus to be underutilized and lead to increased scheduling latencies for tasks. Fix this by serializing select_best_cpu() calls from cpus running check_for_migration() check and marking selected cpus as reserved, so that subsequent call to select_best_cpu() from check_for_migration() will skip reserved cpus. Change-Id: I73a22cacab32dee3c14267a98b700f572aa3900c Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org> [rameezmustafa@codeaurora.org: Port to msm-3.18] Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> 2014-07-25 08:04:27 -07:00			`#define CPU_RESERVED 1`

			`static inline int is_reserved(int cpu)`
			`{`
			`struct rq *rq = cpu_rq(cpu);`

			`return test_bit(CPU_RESERVED, &rq->hmp_flags);`
			`}`

			`static inline int mark_reserved(int cpu)`
			`{`
			`struct rq *rq = cpu_rq(cpu);`

			`/* Name boost_flags as hmp_flags? */`
			`return test_and_set_bit(CPU_RESERVED, &rq->hmp_flags);`
			`}`

			`static inline void clear_reserved(int cpu)`
			`{`
			`struct rq *rq = cpu_rq(cpu);`

			`clear_bit(CPU_RESERVED, &rq->hmp_flags);`
			`}`
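The reservation helpers above rely on an atomic test-and-set so that only one caller can claim a CPU. A rough userspace analogue using C11 atomics is sketched below; the `demo_*` names are hypothetical, and `atomic_fetch_or()` is swapped in for the kernel's `test_and_set_bit()`.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define DEMO_CPU_RESERVED 1	/* bit index, mirrors CPU_RESERVED */
#define DEMO_RESERVED_MASK (1UL << DEMO_CPU_RESERVED)

/* Analogue of is_reserved(): read the flag without changing it. */
static bool demo_is_reserved(atomic_ulong *flags)
{
	return (atomic_load(flags) & DEMO_RESERVED_MASK) != 0;
}

/*
 * Analogue of mark_reserved(): set the bit atomically and report whether
 * it was already set. A false return means this caller won the
 * reservation.
 */
static bool demo_mark_reserved(atomic_ulong *flags)
{
	return (atomic_fetch_or(flags, DEMO_RESERVED_MASK) &
		DEMO_RESERVED_MASK) != 0;
}

/* Analogue of clear_reserved(): release the reservation. */
static void demo_clear_reserved(atomic_ulong *flags)
{
	atomic_fetch_and(flags, ~DEMO_RESERVED_MASK);
}
```

Because the old value is returned by the same atomic operation that sets the bit, two concurrent callers cannot both observe "not previously reserved", which is the property the herding fix depends on.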
sched: trigger immediate migration of tasks upon boost Currently turning on boost does not immediately trigger migration of tasks from lower capacity cpus. Tasks could incur migration latency of up to one timer tick (when check_for_migration() is run). Fix this by triggering a migration check on cpus with lower capacity as soon as boost is turned on for first time. Change-Id: I244649f9cb6608862d87631325967b887b7f4b7e Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org> [rameezmustafa@codeaurora.org: Port to msm-3.18] Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> 2014-07-24 06:40:30 -07:00
sched: precompute required frequency for CPU load At present in order to estimate power cost of CPU load, HMP scheduler converts CPU load to corresponding frequency on the fly which can be avoided. Optimize and reduce execution time of select_best_cpu() by precomputing CPU load to frequency conversion. This optimization reduces about ~20% of execution time of select_best_cpu() on average. Change-Id: I385c57f2ea9a50883b76ba6ca3deb673b827217f [joonwoop@codeaurora.org: fixed minor conflict in kernel/sched/sched.h. stripped out codes for CONFIG_SCHED_QHMP.] Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> 2015-08-21 11:02:22 -07:00			`static inline u64 cpu_cravg_sync(int cpu, int sync)`
			`{`
			`struct rq *rq = cpu_rq(cpu);`
			`u64 load;`

			`load = rq->hmp_stats.cumulative_runnable_avg;`

			`/*`
			`* If load is being checked in a sync wakeup environment,`
			`* we may want to discount the load of the currently running`
			`* task.`
			`*/`
			`if (sync && cpu == smp_processor_id()) {`
			`if (load > rq->curr->ravg.demand)`
			`load -= rq->curr->ravg.demand;`
			`else`
			`load = 0;`
			`}`

			`return load;`
			`}`
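The sync-wakeup discount in `cpu_cravg_sync()` is a saturating subtraction. It is sketched here as a standalone function; the name `demo_sync_load()` and its flattened parameters are assumptions that stand in for the `rq` lookup and the `smp_processor_id()` comparison.

```c
#include <stdint.h>

/*
 * rq_load     - cumulative runnable average of the runqueue
 * curr_demand - demand of the task currently running on that CPU
 * sync        - non-zero for a sync wakeup
 * is_this_cpu - non-zero when the CPU being examined is the waker's CPU
 *
 * For a sync wakeup on the waker's own CPU the running task is expected
 * to sleep soon, so its demand is discounted. The result is clamped at
 * zero instead of being allowed to wrap around.
 */
static uint64_t demo_sync_load(uint64_t rq_load, uint64_t curr_demand,
			       int sync, int is_this_cpu)
{
	if (sync && is_this_cpu) {
		if (rq_load > curr_demand)
			return rq_load - curr_demand;
		return 0;
	}

	return rq_load;
}
```

The clamp matters because the values are unsigned: subtracting a demand larger than the accumulated load would otherwise produce a very large load and make the CPU look saturated.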

sched: Avoid waking idle cpu for short-burst tasks Introduce sched_short_burst tunable to classify "short-burst" tasks. These tasks are eligible for packing to avoid overhead associated with waking up an idle CPU. select_best_cpu() ignores power-cost and selects the CPU with least wakeup latency which is not loaded with IRQs and can accommodate this task without exceeding spill limits. The ties are broken with load followed by previous CPU. This policy does not affect cluster selection but only CPU selection in the selected cluster. The tasks eligible for "wakeup-up-idle" and "boost" are not considered for packing. This policy is applied for both "fair" and "rt" scheduling class tasks. Change-Id: I2a05493fde93f58636725f18d0ce8dbce4418a30 Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org> Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org> 2016-09-09 19:38:03 +05:30			`static inline bool is_short_burst_task(struct task_struct *p)`
			`{`
sched: Avoid packing tasks with low sleep time Low sleep time can be an indication that waking tasks will not receive any vruntime bonus and hence would suffer from latency when packed. short-burst tasks sleeping on an average more than sched_short_sleep_ns are not eligible for packing. This policy covers the case where a task runs in short bursts and sleeping for smaller duration in between. Change-Id: Ib81fa37809b85c267949cd433bc6115dd89f100e Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org> Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org> 2016-09-09 19:59:12 +05:30			`return p->ravg.avg_burst < sysctl_sched_short_burst &&`
			`p->ravg.avg_sleep_time > sysctl_sched_short_sleep;`
sched: Avoid waking idle cpu for short-burst tasks Introduce sched_short_burst tunable to classify "short-burst" tasks. These tasks are eligible for packing to avoid overhead associated with waking up an idle CPU. select_best_cpu() ignores power-cost and selects the CPU with least wakeup latency which is not loaded with IRQs and can accommodate this task without exceeding spill limits. The ties are broken with load followed by previous CPU. This policy does not affect cluster selection but only CPU selection in the selected cluster. The tasks eligible for "wakeup-up-idle" and "boost" are not considered for packing. This policy is applied for both "fair" and "rt" scheduling class tasks. Change-Id: I2a05493fde93f58636725f18d0ce8dbce4418a30 Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org> Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org> 2016-09-09 19:38:03 +05:30			`}`

sched: Restore previous implementation of check_for_migration() commit 3bda2b55b41d ("Merge android-4.4.96 (aed4c54) into msm-4.4") replaced HMP scheduler check_for_migration() implementation with EAS scheduler implementation. This breaks HMP scheduler upmigration functionality. Fix this by restoring the previous implementation. Change-Id: I3221f3efe42e1e43f8009cfa52c11afbb9d9c5b3 Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org> 2018-01-05 10:21:34 +05:30			`extern void check_for_migration(struct rq *rq, struct task_struct *p);`
sched: remove the notion of small tasks and small task packing Task packing will now be determined solely on the basis of the power cost of task placement. All tasks are eligible for packing. Remove the notion of "small" tasks from the scheduler. Change-Id: I72d52d04b2677c6a8d0bc6aa7d50ff0f1a4f5ebb Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> 2015-06-19 12:28:24 -07:00			`extern void pre_big_task_count_change(const struct cpumask *cpus);`
			`extern void post_big_task_count_change(const struct cpumask *cpus);`
sched: Basic task placement support for HMP systems HMP systems have cpus with different power and performance characteristics. Some cpus could offer better power at cost of lower performance while other cpus could offer better performance at cost of higher power. As a result, bandwidth consumed by a task to do some "fixed" amount of work could vary across cpus. Optimal task placement on HMP would involve placing a task on a cpu where it can meet its performance goals at lowest power cost. Since kernel has little to no awareness of performance goals of applications, we guesstimate whether task is meeting its performance goals or not by looking at its cpu bandwidth consumption. High bandwidth consumption could imply that task's performance can improve by running on cpus with better capacity/performance-characteristics. This patch makes the basic changes to support HMP. It provides a configurable threshold and any task consuming bandwidth in excess of threshold will be placed on a cpu with better capacity. Change-Id: I3fd98edd430f73342fbef06411e8b2d1cf2f56fa Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org> [rameezmustafa@codeaurora.org: Port to msm-3.18] Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> [joonwoop@codeaurora.org: fixed conflict about members of p->se which are not available anymore.] Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> 2014-03-29 20:04:42 -07:00			`extern void set_hmp_defaults(void);`
sched: Add load based placement for RT tasks Currently RT tasks prefer to go to the lowest power CPU in the system. This can end up causing contention on the lowest power CPU. Instead ensure that RT tasks end up on the lowest power cluster and the least loaded CPU within that cluster. Change-Id: I363b3d43236924962c67d2fb5d3d2d09800cd994 Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> 2015-07-21 15:00:59 -07:00			`extern int power_delta_exceeded(unsigned int cpu_cost, unsigned int base_cost);`
sched: precompute required frequency for CPU load At present in order to estimate power cost of CPU load, HMP scheduler converts CPU load to corresponding frequency on the fly which can be avoided. Optimize and reduce execution time of select_best_cpu() by precomputing CPU load to frequency conversion. This optimization reduces about ~20% of execution time of select_best_cpu() on average. Change-Id: I385c57f2ea9a50883b76ba6ca3deb673b827217f [joonwoop@codeaurora.org: fixed minor conflict in kernel/sched/sched.h. stripped out codes for CONFIG_SCHED_QHMP.] Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> 2015-08-21 11:02:22 -07:00			`extern unsigned int power_cost(int cpu, u64 demand);`
sched: window-stats: Code cleanup Remove code duplication associated with update of various window-stats related sysctl tunables Change-Id: I64e29ac065172464ba371a03758937999c42a71f Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org> 2014-08-11 09:22:24 +05:30			`extern void reset_all_window_stats(u64 window_start, unsigned int window_size);`
sched: Make RT tasks eligible for boost During sched boost RT tasks currently end up going to the lowest power cluster. This can be a performance bottleneck especially if the frequency and IPC differences between clusters are high. Furthermore, when RT tasks go over to the little cluster during boost, the load balancer keeps attempting to pull work over to the big cluster. This results in pre-emption of the executing RT task causing more delays. Finally, containing more work on a single cluster during boost might help save some power if the little cluster can then enter deeper low power modes. Change-Id: I177b2e81be5657c23e7ac43889472561ce9993a9 Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> 2014-12-03 10:18:12 -08:00			`extern int sched_boost(void);`
sched: Move most HMP specific code to a separate file. Most code pertaining to CONFIG_SCHED_HMP has been moved to a separate file "hmp.c" in order to facilitate kernel upgrades. Fewer changes in the original scheduler files means fewer conflicts. Some parts of code, however, could not be moved to the separate file either because of dependencies with other non-HMP code or because the changes are specific only to the scheduling classes where the code resides. Change-Id: Ib067ac75e5a494008dcb3c67586b622c1b3962ce Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> 2016-08-01 17:48:21 -07:00			`extern int task_load_will_fit(struct task_struct *p, u64 task_load, int cpu,`
sched/hmp: Enhance co-location and scheduler boost features The recent introduction of the schedtune cgroup controller has provided the scheduler with added flexibility in terms of some of it's placement features. In particular each cgroup under the schedtune controller can now specify: 1) Whether it needs co-location along with other cgroups 2) Whether it is eligible for scheduler boost (sched_boost_enabled) 3) Whether the kernel can override the boost eligibility when necessary (sched_boost_no_override) The scheduler now creates a reserved co-location group at boot. This group is used to co-locate all tasks that form part of any one of the cgroups that have co-location enabled. This reserved group can neither be destroyed nor reused for other purposes. Furthermore, cgroups are only allowed to indicate their co-location preference once at boot. Further updates are disallowed. Since we are now creating co-location groups for an extended period of time, there are a few other factors to consider when determining the preferred cluster for the group. We first exclude any tasks in the group that have not been observed to be running for a significant amount of time. Secondly we introduce the notion of group up and down migrate tunables to allow different migration policies than individual tasks. Lastly we break co-location if a single task in a group exceeds up-migrate but the total load of the group does not exceed group up-migrate. In terms of sched_boost, the scheduler now supports multiple types of boost. These are: 1) FULL_THROTTLE : Force up-migrate tasks belonging any cgroup that has the sched_boost_enabled flag turned on. Little CPUs will only be used when big CPUs can no longer accommodate tasks. Also up-migrate all RT tasks. 2) CONSERVATIVE : Override the sched_boost_enabled flag for all cgroups except those that have the sched_boost_no_override flag set. Force up-migrate all tasks belonging to only those cgroups that still remain eligible for boost. 
RT tasks do not get force up migrated. 3) RESTRAINED : Start frequency aggregation for co-located tasks. This type of boost does not force up-migrate any task. Finally the boost API removes ref-counting. This means that there can only be a single entity using boost at any given time. If multiple entities are managing boost, they are required to be well behaved so that they don't interfere with one another. Even for a single client, it is not possible to switch directly from one boost type to another. Boost must be first turned off before switching over to a new type. Change-Id: I8d224a70cbef162f27078b62b73acaa22670861d Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> 2016-08-31 16:54:12 -07:00			`enum sched_boost_policy boost_policy);`
			`extern enum sched_boost_policy sched_boost_policy(void);`
sched: Move most HMP specific code to a separate file. Most code pertaining to CONFIG_SCHED_HMP has been moved to a separate file "hmp.c" in order to facilitate kernel upgrades. Fewer changes in the original scheduler files means fewer conflicts. Some parts of code, however, could not be moved to the separate file either because of dependencies with other non-HMP code or because the changes are specific only to the scheduling classes where the code resides. Change-Id: Ib067ac75e5a494008dcb3c67586b622c1b3962ce Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> 2016-08-01 17:48:21 -07:00			`extern int task_will_fit(struct task_struct *p, int cpu);`
			`extern u64 cpu_load(int cpu);`
			`extern u64 cpu_load_sync(int cpu, int sync);`
			`extern int preferred_cluster(struct sched_cluster *cluster,`
			`struct task_struct *p);`
			`extern void inc_nr_big_task(struct hmp_sched_stats *stats,`
			`struct task_struct *p);`
			`extern void dec_nr_big_task(struct hmp_sched_stats *stats,`
			`struct task_struct *p);`
			`extern void inc_rq_hmp_stats(struct rq *rq,`
			`struct task_struct *p, int change_cra);`
			`extern void dec_rq_hmp_stats(struct rq *rq,`
			`struct task_struct *p, int change_cra);`
sched: Fix compilation issue with reset_hmp_stats reset_hmp_stats was moved to another file and when CONFIG_CFS_BANDWIDTH is enabled there is code still referencing this in the original file causing compilation error. Change-Id: Iab7fc8551b628c443ce751026b06c5ff4ebba39a Signed-off-by: Olav Haugan <ohaugan@codeaurora.org> 2016-10-25 11:05:13 -07:00			`extern void reset_hmp_stats(struct hmp_sched_stats *stats, int reset_cra);`
sched: Move most HMP specific code to a separate file. Most code pertaining to CONFIG_SCHED_HMP has been moved to a separate file "hmp.c" in order to facilitate kernel upgrades. Fewer changes in the original scheduler files means fewer conflicts. Some parts of code, however, could not be moved to the separate file either because of dependencies with other non-HMP code or because the changes are specific only to the scheduling classes where the code resides. Change-Id: Ib067ac75e5a494008dcb3c67586b622c1b3962ce Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> 2016-08-01 17:48:21 -07:00			`extern int is_big_task(struct task_struct *p);`
			`extern int upmigrate_discouraged(struct task_struct *p);`
			`extern struct sched_cluster *rq_cluster(struct rq *rq);`
			`extern int nr_big_tasks(struct rq *rq);`
			`extern void fixup_nr_big_tasks(struct hmp_sched_stats *stats,`
			`struct task_struct *p, s64 delta);`
			`extern void reset_task_stats(struct task_struct *p);`
			`extern void reset_cfs_rq_hmp_stats(int cpu, int reset_cra);`
			`extern void _inc_hmp_sched_stats_fair(struct rq *rq,`
			`struct task_struct *p, int change_cra);`
			`extern u64 cpu_upmigrate_discourage_read_u64(struct cgroup_subsys_state *css,`
			`struct cftype *cft);`
			`extern int cpu_upmigrate_discourage_write_u64(struct cgroup_subsys_state *css,`
			`struct cftype *cft, u64 upmigrate_discourage);`
sched/hmp: Enhance co-location and scheduler boost features The recent introduction of the schedtune cgroup controller has provided the scheduler with added flexibility in terms of some of it's placement features. In particular each cgroup under the schedtune controller can now specify: 1) Whether it needs co-location along with other cgroups 2) Whether it is eligible for scheduler boost (sched_boost_enabled) 3) Whether the kernel can override the boost eligibility when necessary (sched_boost_no_override) The scheduler now creates a reserved co-location group at boot. This group is used to co-locate all tasks that form part of any one of the cgroups that have co-location enabled. This reserved group can neither be destroyed nor reused for other purposes. Furthermore, cgroups are only allowed to indicate their co-location preference once at boot. Further updates are disallowed. Since we are now creating co-location groups for an extended period of time, there are a few other factors to consider when determining the preferred cluster for the group. We first exclude any tasks in the group that have not been observed to be running for a significant amount of time. Secondly we introduce the notion of group up and down migrate tunables to allow different migration policies than individual tasks. Lastly we break co-location if a single task in a group exceeds up-migrate but the total load of the group does not exceed group up-migrate. In terms of sched_boost, the scheduler now supports multiple types of boost. These are: 1) FULL_THROTTLE : Force up-migrate tasks belonging any cgroup that has the sched_boost_enabled flag turned on. Little CPUs will only be used when big CPUs can no longer accommodate tasks. Also up-migrate all RT tasks. 2) CONSERVATIVE : Override the sched_boost_enabled flag for all cgroups except those that have the sched_boost_no_override flag set. Force up-migrate all tasks belonging to only those cgroups that still remain eligible for boost. 
RT tasks do not get force up migrated. 3) RESTRAINED : Start frequency aggregation for co-located tasks. This type of boost does not force up-migrate any task. Finally the boost API removes ref-counting. This means that there can only be a single entity using boost at any given time. If multiple entities are managing boost, they are required to be well behaved so that they don't interfere with one another. Even for a single client, it is not possible to switch directly from one boost type to another. Boost must be first turned off before switching over to a new type. Change-Id: I8d224a70cbef162f27078b62b73acaa22670861d Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> 2016-08-31 16:54:12 -07:00			`extern void sched_boost_parse_dt(void);`
sched: Optimize the next top task search logic upon task migration find_next_top_index() is responsible for finding the second top task on a CPU when the top task migrates away from that CPU. This operation is expensive as we need to iterate the entire array of top tasks to find the second top task. Optimize this by introducing bitmaps for tracking top task indices. There are two bitmaps; one for the previous window and one for the current window. Each bit in a bitmap tracks whether the corresponding bucket in the top task hashmap has a non zero refcount. The bit is set when the refcount becomes non zero and is cleared when it becomes zero. Finding the second top task upon migration is then simply a matter of finding the highest set bit in the bitmap. Change-Id: Ibafaf66eed756b0328704dfaa89c17ab0d84e359 Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> 2016-06-07 15:18:37 -07:00			`extern void clear_top_tasks_bitmap(unsigned long *bitmap);`
sched: Basic task placement support for HMP systems HMP systems have cpus with different power and performance characteristics. Some cpus could offer better power at cost of lower performance while other cpus could offer better performance at cost of higher power. As a result, bandwidth consumed by a task to do some "fixed" amount of work could vary across cpus. Optimal task placement on HMP would involve placing a task on a cpu where it can meet its performance goals at lowest power cost. Since kernel has little to no awareness of performance goals of applications, we guestimate whether task is meeting its performance goals or not by looking at its cpu bandwidth consumption. High bandwidth consumption could imply that task's performance can improve by running on cpus with better capacity/performance-characterisitcs. This patch makes the basic changes to support HMP. It provides a configurable threshold and any task consuming bandwidth in excess of threshold will be placed on a cpu with better capacity. Change-Id: I3fd98edd430f73342fbef06411e8b2d1cf2f56fa Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org> [rameezmustafa@codeaurora.org]: Port to msm-3.18] Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> [joonwoop@codeaurora.org: fixed conflict about members of p->se which are not available anymore.] Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> 2014-03-29 20:04:42 -07:00
			`#if defined(CONFIG_SCHED_TUNE) && defined(CONFIG_CGROUP_SCHEDTUNE)`
			`extern bool task_sched_boost(struct task_struct *p);`
			`extern int sync_cgroup_colocation(struct task_struct *p, bool insert);`
			`extern bool same_schedtune(struct task_struct *tsk1, struct task_struct *tsk2);`
			`extern void update_cgroup_boost_settings(void);`
			`extern void restore_cgroup_boost_settings(void);`

			`#else`
			`static inline bool`
			`same_schedtune(struct task_struct *tsk1, struct task_struct *tsk2)`
			`{`
			`return true;`
			`}`

			`static inline bool task_sched_boost(struct task_struct *p)`
			`{`
			`return true;`
			`}`

			`static inline void update_cgroup_boost_settings(void) { }`
			`static inline void restore_cgroup_boost_settings(void) { }`
			`#endif`

sched: pre-allocate colocation groups At present, sched_set_group_id() dynamically allocates structure for colocation group to assign the given task to the group. However this can cause deadlock as memory allocator can wakeup a task which also tries to acquire related_thread_group_lock. Avoid such deadlock by pre-allocating colocation structures. This limits maximum colocation groups to static number but it's fine as it's never expected to be a lot. Change-Id: Ifc32ab4ead63c382ae390358ed86f7cc5b6eb2dc Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> 2016-11-28 13:41:18 -08:00			`extern int alloc_related_thread_groups(void);`
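The pre-allocation approach described in the commit message above can be sketched as a fixed pool that is filled once and then only looked up, so that no memory allocation happens while a group lock would be held. The pool size, structure layout and function names are assumptions for illustration; the kernel's own structures are not shown in this file.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NUM_GROUPS	20	/* hypothetical static upper bound */

/* Hypothetical, heavily reduced stand-in for a colocation group. */
struct colocation_group {
	unsigned int id;
	unsigned int nr_tasks;
};

/* All groups live in a static pool; nothing is allocated at join time. */
static struct colocation_group group_pool[MAX_NUM_GROUPS];

/* One-time setup, the role alloc_related_thread_groups() plays at boot. */
static int prealloc_groups(void)
{
	for (unsigned int i = 0; i < MAX_NUM_GROUPS; i++) {
		group_pool[i].id = i;
		group_pool[i].nr_tasks = 0;
	}
	return 0;
}

/* Lookup never allocates, so it cannot re-enter the allocator. */
static struct colocation_group *lookup_group(unsigned int id)
{
	if (id >= MAX_NUM_GROUPS)
		return NULL;
	return &group_pool[id];
}

/* Joining a group is a lookup plus a counter update. */
static int join_group(unsigned int id)
{
	struct colocation_group *grp = lookup_group(id);

	if (!grp)
		return -1;
	assert(grp->id == id);
	grp->nr_tasks++;
	return 0;
}
```

The trade-off is the one the commit message names: the number of groups is capped at a static value.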

sched: Remove all existence of CONFIG_SCHED_FREQ_INPUT CONFIG_SCHED_FREQ_INPUT was created to keep parts of the scheduler dealing with frequency separate from other parts of the scheduler that deal with task placement. However, overtime the two features have become intricately linked whereby SCHED_FREQ_INPUT cannot be turned on without having SCHED_HMP turned on as well. Given this complex inter-dependency and the fact that all old, existing and future targets use both config options, remove this unnecessary feature separation. It will aid in making kernel upgrades a lot simpler and faster. Change-Id: Ia20e40d8a088d50909cc28f5be758fa3e9a4af6f Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> 2016-07-28 19:18:08 -07:00			`#else /* CONFIG_SCHED_HMP */`

			`struct hmp_sched_stats;`
			`struct related_thread_group;`
sched: Move most HMP specific code to a separate file. Most code pertaining to CONFIG_SCHED_HMP has been moved to a separate file "hmp.c" in order to facilitate kernel upgrades. Fewer changes in the original scheduler files means fewer conflicts. Some parts of code, however, could not be moved to the separate file either because of dependencies with other non-HMP code or because the changes are specific only to the scheduling classes where the code resides. Change-Id: Ib067ac75e5a494008dcb3c67586b622c1b3962ce Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> 2016-08-01 17:48:21 -07:00			`struct sched_cluster;`

sched: fix compiler errors with !SCHED_HMP HMP scheduler boost feature related functions are referred in SMP load balancer. Add the nop functions for the same to fix the compiler errors with !SCHED_HMP. Change-Id: I1cbcf67f728c2cbc7c0f47e8eaf1f4165649dce8 Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org> 2017-01-11 15:11:23 +05:30			`static inline enum sched_boost_policy sched_boost_policy(void)`
			`{`
			`return SCHED_BOOST_NONE;`
			`}`

			`static inline bool task_sched_boost(struct task_struct *p)`
			`{`
			`return true;`
			`}`

			`static inline int got_boost_kick(void)`
			`{`
			`return 0;`
			`}`

			`static inline void update_task_ravg(struct task_struct *p, struct rq *rq,`
			`int event, u64 wallclock, u64 irqtime) { }`

			`static inline bool early_detection_notify(struct rq *rq, u64 wallclock)`
			`{`
			`return 0;`
			`}`

			`static inline void clear_ed_task(struct task_struct *p, struct rq *rq) { }`
			`static inline void fixup_busy_time(struct task_struct *p, int new_cpu) { }`
			`static inline void clear_boost_kick(int cpu) { }`
			`static inline void clear_hmp_request(int cpu) { }`
			`static inline void mark_task_starting(struct task_struct *p) { }`
			`static inline void set_window_start(struct rq *rq) { }`
sched: Add a stub function for init_clusters() Add a stub function for init_cluster() and remove a ifdefry for SCHED_HMP in sched_init() Change-Id: I6745485152d735436d8398818f7fb5e70ce5ee65 Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org> 2016-10-01 11:06:13 +05:30			`static inline void init_clusters(void) {}`
			`static inline void update_cluster_topology(void) { }`
sched: Track average sleep time Similar to tracking average burst length for tasks, average sleep time indicates how much a task sleeps on an average before waking up to run. Very low sleep and burst lengths indicates tasks that could be sensitive to task-wake latencies and hence should not be packed. Change-Id: Ife68a9a9a9e596246aab5029f60e41c5bad781e4 Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org> Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org> 2016-09-09 19:50:27 +05:30			`static inline void note_task_waking(struct task_struct *p, u64 wallclock) { }`
			`static inline void set_task_last_switch_out(struct task_struct *p,`
			`u64 wallclock) { }`

			`static inline int task_will_fit(struct task_struct *p, int cpu)`
			`{`
			`return 1;`
			`}`

			`static inline int select_best_cpu(struct task_struct *p, int target,`
			`int reason, int sync)`
			`{`
			`return 0;`
			`}`

			`static inline unsigned int power_cost(int cpu, u64 demand)`
			`{`
			`return SCHED_CAPACITY_SCALE;`
			`}`

			`static inline int sched_boost(void)`
			`{`
			`return 0;`
			`}`

			`static inline int is_big_task(struct task_struct *p)`
			`{`
			`return 0;`
			`}`

			`static inline int nr_big_tasks(struct rq *rq)`
			`{`
			`return 0;`
			`}`

			`static inline int is_cpu_throttling_imminent(int cpu)`
			`{`
			`return 0;`
			`}`

			`static inline int is_task_migration_throttled(struct task_struct *p)`
			`{`
			`return 0;`
			`}`

			`static inline unsigned int cpu_temp(int cpu)`
			`{`
			`return 0;`
			`}`

			`static inline void`
			`inc_rq_hmp_stats(struct rq *rq, struct task_struct *p, int change_cra) { }`

			`static inline void`
			`dec_rq_hmp_stats(struct rq *rq, struct task_struct *p, int change_cra) { }`

			`static inline void`
			`inc_hmp_sched_stats_fair(struct rq *rq, struct task_struct *p) { }`

			`static inline void`
			`dec_hmp_sched_stats_fair(struct rq *rq, struct task_struct *p) { }`

			`static inline int`
			`preferred_cluster(struct sched_cluster *cluster, struct task_struct *p)`
			`{`
			`return 1;`
			`}`

			`static inline struct sched_cluster *rq_cluster(struct rq *rq)`
			`{`
			`return NULL;`
			`}`

sched/walt: Fix the memory leak of idle task load pointers The memory for task load pointers are allocated twice for each idle thread except for the boot CPU. This happens during boot from idle_threads_init()->idle_init() in the following 2 paths. 1. idle_init()->fork_idle()->copy_process()-> sched_fork()->init_new_task_load() 2. idle_init()->fork_idle()-> init_idle()->init_new_task_load() The memory allocation for all tasks happens through the 1st path, so use the same for idle tasks and kill the 2nd path. Since the idle thread of boot CPU does not go through fork_idle(), allocate the memory for it separately. Change-Id: I4696a414ffe07d4114b56d326463026019e278f1 Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org> [schikk@codeaurora.org: resolved merge conflicts] Signed-off-by: Swetha Chikkaboraiah <schikk@codeaurora.org> 2018-09-20 15:31:36 +05:30			`static inline void init_new_task_load(struct task_struct *p)`
sched: Add per CPU load tracking for each task Keeping a track of the load footprint of each task on every CPU that it executed on gives the scheduler much more flexibility in terms of the number of frequency guidance policies. These new fields will be used in subsequent patches as we alter the load fixup mechanism upon task migration. We still need to maintain the curr/prev_window sums as they will also be required in subsequent patches as we start to track top tasks based on cumulative load. Also, we need to call init_new_task_load() for the idle task. This is an existing harmless bug as load tracking for the idle task is irrelevant. However, in this patch we are adding pointers to the ravg structure. These pointers have to be initialized even for the idle task. Finally move init_new_task_load() to sched_fork(). This was always the more appropriate place, however, following the introduction of new pointers in the ravg struct, this is necessary to avoid races with functions such as reset_all_task_stats(). Change-Id: Ib584372eb539706da4319973314e54dae04e5934 Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> 2016-05-09 16:28:07 -07:00			`{`
			`}`
			`static inline u64 scale_load_to_cpu(u64 load, int cpu)`
			`{`
			`return load;`
			`}`

			`static inline unsigned int nr_eligible_big_tasks(int cpu)`
			`{`
			`return 0;`
			`}`

core_ctl: un-isolate BIG CPUs more aggressively The current algorithm to bring additional BIG CPUs is very conservative. It works when BIG tasks alone run on BIG cluster. When co-location and scheduler boost features are activated, small/medium tasks also run on BIG cluster. We don't want these tasks to downmigrate, when BIG CPUs are available but isolated. The following changes are done to un-isolate CPUs more aggressively. (1) Round up the big_avg. When the big_avg indicates that there are 1.5 tasks on an average in the last window, it indicates that we need 2 BIG CPUs not 1 BIG CPU. (2) Track the maximum number of running tasks in the last window on all CPUs. If any of the CPU in a cluster has more than 4 runnable tasks in the last window, bring an additional CPU to help out. Change-Id: Id05d9983af290760cec6d93d1bdc45bc5e924cce Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org> 2017-05-10 15:43:29 +05:30			`static inline bool is_max_capacity_cpu(int cpu) { return true; }`
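The two rules listed in the commit message above (round the average up, and add one CPU when any CPU saw more than 4 runnable tasks) can be expressed as a short calculation. The fixed-point scale, the parameter names and the clamp to the cluster size are assumptions for this sketch and do not come from the core_ctl source.

```c
#include <assert.h>

#define AVG_SCALE	100	/* hypothetical fixed point: 150 means 1.5 tasks */
#define BUSY_THRESHOLD	4	/* "more than 4 runnable tasks" from the commit */

/*
 * Hypothetical helper: how many big CPUs should be un-isolated.
 *  big_avg_scaled  average number of big tasks, scaled by AVG_SCALE
 *  max_nr_running  largest runnable count seen on any CPU in the window
 *  cluster_cpus    number of CPUs in the big cluster
 */
static unsigned int big_cpus_needed(unsigned int big_avg_scaled,
				    unsigned int max_nr_running,
				    unsigned int cluster_cpus)
{
	/* Rule 1: round up, so an average of 1.5 tasks asks for 2 CPUs. */
	unsigned int need = (big_avg_scaled + AVG_SCALE - 1) / AVG_SCALE;

	assert(cluster_cpus > 0);

	/* Rule 2: one overloaded CPU brings in an additional CPU. */
	if (max_nr_running > BUSY_THRESHOLD)
		need++;

	/* Never ask for more CPUs than the cluster has. */
	if (need > cluster_cpus)
		need = cluster_cpus;

	return need;
}
```

With truncating division an average of 1.5 would yield a single CPU, which is the conservative behaviour the commit sets out to change.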

			`static inline int pct_task_load(struct task_struct *p) { return 0; }`

			`static inline int cpu_capacity(int cpu)`
			`{`
			`return SCHED_LOAD_SCALE;`
			`}`

			`static inline int same_cluster(int src_cpu, int dst_cpu) { return 1; }`

			`static inline void inc_cumulative_runnable_avg(struct hmp_sched_stats *stats,`
			`struct task_struct *p)`
			`{`
			`}`

			`static inline void dec_cumulative_runnable_avg(struct hmp_sched_stats *stats,`
			`struct task_struct *p)`
			`{`
			`}`

			`static inline void sched_account_irqtime(int cpu, struct task_struct *curr,`
			`u64 delta, u64 wallclock)`
			`{`
			`}`

			`static inline void sched_account_irqstart(int cpu, struct task_struct *curr,`
			`u64 wallclock)`
			`{`
			`}`

			`static inline int sched_cpu_high_irqload(int cpu) { return 0; }`

			`static inline void set_preferred_cluster(struct related_thread_group *grp) { }`

sched: Fix compile issues for !CONFIG_SCHED_HMP Fix compile issues observed when CONFIG_SCHED_HMP is not turned on. There are still targets that may want that config option turned off. Change-Id: I29e69356da8d003d13d8cd3927a0b166cc1ef95e Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> 2016-07-29 15:56:29 -07:00			`static inline bool task_in_related_thread_group(struct task_struct *p)`
			`{`
			`return false;`
			`}`

			`static inline`
			`struct related_thread_group *task_related_thread_group(struct task_struct *p)`
			`{`
			`return NULL;`
			`}`

			`static inline u32 task_load(struct task_struct *p) { return 0; }`

			`static inline int update_preferred_cluster(struct related_thread_group *grp,`
			`struct task_struct *p, u32 old_load)`
			`{`
			`return 0;`
			`}`
sched: inherit the group id from the group leader When sysctl_sched_enable_thread_grouping is set to 1, any new tasks created are put in the same group as their group leader. Change-Id: If1837dd7c8120c8b097cfffa1dc52eb4781f1641 Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org> 2015-10-21 16:04:46 +05:30			`static inline void add_new_task_to_grp(struct task_struct *new) {}`

			`#define PRED_DEMAND_DELTA (0)`

			`static inline void`
			`check_for_freq_change(struct rq *rq, bool check_pred, bool check_groups) { }`

sched: Move notify_migration() under CONFIG_SCHED_HMP notify_migration() is a HMP specific function that relies on all of its contents to be stubbed out for !CONFIG_SCHED_HMP. However, it still maintains calls to rcu_read_lock/unlock(). In the !HMP case these calls are simply redundant. Move the function under CONFIG_SCHED_HMP and add a stub when the config is not defined so that there is no overhead. Change-Id: Iad914f31b629e81e403b0e89796b2b0f1d081695 Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> 2016-08-02 15:08:13 -07:00			`static inline void notify_migration(int src_cpu, int dest_cpu,`
			`bool src_cpu_dead, struct task_struct *p) { }`

			`static inline int same_freq_domain(int src_cpu, int dst_cpu)`
			`{`
			`return 1;`
			`}`
sched: support legacy mode better It should be possible to bypass all HMP scheduler changes at runtime by setting sysctl_sched_enable_hmp_task_placement and sysctl_sched_enable_power_aware to 0. Fix various code paths to honor this requirement. Change-Id: I74254e68582b3f9f1b84661baf7dae14f981c025 Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org> Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> [joonwoop@codeaurora.org: fixed conflict in rt.c, p->nr_cpus_allowed == 1 is now moved in core.c] Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> 2014-07-21 02:05:24 -07:00
sched: Restore previous implementation of check_for_migration() commit 3bda2b55b41d ("Merge android-4.4.96 (aed4c54) into msm-4.4") replaced HMP scheduler check_for_migration() implementation with EAS scheduler implementation. This breaks HMP scheduler upmgiration functionality. Fix this by restoring the previous implementation. Change-Id: I3221f3efe42e1e43f8009cfa52c11afbb9d9c5b3 Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org> 2018-01-05 10:21:34 +05:30			`static inline void check_for_migration(struct rq *rq, struct task_struct *p) { }`
sched: remove the notion of small tasks and small task packing Task packing will now be determined solely on the basis of the power cost of task placement. All tasks are eligible for packing. Remove the notion of "small" tasks from the scheduler. Change-Id: I72d52d04b2677c6a8d0bc6aa7d50ff0f1a4f5ebb Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> 2015-06-19 12:28:24 -07:00			`static inline void pre_big_task_count_change(void) { }`
			`static inline void post_big_task_count_change(void) { }`
			`static inline void set_hmp_defaults(void) { }`

sched: Fix herding issue check_for_migration() could run concurrently on multiple cpus, resulting in multiple tasks wanting to migrate to same cpu. This could cause cpus to be underutilized and lead to increased scheduling latencies for tasks. Fix this by serializing select_best_cpu() calls from cpus running check_for_migration() check and marking selected cpus as reserved, so that subsequent call to select_best_cpu() from check_for_migration() will skip reserved cpus. Change-Id: I73a22cacab32dee3c14267a98b700f572aa3900c Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org> [rameezmustafa@codeaurora.org]: Port to msm-3.18] Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org 2014-07-25 08:04:27 -07:00			`static inline void clear_reserved(int cpu) { }`
			`static inline void sched_boost_parse_dt(void) {}`
			`static inline int alloc_related_thread_groups(void) { return 0; }`
sched: Add additional ftrace events This patch adds two ftrace events: sched_task_load -> records information of a task, such as scaled demand sched_cpu_load -> records information of a cpu, such as nr_running, nr_big_tasks etc This will be useful to debug HMP related task placement decisions by scheduler. Change-Id: If91587149bcd9bed157b5d2bfdecc3c3bf6652ff Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org> 2014-03-31 18:10:21 -07:00			`#define trace_sched_cpu_load(...)`
sched: Optimize scheduler trace events to reduce trace buffer usage Scheduler ftrace events currently generate a lot of data when turned on. The excessive log messages often end up overflowing trace buffers for long use cases or crowding out other events. Optimize scheduler events so that the log spew is less and more manageable. To that end change the variable type for some event fields; introduce variants of sched_cpu_load that can be turned on/off for separate code paths and remove unused fields from various events. Change-Id: I2b313542b39ad5e09a01ad1303b5dfe2c4883b8a Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> [joonwoop@codeaurora.org: fixed conflict in rt.c due to CONFIG_SCHED_QHMP.] Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> 2015-11-02 15:08:20 -08:00			`#define trace_sched_cpu_load_lb(...)`
			`#define trace_sched_cpu_load_cgroup(...)`
			`#define trace_sched_cpu_load_wakeup(...)`
sched: Track burst length for tasks Track burst length for tasks as time they ran from wakeup to sleep. This is used to predict average time a task may run when it wakes up and thus avoid waking up idle cpu for "short-burst" tasks. Change-Id: Ie71d3163630fb8aa0db8ee8383768f8748270cf9 Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org> Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org> 2016-05-13 02:05:32 -07:00			`static inline void update_avg_burst(struct task_struct *p) {}`

sched: Remove all existence of CONFIG_SCHED_FREQ_INPUT CONFIG_SCHED_FREQ_INPUT was created to keep parts of the scheduler dealing with frequency separate from other parts of the scheduler that deal with task placement. However, over time the two features have become intricately linked whereby SCHED_FREQ_INPUT cannot be turned on without having SCHED_HMP turned on as well. Given this complex inter-dependency and the fact that all old, existing and future targets use both config options, remove this unnecessary feature separation. It will aid in making kernel upgrades a lot simpler and faster. Change-Id: Ia20e40d8a088d50909cc28f5be758fa3e9a4af6f Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> 2016-07-28 19:18:08 -07:00			`#endif /* CONFIG_SCHED_HMP */`

			`/*`
			`* Returns the rq capacity of any rq in a group. This does not play`
			`* well with groups where rq capacity can change independently.`
			`*/`
			`#define group_rq_capacity(group) cpu_capacity(group_first_cpu(group))`
sched: Basic task placement support for HMP systems HMP systems have cpus with different power and performance characteristics. Some cpus could offer better power at cost of lower performance while other cpus could offer better performance at cost of higher power. As a result, bandwidth consumed by a task to do some "fixed" amount of work could vary across cpus. Optimal task placement on HMP would involve placing a task on a cpu where it can meet its performance goals at lowest power cost. Since kernel has little to no awareness of performance goals of applications, we guesstimate whether task is meeting its performance goals or not by looking at its cpu bandwidth consumption. High bandwidth consumption could imply that task's performance can improve by running on cpus with better capacity/performance characteristics. This patch makes the basic changes to support HMP. It provides a configurable threshold and any task consuming bandwidth in excess of threshold will be placed on a cpu with better capacity. Change-Id: I3fd98edd430f73342fbef06411e8b2d1cf2f56fa Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org> [rameezmustafa@codeaurora.org]: Port to msm-3.18] Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> [joonwoop@codeaurora.org: fixed conflict about members of p->se which are not available anymore.] Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> 2014-03-29 20:04:42 -07:00
sched: Make separate sched.c translation units Since once needs to do something at conferences and fixing compile warnings doesn't actually require much if any attention I decided to break up the sched.c #include ".c" fest. This further modularizes the scheduler code. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-x0fcd3mnp8f9c99grcpewmhi@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu> 2011-10-25 10:00:11 +02:00			`#ifdef CONFIG_CGROUP_SCHED`

			`/*`
			`* Return the group to which this task belongs.`
			`*`
cgroup: s/cgroup_subsys_state/cgroup_css/ s/task_subsys_state/task_css/ The names of the two struct cgroup_subsys_state accessors - cgroup_subsys_state() and task_subsys_state() - are somewhat awkward. The former clashes with the type name and the latter doesn't even indicate it's somehow related to cgroup. We're about to revamp large portion of cgroup API, so, let's rename them so that they're less awkward. Most per-controller usages of the accessors are localized in accessor wrappers and given the amount of scheduled changes, this isn't gonna add any noticeable headache. Rename cgroup_subsys_state() to cgroup_css() and task_subsys_state() to task_css(). This patch is pure rename. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> 2013-08-08 20:11:22 -04:00			`* We cannot use task_css() and friends because the cgroup subsystem`
			`* changes that value before the cgroup_subsys::attach() method is called,`
			`* therefore we cannot pin it and might observe the wrong value.`
sched: Fix race in task_group() Stefan reported a crash on a kernel before a3e5d1091c1 ("sched: Don't call task_group() too many times in set_task_rq()"), he found the reason to be that the multiple task_group() invocations in set_task_rq() returned different values. Looking at all that I found a lack of serialization and plain wrong comments. The below tries to fix it using an extra pointer which is updated under the appropriate scheduler locks. Its not pretty, but I can't really see another way given how all the cgroup stuff works. Reported-and-tested-by: Stefan Bader <stefan.bader@canonical.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1340364965.18025.71.camel@twins Signed-off-by: Ingo Molnar <mingo@kernel.org> 2012-06-22 13:36:05 +02:00			`*`
			`* The same is true for autogroup's p->signal->autogroup->tg, the autogroup`
			`* core changes this before calling sched_move_task().`
			`*`
			`* Instead we use a 'copy' which is updated from sched_move_task() while`
			`* holding both task_struct::pi_lock and rq::lock.`
sched: Make separate sched.c translation units Since once needs to do something at conferences and fixing compile warnings doesn't actually require much if any attention I decided to break up the sched.c #include ".c" fest. This further modularizes the scheduler code. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-x0fcd3mnp8f9c99grcpewmhi@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu> 2011-10-25 10:00:11 +02:00			`*/`
			`static inline struct task_group *task_group(struct task_struct *p)`
			`{`
sched: Fix race in task_group() Stefan reported a crash on a kernel before a3e5d1091c1 ("sched: Don't call task_group() too many times in set_task_rq()"), he found the reason to be that the multiple task_group() invocations in set_task_rq() returned different values. Looking at all that I found a lack of serialization and plain wrong comments. The below tries to fix it using an extra pointer which is updated under the appropriate scheduler locks. Its not pretty, but I can't really see another way given how all the cgroup stuff works. Reported-and-tested-by: Stefan Bader <stefan.bader@canonical.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1340364965.18025.71.camel@twins Signed-off-by: Ingo Molnar <mingo@kernel.org> 2012-06-22 13:36:05 +02:00			`return p->sched_task_group;`
sched: Make separate sched.c translation units Since once needs to do something at conferences and fixing compile warnings doesn't actually require much if any attention I decided to break up the sched.c #include ".c" fest. This further modularizes the scheduler code. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-x0fcd3mnp8f9c99grcpewmhi@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu> 2011-10-25 10:00:11 +02:00			`}`

			`/* Change a task's cfs_rq and parent entity if it moves across CPUs/groups */`
			`static inline void set_task_rq(struct task_struct *p, unsigned int cpu)`
			`{`
			`#if defined(CONFIG_FAIR_GROUP_SCHED) \|\| defined(CONFIG_RT_GROUP_SCHED)`
			`struct task_group *tg = task_group(p);`
			`#endif`

			`#ifdef CONFIG_FAIR_GROUP_SCHED`
BACKPORT: sched/fair: Make it possible to account fair load avg consistently While set_task_rq_fair() is introduced in mainline by commit ad936d8658fd ("sched/fair: Make it possible to account fair load avg consistently"), the function results to be introduced here by the backport of commit 09a43ace1f98 ("sched/fair: Propagate load during synchronous attach/detach"). The problem (apart from the confusion introduced by the backport) is actually that set_task_rq_fair() is currently not called at all. Fix the problem by backporting again commit ad936d8658fd ("sched/fair: Make it possible to account fair load avg consistently"). Original change log: The current code accounts for the time a task was absent from the fair class (per ATTACH_AGE_LOAD). However it does not work correctly when a task got migrated or moved to another cgroup while outside of the fair class. This patch tries to address that by aging on migration. We locklessly read the 'last_update_time' stamp from both the old and new cfs_rq, ages the load upto the old time, and sets it to the new time. These timestamps should in general not be more than 1 tick apart from one another, so there is a definite bound on things. 
Signed-off-by: Byungchul Park <byungchul.park@lge.com> [ Changelog, a few edits and !SMP build fix ] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1445616981-29904-2-git-send-email-byungchul.park@lge.com Signed-off-by: Ingo Molnar <mingo@kernel.org> (cherry-picked from ad936d8658fd348338cb7d42c577dac77892b074) Signed-off-by: Juri Lelli <juri.lelli@arm.com> Signed-off-by: Chris Redpath <chris.redpath@arm.com> Change-Id: I17294ab0ada3901d35895014715fd60952949358 Signed-off-by: Brendan Jackman <brendan.jackman@arm.com> 2017-05-30 14:51:53 +01:00			`set_task_rq_fair(&p->se, p->se.cfs_rq, tg->cfs_rq[cpu]);`
sched: Make separate sched.c translation units Since once needs to do something at conferences and fixing compile warnings doesn't actually require much if any attention I decided to break up the sched.c #include ".c" fest. This further modularizes the scheduler code. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-x0fcd3mnp8f9c99grcpewmhi@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu> 2011-10-25 10:00:11 +02:00			`p->se.cfs_rq = tg->cfs_rq[cpu];`
			`p->se.parent = tg->se[cpu];`
			`#endif`

			`#ifdef CONFIG_RT_GROUP_SCHED`
			`p->rt.rt_rq = tg->rt_rq[cpu];`
			`p->rt.parent = tg->rt_se[cpu];`
			`#endif`
			`}`

			`#else /* CONFIG_CGROUP_SCHED */`

			`static inline void set_task_rq(struct task_struct *p, unsigned int cpu) { }`
			`static inline struct task_group *task_group(struct task_struct *p)`
			`{`
			`return NULL;`
			`}`
			`#endif /* CONFIG_CGROUP_SCHED */`

			`static inline void __set_task_cpu(struct task_struct *p, unsigned int cpu)`
			`{`
			`set_task_rq(p, cpu);`
			`#ifdef CONFIG_SMP`
			`/*`
			`* After ->cpu is set up to a new value, task_rq_lock(p, ...) can be`
			`* successfully executed on another CPU. We must ensure that updates of`
			`* per-task data have been completed by this moment.`
			`*/`
			`smp_wmb();`
UPSTREAM: sched/core: Allow putting thread_info into task_struct If an arch opts in by setting CONFIG_THREAD_INFO_IN_TASK_STRUCT, then thread_info is defined as a single 'u32 flags' and is the first entry of task_struct. thread_info::task is removed (it serves no purpose if thread_info is embedded in task_struct), and thread_info::cpu gets its own slot in task_struct. This is heavily based on a patch written by Linus. Originally-from: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jann Horn <jann@thejh.net> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/a0898196f0476195ca02713691a5037a14f2aac5.1473801993.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Bug: 38331309 Change-Id: I25e5a830f2ada5e74fa93661e97e5e701b1b70d2 (cherry picked from commit c65eacbe290b8141554c71b2c94489e73ade8c8d) Signed-off-by: Zubin Mithra <zsm@google.com> 2016-09-13 14:29:24 -07:00			`#ifdef CONFIG_THREAD_INFO_IN_TASK`
			`p->cpu = cpu;`
			`#else`
sched: Make separate sched.c translation units Since once needs to do something at conferences and fixing compile warnings doesn't actually require much if any attention I decided to break up the sched.c #include ".c" fest. This further modularizes the scheduler code. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-x0fcd3mnp8f9c99grcpewmhi@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu> 2011-10-25 10:00:11 +02:00			`task_thread_info(p)->cpu = cpu;`
UPSTREAM: sched/core: Allow putting thread_info into task_struct If an arch opts in by setting CONFIG_THREAD_INFO_IN_TASK_STRUCT, then thread_info is defined as a single 'u32 flags' and is the first entry of task_struct. thread_info::task is removed (it serves no purpose if thread_info is embedded in task_struct), and thread_info::cpu gets its own slot in task_struct. This is heavily based on a patch written by Linus. Originally-from: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jann Horn <jann@thejh.net> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/a0898196f0476195ca02713691a5037a14f2aac5.1473801993.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Bug: 38331309 Change-Id: I25e5a830f2ada5e74fa93661e97e5e701b1b70d2 (cherry picked from commit c65eacbe290b8141554c71b2c94489e73ade8c8d) Signed-off-by: Zubin Mithra <zsm@google.com> 2016-09-13 14:29:24 -07:00			`#endif`
sched/numa: Introduce migrate_swap() Use the new stop_two_cpus() to implement migrate_swap(), a function that flips two tasks between their respective cpus. I'm fairly sure there's a less crude way than employing the stop_two_cpus() method, but everything I tried either got horribly fragile and/or complex. So keep it simple for now. The notable detail is how we 'migrate' tasks that aren't runnable anymore. We'll make it appear like we migrated them before they went to sleep. The sole difference is the previous cpu in the wakeup path, so we override this. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Link: http://lkml.kernel.org/r/1381141781-10992-39-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org> 2013-10-07 11:29:16 +01:00			`p->wake_cpu = cpu;`
sched: Make separate sched.c translation units Since once needs to do something at conferences and fixing compile warnings doesn't actually require much if any attention I decided to break up the sched.c #include ".c" fest. This further modularizes the scheduler code. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-x0fcd3mnp8f9c99grcpewmhi@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu> 2011-10-25 10:00:11 +02:00			`#endif`
			`}`
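The ordering requirement described in the comment inside `__set_task_cpu()` (per-task data must be visible before the new `->cpu` value is) can be illustrated with a user-space analogue. The sketch below is an assumption-laden model, not kernel code: `struct demo_task`, `demo_set_task_cpu()` and `demo_read_task()` are hypothetical names, and a C11 release store stands in for the `smp_wmb()` followed by the `->cpu` store.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a task; this is not the kernel's task_struct. */
struct demo_task {
	int rq_index;   /* per-task data that must become visible first */
	atomic_int cpu; /* published last; a negative value means "not published" */
};

/* Writer: update the per-task data, then publish the new CPU. */
void demo_set_task_cpu(struct demo_task *t, int cpu)
{
	t->rq_index = cpu; /* plain store, playing the role of set_task_rq() */

	/*
	 * The release store orders the plain store above before the
	 * publication of ->cpu. It is used here as a user-space
	 * substitute for smp_wmb() plus the ->cpu assignment.
	 */
	atomic_store_explicit(&t->cpu, cpu, memory_order_release);
}

/* Reader: returns true and fills the outputs once a CPU has been published. */
bool demo_read_task(struct demo_task *t, int *cpu, int *rq_index)
{
	int c = atomic_load_explicit(&t->cpu, memory_order_acquire);

	if (c < 0)
		return false;

	*cpu = c;
	*rq_index = t->rq_index; /* guaranteed to see the writer's value */
	return true;
}
```

A reader that observes the published `cpu` through the acquire load also observes the matching `rq_index`, which is the property the kernel comment asks for when another CPU takes `task_rq_lock(p, ...)`.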

			`/*`
			`* Tunables that become constants when CONFIG_SCHED_DEBUG is off:`
			`*/`
			`#ifdef CONFIG_SCHED_DEBUG`
static keys: Introduce 'struct static_key', static_key_true()/false() and static_key_slow_[inc\|dec]() So here's a boot tested patch on top of Jason's series that does all the cleanups I talked about and turns jump labels into a more intuitive to use facility. It should also address the various misconceptions and confusions that surround jump labels. Typical usage scenarios: #include <linux/static_key.h> struct static_key key = STATIC_KEY_INIT_TRUE; if (static_key_false(&key)) do unlikely code else do likely code Or: if (static_key_true(&key)) do likely code else do unlikely code The static key is modified via: static_key_slow_inc(&key); ... static_key_slow_dec(&key); The 'slow' prefix makes it abundantly clear that this is an expensive operation. I've updated all in-kernel code to use this everywhere. Note that I (intentionally) have not pushed through the rename blindly through to the lowest levels: the actual jump-label patching arch facility should be named like that, so we want to decouple jump labels from the static-key facility a bit. On non-jump-label enabled architectures static keys default to likely()/unlikely() branches. Signed-off-by: Ingo Molnar <mingo@elte.hu> Acked-by: Jason Baron <jbaron@redhat.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: a.p.zijlstra@chello.nl Cc: mathieu.desnoyers@efficios.com Cc: davem@davemloft.net Cc: ddaney.cavm@gmail.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20120222085809.GA26397@elte.hu Signed-off-by: Ingo Molnar <mingo@elte.hu> 2012-02-24 08:31:31 +01:00			`# include <linux/static_key.h>`
sched: Make separate sched.c translation units Since once needs to do something at conferences and fixing compile warnings doesn't actually require much if any attention I decided to break up the sched.c #include ".c" fest. This further modularizes the scheduler code. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-x0fcd3mnp8f9c99grcpewmhi@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu> 2011-10-25 10:00:11 +02:00			`# define const_debug __read_mostly`
			`#else`
			`# define const_debug const`
			`#endif`

			`extern const_debug unsigned int sysctl_sched_features;`

			`#define SCHED_FEAT(name, enabled) \`
			`__SCHED_FEAT_##name ,`

			`enum {`
sched: Move all scheduler bits into kernel/sched/ There's too many sched*.[ch] files in kernel/, give them their own directory. (No code changed, other than Makefile glue added.) Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu> 2011-11-15 17:14:39 +01:00			`#include "features.h"`
sched: Use jump_labels for sched_feat Now that we initialize jump_labels before sched_init() we can use them for the debug features without having to worry about a window where they have the wrong setting. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-vpreo4hal9e0kzqmg5y0io2k@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu> 2011-07-06 14:20:14 +02:00			`__SCHED_FEAT_NR,`
sched: Make separate sched.c translation units Since once needs to do something at conferences and fixing compile warnings doesn't actually require much if any attention I decided to break up the sched.c #include ".c" fest. This further modularizes the scheduler code. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-x0fcd3mnp8f9c99grcpewmhi@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu> 2011-10-25 10:00:11 +02:00			`};`

			`#undef SCHED_FEAT`

sched: Use jump_labels for sched_feat Now that we initialize jump_labels before sched_init() we can use them for the debug features without having to worry about a window where they have the wrong setting. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-vpreo4hal9e0kzqmg5y0io2k@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu> 2011-07-06 14:20:14 +02:00			`#if defined(CONFIG_SCHED_DEBUG) && defined(HAVE_JUMP_LABEL)`
			`#define SCHED_FEAT(name, enabled) \`
static keys: Introduce 'struct static_key', static_key_true()/false() and static_key_slow_[inc\|dec]() So here's a boot tested patch on top of Jason's series that does all the cleanups I talked about and turns jump labels into a more intuitive to use facility. It should also address the various misconceptions and confusions that surround jump labels. Typical usage scenarios: #include <linux/static_key.h> struct static_key key = STATIC_KEY_INIT_TRUE; if (static_key_false(&key)) do unlikely code else do likely code Or: if (static_key_true(&key)) do likely code else do unlikely code The static key is modified via: static_key_slow_inc(&key); ... static_key_slow_dec(&key); The 'slow' prefix makes it abundantly clear that this is an expensive operation. I've updated all in-kernel code to use this everywhere. Note that I (intentionally) have not pushed through the rename blindly through to the lowest levels: the actual jump-label patching arch facility should be named like that, so we want to decouple jump labels from the static-key facility a bit. On non-jump-label enabled architectures static keys default to likely()/unlikely() branches. Signed-off-by: Ingo Molnar <mingo@elte.hu> Acked-by: Jason Baron <jbaron@redhat.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: a.p.zijlstra@chello.nl Cc: mathieu.desnoyers@efficios.com Cc: davem@davemloft.net Cc: ddaney.cavm@gmail.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20120222085809.GA26397@elte.hu Signed-off-by: Ingo Molnar <mingo@elte.hu> 2012-02-24 08:31:31 +01:00			`static __always_inline bool static_branch_##name(struct static_key *key) \`
sched: Use jump_labels for sched_feat Now that we initialize jump_labels before sched_init() we can use them for the debug features without having to worry about a window where they have the wrong setting. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-vpreo4hal9e0kzqmg5y0io2k@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu> 2011-07-06 14:20:14 +02:00			`{ \`
sched: Remove extra static_key*() function indirection I think its a bit simpler without having to follow an extra layer of static inline fuctions. No functional change just cosmetic. Signed-off-by: Jason Baron <jbaron@akamai.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: rostedt@goodmis.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/2ce52233ce200faad93b6029d90f1411cd926667.1404315388.git.jbaron@akamai.com Signed-off-by: Ingo Molnar <mingo@kernel.org> 2014-07-02 15:52:41 +00:00			`return static_key_##enabled(key); \`
sched: Use jump_labels for sched_feat Now that we initialize jump_labels before sched_init() we can use them for the debug features without having to worry about a window where they have the wrong setting. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-vpreo4hal9e0kzqmg5y0io2k@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu> 2011-07-06 14:20:14 +02:00			`}`

			`#include "features.h"`

			`#undef SCHED_FEAT`

static keys: Introduce 'struct static_key', static_key_true()/false() and static_key_slow_[inc\|dec]() So here's a boot tested patch on top of Jason's series that does all the cleanups I talked about and turns jump labels into a more intuitive to use facility. It should also address the various misconceptions and confusions that surround jump labels. Typical usage scenarios: #include <linux/static_key.h> struct static_key key = STATIC_KEY_INIT_TRUE; if (static_key_false(&key)) do unlikely code else do likely code Or: if (static_key_true(&key)) do likely code else do unlikely code The static key is modified via: static_key_slow_inc(&key); ... static_key_slow_dec(&key); The 'slow' prefix makes it abundantly clear that this is an expensive operation. I've updated all in-kernel code to use this everywhere. Note that I (intentionally) have not pushed through the rename blindly through to the lowest levels: the actual jump-label patching arch facility should be named like that, so we want to decouple jump labels from the static-key facility a bit. On non-jump-label enabled architectures static keys default to likely()/unlikely() branches. Signed-off-by: Ingo Molnar <mingo@elte.hu> Acked-by: Jason Baron <jbaron@redhat.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: a.p.zijlstra@chello.nl Cc: mathieu.desnoyers@efficios.com Cc: davem@davemloft.net Cc: ddaney.cavm@gmail.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20120222085809.GA26397@elte.hu Signed-off-by: Ingo Molnar <mingo@elte.hu> 2012-02-24 08:31:31 +01:00			`extern struct static_key sched_feat_keys[__SCHED_FEAT_NR];`
sched: Use jump_labels for sched_feat Now that we initialize jump_labels before sched_init() we can use them for the debug features without having to worry about a window where they have the wrong setting. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-vpreo4hal9e0kzqmg5y0io2k@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu> 2011-07-06 14:20:14 +02:00			`#define sched_feat(x) (static_branch_##x(&sched_feat_keys[__SCHED_FEAT_##x]))`
			`#else /* !(SCHED_DEBUG && HAVE_JUMP_LABEL) */`
sched: Make separate sched.c translation units Since once needs to do something at conferences and fixing compile warnings doesn't actually require much if any attention I decided to break up the sched.c #include ".c" fest. This further modularizes the scheduler code. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-x0fcd3mnp8f9c99grcpewmhi@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu> 2011-10-25 10:00:11 +02:00			`#define sched_feat(x) (sysctl_sched_features & (1UL << __SCHED_FEAT_##x))`
sched: Use jump_labels for sched_feat Now that we initialize jump_labels before sched_init() we can use them for the debug features without having to worry about a window where they have the wrong setting. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-vpreo4hal9e0kzqmg5y0io2k@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu> 2011-07-06 14:20:14 +02:00			`#endif /* SCHED_DEBUG && HAVE_JUMP_LABEL */`
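The `SCHED_FEAT()` block above relies on expanding the same feature list several times with different macro definitions (an "X-macro"). A minimal stand-alone sketch of the bitmask variant follows. Everything prefixed `DEMO_` or `demo_` is hypothetical: the feature names are invented, and the list is an inline macro here rather than a separately included `features.h`.

```c
#include <stdbool.h>

/* Hypothetical feature list; it plays the role that features.h plays above. */
#define DEMO_FEATURES(F) \
	F(ALPHA, true)   \
	F(BETA,  false)  \
	F(GAMMA, true)

/* First expansion: one enum constant per feature, plus a trailing count. */
#define DEMO_FEAT(name, enabled) DEMO_FEAT_ID_##name,
enum {
	DEMO_FEATURES(DEMO_FEAT)
	DEMO_FEAT_NR,
};
#undef DEMO_FEAT

/* Second expansion: OR together the bits of the features enabled by default. */
#define DEMO_FEAT(name, enabled) \
	((enabled) ? (1UL << DEMO_FEAT_ID_##name) : 0UL) |
unsigned long demo_features = DEMO_FEATURES(DEMO_FEAT) 0UL;
#undef DEMO_FEAT

/* Bitmask test, shaped like the non-jump-label sched_feat() definition. */
#define demo_feat(x) ((demo_features & (1UL << DEMO_FEAT_ID_##x)) != 0)
```

With this list, `demo_feat(ALPHA)` and `demo_feat(GAMMA)` are true by default and `demo_feat(BETA)` is false; clearing a bit in `demo_features` turns the corresponding feature off at run time. The jump-label variant in the source replaces the bit test with a per-feature `static_key` lookup, which this sketch does not model.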
sched/numa: Convert sched_numa_balancing to a static_branch Variable sched_numa_balancing toggles numa_balancing feature. Hence moving from a simple read mostly variable to a more apt static_branch. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Mike Galbraith <efault@gmx.de> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1439310261-16124-1-git-send-email-srikar@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> 2015-08-11 21:54:21 +05:30			`extern struct static_key_false sched_numa_balancing;`
mm: numa: Add fault driven placement and migration NOTE: This patch is based on "sched, numa, mm: Add fault driven placement and migration policy" but as it throws away all the policy to just leave a basic foundation I had to drop the signed-offs-by. This patch creates a bare-bones method for setting PTEs pte_numa in the context of the scheduler that when faulted later will be faulted onto the node the CPU is running on. In itself this does nothing useful but any placement policy will fundamentally depend on receiving hints on placement from fault context and doing something intelligent about it. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> 2012-10-25 14:16:43 +02:00
sched: Make separate sched.c translation units Since once needs to do something at conferences and fixing compile warnings doesn't actually require much if any attention I decided to break up the sched.c #include ".c" fest. This further modularizes the scheduler code. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-x0fcd3mnp8f9c99grcpewmhi@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu> 2011-10-25 10:00:11 +02:00			`static inline u64 global_rt_period(void)`
			`{`
			`return (u64)sysctl_sched_rt_period * NSEC_PER_USEC;`
			`}`

			`static inline u64 global_rt_runtime(void)`
			`{`
			`if (sysctl_sched_rt_runtime < 0)`
			`return RUNTIME_INF;`

			`return (u64)sysctl_sched_rt_runtime * NSEC_PER_USEC;`
			`}`
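The two helpers above convert microsecond sysctl values to nanoseconds, with a negative runtime meaning "no limit". A small user-space sketch of the same arithmetic is given below. The `demo_` names are hypothetical, the initial values are illustrative rather than read from this tree, and `UINT64_MAX` is assumed as the "infinite" sentinel.

```c
#include <stdint.h>

#define DEMO_NSEC_PER_USEC 1000ULL
#define DEMO_RUNTIME_INF   UINT64_MAX /* assumed sentinel for "no limit" */

/* Hypothetical stand-ins for the sysctl values, both in microseconds. */
unsigned int demo_rt_period_us = 1000000; /* illustrative: 1 s */
int demo_rt_runtime_us = 950000;          /* illustrative: 0.95 s */

/* Period in nanoseconds; widen before multiplying to avoid 32-bit overflow. */
uint64_t demo_global_rt_period(void)
{
	return (uint64_t)demo_rt_period_us * DEMO_NSEC_PER_USEC;
}

/* Runtime in nanoseconds, or the sentinel when the setting is negative. */
uint64_t demo_global_rt_runtime(void)
{
	if (demo_rt_runtime_us < 0)
		return DEMO_RUNTIME_INF;

	return (uint64_t)demo_rt_runtime_us * DEMO_NSEC_PER_USEC;
}
```

The cast to a 64-bit type before the multiplication matters: a period of 5,000,000 µs, for example, gives 5,000,000,000 ns, which does not fit in 32 bits.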

			`static inline int task_current(struct rq *rq, struct task_struct *p)`
			`{`
			`return rq->curr == p;`
			`}`

			`static inline int task_running(struct rq *rq, struct task_struct *p)`
			`{`
			`#ifdef CONFIG_SMP`
			`return p->on_cpu;`
			`#else`
			`return task_current(rq, p);`
			`#endif`
			`}`

sched: Add wrapper for checking task_struct::on_rq Implement task_on_rq_queued() and use it everywhere instead of on_rq check. No functional changes. The only exception is we do not use the wrapper in check_for_tasks(), because it requires to export task_on_rq_queued() in global header files. Next patch in series would return it back, so we do not twist it from here to there. Signed-off-by: Kirill Tkhai <ktkhai@parallels.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Paul Turner <pjt@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Kirill Tkhai <tkhai@yandex.ru> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Nicolas Pitre <nicolas.pitre@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1408528052.23412.87.camel@tkhai Signed-off-by: Ingo Molnar <mingo@kernel.org> 2014-08-20 13:47:32 +04:00			`static inline int task_on_rq_queued(struct task_struct *p)`
			`{`
			`return p->on_rq == TASK_ON_RQ_QUEUED;`
			`}`
sched: Teach scheduler to understand TASK_ON_RQ_MIGRATING state This is a new p->on_rq state which will be used to indicate that a task is in a process of migrating between two RQs. It allows to get rid of double_rq_lock(), which we used to use to change a rq of a queued task before. Let's consider an example. To move a task between src_rq and dst_rq we will do the following: raw_spin_lock(&src_rq->lock); /* p is a task which is queued on src_rq */ p = ...; dequeue_task(src_rq, p, 0); p->on_rq = TASK_ON_RQ_MIGRATING; set_task_cpu(p, dst_cpu); raw_spin_unlock(&src_rq->lock); /* Both RQs are unlocked here. * Task p is dequeued from src_rq * but its on_rq value is not zero. */ raw_spin_lock(&dst_rq->lock); p->on_rq = TASK_ON_RQ_QUEUED; enqueue_task(dst_rq, p, 0); raw_spin_unlock(&dst_rq->lock); While p->on_rq is TASK_ON_RQ_MIGRATING, task is considered as "migrating", and other parallel scheduler actions with it are not available to parallel callers. The parallel caller is spining till migration is completed. The unavailable actions are changing of cpu affinity, changing of priority etc, in other words all the functionality which used to require task_rq(p)->lock before (and related to the task). To implement TASK_ON_RQ_MIGRATING support we primarily are using the following fact. Most of scheduler users (from which we are protecting a migrating task) use task_rq_lock() and __task_rq_lock() to get the lock of task_rq(p). These primitives know that task's cpu may change, and they are spining while the lock of the right RQ is not held. We add one more condition into them, so they will be also spinning until the migration is finished. 
Signed-off-by: Kirill Tkhai <ktkhai@parallels.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Paul Turner <pjt@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Kirill Tkhai <tkhai@yandex.ru> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Nicolas Pitre <nicolas.pitre@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1408528062.23412.88.camel@tkhai Signed-off-by: Ingo Molnar <mingo@kernel.org> 2014-08-20 13:47:42 +04:00			`static inline int task_on_rq_migrating(struct task_struct *p)`
			`{`
			`return p->on_rq == TASK_ON_RQ_MIGRATING;`
			`}`
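The commit message attached to `task_on_rq_migrating()` describes a two-step hand-off between run queues. The sketch below models only the `on_rq` state transitions in userspace. `struct demo_task`, the `demo_*` helpers, and the numeric values 1 and 2 are assumptions for illustration, and the run-queue locking the message relies on is deliberately left out.

```c
#include <stdbool.h>

/* Assumed values; the real definitions are not shown in this part of the file. */
#define DEMO_ON_RQ_QUEUED    1
#define DEMO_ON_RQ_MIGRATING 2

/* Hypothetical, heavily reduced stand-in for struct task_struct. */
struct demo_task {
	int on_rq;	/* 0, DEMO_ON_RQ_QUEUED or DEMO_ON_RQ_MIGRATING */
	int cpu;	/* CPU whose run queue the task belongs to */
};

static inline bool demo_on_rq_queued(const struct demo_task *p)
{
	return p->on_rq == DEMO_ON_RQ_QUEUED;
}

static inline bool demo_on_rq_migrating(const struct demo_task *p)
{
	return p->on_rq == DEMO_ON_RQ_MIGRATING;
}

/* Step 1 (would run under the source run-queue lock): detach and retarget. */
static inline void demo_migrate_begin(struct demo_task *p, int dst_cpu)
{
	p->on_rq = DEMO_ON_RQ_MIGRATING;
	p->cpu = dst_cpu;
}

/* Step 2 (would run under the destination run-queue lock): requeue. */
static inline void demo_migrate_finish(struct demo_task *p)
{
	p->on_rq = DEMO_ON_RQ_QUEUED;
}
```

Between the two steps the task is on neither queue, yet `on_rq` is non-zero, which is the window in which other callers are expected to wait.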

			`#ifndef prepare_arch_switch`
			`# define prepare_arch_switch(next) do { } while (0)`
			`#endif`
sched/arch: Introduce the finish_arch_post_lock_switch() scheduler callback This callback is called by the scheduler after rq->lock has been released and interrupts enabled. It will be used in subsequent patches on the ARM architecture. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Will Deacon <will.deacon@arm.com> Reviewed-by: Frank Rowand <frank.rowand@am.sony.com> Tested-by: Will Deacon <will.deacon@arm.com> Tested-by: Marc Zyngier <Marc.Zyngier@arm.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/20120313110840.7b444deb6b1bb902c15f3cdf@canb.auug.org.au Signed-off-by: Ingo Molnar <mingo@elte.hu> 2011-11-27 21:43:10 +00:00			`#ifndef finish_arch_post_lock_switch`
			`# define finish_arch_post_lock_switch() do { } while (0)`
			`#endif`

			`static inline void prepare_lock_switch(struct rq *rq, struct task_struct *next)`
			`{`
			`#ifdef CONFIG_SMP`
			`/*`
			`* We can optimise this out completely for !SMP, because the`
			`* SMP rebalancing from interrupt is the only thing that cares`
			`* here.`
			`*/`
			`next->on_cpu = 1;`
			`#endif`
			`}`

			`static inline void finish_lock_switch(struct rq *rq, struct task_struct *prev)`
			`{`
			`#ifdef CONFIG_SMP`
			`/*`
			`* After ->on_cpu is cleared, the task can be moved to a different CPU.`
			`* We must ensure this doesn't happen until the switch is completely`
			`* finished.`
sched/core: Fix TASK_DEAD race in finish_task_switch() So the problem this patch is trying to address is as follows: CPU0 CPU1 context_switch(A, B) ttwu(A) LOCK A->pi_lock A->on_cpu == 0 finish_task_switch(A) prev_state = A->state <-. WMB \| A->on_cpu = 0; \| UNLOCK rq0->lock \| \| context_switch(C, A) `-- A->state = TASK_DEAD prev_state == TASK_DEAD put_task_struct(A) context_switch(A, C) finish_task_switch(A) A->state == TASK_DEAD put_task_struct(A) The argument being that the WMB will allow the load of A->state on CPU0 to cross over and observe CPU1's store of A->state, which will then result in a double-drop and use-after-free. Now the comment states (and this was true once upon a long time ago) that we need to observe A->state while holding rq->lock because that will order us against the wakeup; however the wakeup will not in fact acquire (that) rq->lock; it takes A->pi_lock these days. We can obviously fix this by upgrading the WMB to an MB, but that is expensive, so we'd rather avoid that. The alternative this patch takes is: smp_store_release(&A->on_cpu, 0), which avoids the MB on some archs, but not important ones like ARM. Reported-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: <stable@vger.kernel.org> # v3.1+ Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Cc: manfred@colorfullife.com Cc: will.deacon@arm.com Fixes: e4a52bcb9a18 ("sched: Remove rq->lock from the first half of ttwu()") Link: http://lkml.kernel.org/r/20150929124509.GG3816@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org> 2015-09-29 14:45:09 +02:00			`*`
sched/core: Better document the try_to_wake_up() barriers Explain how the control dependency and smp_rmb() end up providing ACQUIRE semantics and pair with smp_store_release() in finish_lock_switch(). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> 2015-10-06 14:36:17 +02:00			`* In particular, the load of prev->state in finish_task_switch() must`
			`* happen before this.`
			`*`
			`* Pairs with the control dependency and rmb in try_to_wake_up().`
			`*/`
			`smp_store_release(&prev->on_cpu, 0);`
			`#endif`
			`#ifdef CONFIG_DEBUG_SPINLOCK`
			`/* this is a valid case when another task releases the spinlock */`
			`rq->lock.owner = current;`
			`#endif`
			`/*`
			`* If we are tracking spinlock dependencies then we have to`
			`* fix up the runqueue lock - which gets 'carried over' from`
			`* prev into current:`
			`*/`
			`spin_acquire(&rq->lock.dep_map, 0, 0, _THIS_IP_);`

			`raw_spin_unlock_irq(&rq->lock);`
			`}`

sched: Move wake flags to kernel/sched/sched.h They are used internally only. Signed-off-by: Li Zefan <lizefan@huawei.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/5135A78E.7040609@huawei.com Signed-off-by: Ingo Molnar <mingo@kernel.org> 2013-03-05 16:06:38 +08:00			`/*`
			`* wake flags`
			`*/`
			`#define WF_SYNC 0x01 /* waker goes to sleep after wakeup */`
			`#define WF_FORK 0x02 /* child wakeup after fork */`
			`#define WF_MIGRATED 0x4 /* internal use, task got migrated */`
sched: Provide a wake up API without sending freq notifications Each time a task wakes up, scheduler evaluates its load and notifies governor if the resulting frequency of destination CPU is larger than a threshold. However, some governor wakes up a separate task that handles frequency change, which again calls wake_up_process(). This is dangerous because if the task being woken up meets the threshold and ends up being moved around, there is a potential for endless recursive notifications. Introduce a new API for waking up a task without triggering frequency notification. Change-Id: I24261af81b7dc410c7fb01eaa90920b8d66fbd2a Signed-off-by: Junjie Wu <junjiew@codeaurora.org> 2016-01-05 10:53:30 -08:00			`#define WF_NO_NOTIFIER 0x08 /* do not notify governor */`
			`/*`
			`* To aid in avoiding the subversion of "niceness" due to uneven distribution`
			`* of tasks with abnormal "nice" values across CPUs the contribution that`
			`* each task makes to its run queue's load is weighted according to its`
			`* scheduling class and "nice" value. For SCHED_NORMAL tasks this is just a`
			`* scaled version of the new time slice allocation that they receive on time`
			`* slice expiry etc.`
			`*/`

			`#define WEIGHT_IDLEPRIO 3`
			`#define WMULT_IDLEPRIO 1431655765`

			`/*`
			`* Nice levels are multiplicative, with a gentle 10% change for every`
			`* nice level changed. I.e. when a CPU-bound task goes from nice 0 to`
			`* nice 1, it will get ~10% less CPU time than another CPU-bound task`
			`* that remained on nice 0.`
			`*`
			`* The "10% effect" is relative and cumulative: from _any_ nice level,`
			`* if you go up 1 level, it's -10% CPU usage, if you go down 1 level`
			`* it's +10% CPU usage. (to achieve that we use a multiplier of 1.25.`
			`* If a task goes up by ~10% and another task goes down by ~10% then`
			`* the relative distance between them is ~25%.)`
			`*/`
			`static const int prio_to_weight[40] = {`
			`/* -20 */ 88761, 71755, 56483, 46273, 36291,`
			`/* -15 */ 29154, 23254, 18705, 14949, 11916,`
			`/* -10 */ 9548, 7620, 6100, 4904, 3906,`
			`/* -5 */ 3121, 2501, 1991, 1586, 1277,`
			`/* 0 */ 1024, 820, 655, 526, 423,`
			`/* 5 */ 335, 272, 215, 172, 137,`
			`/* 10 */ 110, 87, 70, 56, 45,`
			`/* 15 */ 36, 29, 23, 18, 15,`
			`};`
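To make the "10% effect" described above concrete, the sketch below computes how two always-runnable tasks would split one CPU if CPU time were divided strictly in proportion to weight. The table is copied from above; the `demo_*` names and the per-mille share helper are hypothetical and are not part of the scheduler.

```c
/* Copy of the weight table above, indexed by nice + 20. */
static const int demo_prio_to_weight[40] = {
	/* -20 */ 88761, 71755, 56483, 46273, 36291,
	/* -15 */ 29154, 23254, 18705, 14949, 11916,
	/* -10 */  9548,  7620,  6100,  4904,  3906,
	/*  -5 */  3121,  2501,  1991,  1586,  1277,
	/*   0 */  1024,   820,   655,   526,   423,
	/*   5 */   335,   272,   215,   172,   137,
	/*  10 */   110,    87,    70,    56,    45,
	/*  15 */    36,    29,    23,    18,    15,
};

/* Map a nice value in [-20, 19] to its weight. */
static inline int demo_nice_to_weight(int nice)
{
	return demo_prio_to_weight[nice + 20];
}

/*
 * Share of the CPU, in per-mille (rounded down), that task A would get
 * when competing only with task B under purely weight-proportional sharing.
 */
static inline int demo_share_permille(int nice_a, int nice_b)
{
	int wa = demo_nice_to_weight(nice_a);
	int wb = demo_nice_to_weight(nice_b);

	return (1000 * wa) / (wa + wb);
}
```

For nice 0 against nice 1 this gives 555 versus 444 per-mille, so each task moves about 5 percentage points away from an even split, and the ratio of the two weights (1024 / 820) is close to the 1.25 multiplier mentioned in the comment.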

			`/*`
			`* Inverse (2^32/x) values of the prio_to_weight[] array, precalculated.`
			`*`
			`* In cases where the weight does not change often, we can use the`
			`* precalculated inverse to speed up arithmetics by turning divisions`
			`* into multiplications:`
			`*/`
			`static const u32 prio_to_wmult[40] = {`
			`/* -20 */ 48388, 59856, 76040, 92818, 118348,`
			`/* -15 */ 147320, 184698, 229616, 287308, 360437,`
			`/* -10 */ 449829, 563644, 704093, 875809, 1099582,`
			`/* -5 */ 1376151, 1717300, 2157191, 2708050, 3363326,`
			`/* 0 */ 4194304, 5237765, 6557202, 8165337, 10153587,`
			`/* 5 */ 12820798, 15790321, 19976592, 24970740, 31350126,`
			`/* 10 */ 39045157, 49367440, 61356676, 76695844, 95443717,`
			`/* 15 */ 119304647, 148102320, 186737708, 238609294, 286331153,`
			`};`
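The comment above says the precalculated inverses turn divisions into multiplications. A minimal sketch of that trick, assuming a 32-bit fixed-point inverse and an input small enough that the 64-bit product cannot overflow, could look like this. `demo_div_by_weight()` is a hypothetical helper, not the routine the scheduler itself uses.

```c
#include <stdint.h>

/*
 * Approximate x / weight as (x * wmult) >> 32, where wmult is the
 * table entry holding roughly 2^32 / weight.
 * Assumes x < 2^32 so that x * wmult fits in 64 bits.
 */
static inline uint64_t demo_div_by_weight(uint64_t x, uint32_t wmult)
{
	return (x * (uint64_t)wmult) >> 32;
}
```

With the nice 0 entry (weight 1024, inverse 4194304, which is exactly 2^22) the result is the same as shifting right by 10. With the nice 1 entry (weight 820, inverse 5237765) an input of 1024000 yields 1248, matching the integer quotient 1024000 / 820. Because the inverses are rounded, other inputs may differ from a true division by one unit.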

sched/rt: Fix PI handling vs. sched_setscheduler() Andrea Parri reported: > I found that the following scenario (with CONFIG_RT_GROUP_SCHED=y) is not > handled correctly: > > T1 (prio = 20) > lock(rtmutex); > > T2 (prio = 20) > blocks on rtmutex (rt_nr_boosted = 0 on T1's rq) > > T1 (prio = 20) > sys_set_scheduler(prio = 0) > [new_effective_prio == oldprio] > T1 prio = 20 (rt_nr_boosted = 0 on T1's rq) > > The last step is incorrect as T1 is now boosted (c.f., rt_se_boosted()); > in particular, if we continue with > > T1 (prio = 20) > unlock(rtmutex) > wakeup(T2) > adjust_prio(T1) > [prio != rt_mutex_getprio(T1)] > dequeue(T1) > rt_nr_boosted = (unsigned long)(-1) > ... > T1 prio = 0 > > then we end up leaving rt_nr_boosted in an "inconsistent" state. > > The simple program attached could reproduce the previous scenario; note > that, as a consequence of the presence of this state, the "assertion" > > WARN_ON(!rt_nr_running && rt_nr_boosted) > > from dec_rt_group() may trigger. So normally we dequeue/enqueue tasks in sched_setscheduler(), which would ensure the accounting stays correct. However in the early PI path we fail to do so. So this was introduced at around v3.14, by: c365c292d059 ("sched: Consider pi boosting in setscheduler()") which fixed another problem exactly because that dequeue/enqueue, joy. Fix this by teaching rt about DEQUEUE_SAVE/ENQUEUE_RESTORE and have it preserve runqueue location with that option. This requires decoupling the on_rt_rq() state from being on the list. In order to allow for explicit movement during the SAVE/RESTORE, introduce {DE,EN}QUEUE_MOVE. We still must use SAVE/RESTORE in these cases to preserve other invariants. Respecting the SAVE/RESTORE flags also has the (nice) side-effect that things like sys_nice()/sys_sched_setaffinity() also do not reorder FIFO tasks (whereas they used to before this patch). 
Change-Id: I1450923252f55dba19f450008db813113eb06c76 Reported-by: Andrea Parri <parri.andrea@gmail.com> Tested-by: Andrea Parri <parri.andrea@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> [pkondeti@codeaurora.org: Fix trivial merge conflict] Git-commit: ff77e468535987b3d21b7bd4da15608ea3ce7d0b Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org> 2016-01-18 15:27:07 +01:00			`/*`
			`* {de,en}queue flags:`
			`*`
			`* DEQUEUE_SLEEP - task is no longer runnable`
			`* ENQUEUE_WAKEUP - task just became runnable`
			`*`
			`* SAVE/RESTORE - an otherwise spurious dequeue/enqueue, done to ensure tasks`
			`* are in a known state which allows modification. Such pairs`
			`* should preserve as much state as possible.`
			`*`
			`* MOVE - paired with SAVE/RESTORE, explicitly does not preserve the location`
			`* in the runqueue.`
			`*`
			`* ENQUEUE_HEAD - place at front of runqueue (tail if not specified)`
			`* ENQUEUE_REPLENISH - CBS (replenish runtime and postpone deadline)`
			`* ENQUEUE_WAKING - sched_class::task_waking was called`
			`*`
			`*/`

			`#define DEQUEUE_SLEEP 0x01`
			`#define DEQUEUE_SAVE 0x02 /* matches ENQUEUE_RESTORE */`
			`#define DEQUEUE_MOVE 0x04 /* matches ENQUEUE_MOVE */`

sched/core: Fix task and run queue sched_info::run_delay inconsistencies Mike Meyer reported the following bug: > During evaluation of some performance data, it was discovered thread > and run queue run_delay accounting data was inconsistent with the other > accounting data that was collected. Further investigation found under > certain circumstances execution time was leaking into the task and > run queue accounting of run_delay. > > Consider the following sequence: > > a. thread is running. > b. thread moves beween cgroups, changes scheduling class or priority. > c. thread sleeps OR > d. thread involuntarily gives up cpu. > > a. implies: > > thread->sched_info.last_queued = 0 > > a. and b. results in the following: > > 1. dequeue_task(rq, thread) > > sched_info_dequeued(rq, thread) > delta = 0 > > sched_info_reset_dequeued(thread) > thread->sched_info.last_queued = 0 > > thread->sched_info.run_delay += delta > > 2. enqueue_task(rq, thread) > > sched_info_queued(rq, thread) > > /* thread is still on cpu at this point. */ > thread->sched_info.last_queued = task_rq(thread)->clock; > > c. results in: > > dequeue_task(rq, thread) > > sched_info_dequeued(rq, thread) > > /* delta is execution time not run_delay. */ > delta = task_rq(thread)->clock - thread->sched_info.last_queued > > sched_info_reset_dequeued(thread) > thread->sched_info.last_queued = 0 > > thread->sched_info.run_delay += delta > > Since thread was running between enqueue_task(rq, thread) and > dequeue_task(rq, thread), the delta above is really execution > time and not run_delay. > > d. 
results in: > > __sched_info_switch(thread, next_thread) > > sched_info_depart(rq, thread) > > sched_info_queued(rq, thread) > > /* last_queued not updated due to being non-zero */ > return > > Since thread was running between enqueue_task(rq, thread) and > __sched_info_switch(thread, next_thread), the execution time > between enqueue_task(rq, thread) and > __sched_info_switch(thread, next_thread) now will become > associated with run_delay due to when last_queued was last updated. > This alternative patch solves the problem by not calling sched_info_{de,}queued() in {de,en}queue_task(). Therefore the sched_info state is preserved and things work as expected. By inlining the {de,en}queue_task() functions the new condition becomes (mostly) a compile-time constant and we'll not emit any new branch instructions. It even shrinks the code (due to inlining {en,de}queue_task()): $ size defconfig-build/kernel/sched/core.o defconfig-build/kernel/sched/core.o.orig text data bss dec hex filename 64019 23378 2344 89741 15e8d defconfig-build/kernel/sched/core.o 64149 23378 2344 89871 15f0f defconfig-build/kernel/sched/core.o.orig Reported-by: Mike Meyer <Mike.Meyer@Teradata.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/20150930154413.GO3604@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org> 2015-09-30 17:44:13 +02:00			`#define ENQUEUE_WAKEUP 0x01`
			`#define ENQUEUE_RESTORE 0x02`
			`#define ENQUEUE_MOVE 0x04`

			`#define ENQUEUE_HEAD 0x08`
			`#define ENQUEUE_REPLENISH 0x10`
sched: Move struct sched_class to kernel/sched/sched.h It's used internally only. Signed-off-by: Li Zefan <lizefan@huawei.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/5135A79F.8090502@huawei.com Signed-off-by: Ingo Molnar <mingo@kernel.org> 2013-03-05 16:06:55 +08:00			`#ifdef CONFIG_SMP`
			`#define ENQUEUE_WAKING 0x20`
			`#else`
			`#define ENQUEUE_WAKING 0x00`
sched: Move struct sched_class to kernel/sched/sched.h It's used internally only. Signed-off-by: Li Zefan <lizefan@huawei.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/5135A79F.8090502@huawei.com Signed-off-by: Ingo Molnar <mingo@kernel.org> 2013-03-05 16:06:55 +08:00			`#endif`
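The run_delay leak described in the sched_info commit message above can be reproduced in a few lines of userspace C. This is only a sketch under stated assumptions: `struct info`, `demo_queued()`, `demo_dequeued()` and `simulate()` are hypothetical stand-ins written for illustration, not kernel functions, and the `account_requeue` switch merely models whether the requeue in step b touches the sched_info state.

```c
#include <assert.h>

/* Hypothetical stand-in for the two sched_info fields involved. */
struct info {
	unsigned long long last_queued;	/* 0 means "not waiting on a runqueue" */
	unsigned long long run_delay;	/* accumulated time spent waiting */
};

/* Stamp the queue time only if no stamp is pending, as the report describes. */
static void demo_queued(struct info *t, unsigned long long now)
{
	if (!t->last_queued)
		t->last_queued = now;
}

/* Charge the time since the pending stamp (if any) to run_delay. */
static void demo_dequeued(struct info *t, unsigned long long now)
{
	unsigned long long delta = 0;

	if (t->last_queued)
		delta = now - t->last_queued;
	t->last_queued = 0;
	t->run_delay += delta;
}

/*
 * Replays steps a, b and c from the report.
 * account_requeue != 0 models the old behaviour (the requeue updates
 * sched_info); account_requeue == 0 models the fix (state is preserved).
 */
static unsigned long long simulate(int account_requeue)
{
	struct info t = { 0, 0 };	/* a. thread is running */

	/* b. class, cgroup or priority change at time 100: dequeue + enqueue */
	if (account_requeue) {
		demo_dequeued(&t, 100);
		demo_queued(&t, 100);
	}

	/* c. thread sleeps at time 150, after 50 units of execution */
	demo_dequeued(&t, 150);
	return t.run_delay;
}
```

With the requeue accounted, the 50 units the thread spent executing are charged as run_delay; with the state preserved, run_delay stays at zero.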
sched/rt: Fix PI handling vs. sched_setscheduler() Andrea Parri reported: > I found that the following scenario (with CONFIG_RT_GROUP_SCHED=y) is not > handled correctly: > > T1 (prio = 20) > lock(rtmutex); > > T2 (prio = 20) > blocks on rtmutex (rt_nr_boosted = 0 on T1's rq) > > T1 (prio = 20) > sys_set_scheduler(prio = 0) > [new_effective_prio == oldprio] > T1 prio = 20 (rt_nr_boosted = 0 on T1's rq) > > The last step is incorrect as T1 is now boosted (c.f., rt_se_boosted()); > in particular, if we continue with > > T1 (prio = 20) > unlock(rtmutex) > wakeup(T2) > adjust_prio(T1) > [prio != rt_mutex_getprio(T1)] > dequeue(T1) > rt_nr_boosted = (unsigned long)(-1) > ... > T1 prio = 0 > > then we end up leaving rt_nr_boosted in an "inconsistent" state. > > The simple program attached could reproduce the previous scenario; note > that, as a consequence of the presence of this state, the "assertion" > > WARN_ON(!rt_nr_running && rt_nr_boosted) > > from dec_rt_group() may trigger. So normally we dequeue/enqueue tasks in sched_setscheduler(), which would ensure the accounting stays correct. However in the early PI path we fail to do so. So this was introduced at around v3.14, by: c365c292d059 ("sched: Consider pi boosting in setscheduler()") which fixed another problem exactly because that dequeue/enqueue, joy. Fix this by teaching rt about DEQUEUE_SAVE/ENQUEUE_RESTORE and have it preserve runqueue location with that option. This requires decoupling the on_rt_rq() state from being on the list. In order to allow for explicit movement during the SAVE/RESTORE, introduce {DE,EN}QUEUE_MOVE. We still must use SAVE/RESTORE in these cases to preserve other invariants. Respecting the SAVE/RESTORE flags also has the (nice) side-effect that things like sys_nice()/sys_sched_setaffinity() also do not reorder FIFO tasks (whereas they used to before this patch). 
Change-Id: I1450923252f55dba19f450008db813113eb06c76 Reported-by: Andrea Parri <parri.andrea@gmail.com> Tested-by: Andrea Parri <parri.andrea@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> [pkondeti@codeaurora.org: Fix trivial merge conflict] Git-commit: ff77e468535987b3d21b7bd4da15608ea3ce7d0b Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org> 2016-01-18 15:27:07 +01:00			`#define ENQUEUE_WAKEUP_NEW 0x40`
sched: Guarantee task priority in pick_next_task() Michael spotted that the idle_balance() push down created a task priority problem. Previously, when we called idle_balance() before pick_next_task() it wasn't a problem when -- because of the rq->lock droppage -- an rt/dl task slipped in. Similarly for pre_schedule(), rt pre-schedule could have a dl task slip in. But by pulling it into the pick_next_task() loop, we'll not try a higher task priority again. Cure this by creating a re-start condition in pick_next_task(); and triggering this from pick_next_task_{rt,fair}(). It also fixes a live-lock where we get stuck in pick_next_task_fair() due to idle_balance() seeing !0 nr_running but there not actually being any fair tasks about. Reported-by: Michael Wang <wangyun@linux.vnet.ibm.com> Fixes: 38033c37faab ("sched: Push down pre_schedule() and idle_balance()") Tested-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/20140224121218.GR15586@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org> 2014-02-14 12:25:08 +01:00			`#define RETRY_TASK ((void *)-1UL)`

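The restart condition that `RETRY_TASK` enables can be sketched in userspace C. Everything here except the `RETRY_TASK` definition is an assumption made for illustration: `struct demo_class`, the `pick_*` helpers, the `dl_arrives` flag and `demo_pick_next_task()` are hypothetical and only mimic the shape of a priority-ordered class walk that starts over when a lower class notices a higher-priority task has appeared.

```c
#include <assert.h>
#include <stddef.h>

#define RETRY_TASK ((void *)-1UL)

/* Hypothetical miniature task and runqueue. */
struct task { int id; };
struct rq {
	int nr_dl, nr_rt, nr_fair;
	int dl_arrives;	/* a deadline task shows up while rt is picking */
	int rt_calls;
};

struct demo_class {
	const struct demo_class *next;	/* next lower-priority class */
	struct task *(*pick_next_task)(struct rq *rq, struct task *prev);
};

static struct task dl_task = { 1 }, rt_task = { 2 }, fair_task = { 3 }, idle_task = { 0 };

static struct task *pick_dl(struct rq *rq, struct task *prev)
{
	(void)prev;
	return rq->nr_dl ? &dl_task : NULL;
}

static struct task *pick_rt(struct rq *rq, struct task *prev)
{
	(void)prev;
	/* Simulate a deadline task slipping in once while rt is picking. */
	if (rq->rt_calls++ == 0 && rq->dl_arrives) {
		rq->nr_dl = 1;
		return RETRY_TASK;
	}
	return rq->nr_rt ? &rt_task : NULL;
}

static struct task *pick_fair(struct rq *rq, struct task *prev)
{
	(void)prev;
	return rq->nr_fair ? &fair_task : NULL;
}

static struct task *pick_idle(struct rq *rq, struct task *prev)
{
	(void)rq;
	(void)prev;
	return &idle_task;	/* the idle class always has something to run */
}

static const struct demo_class idle_class = { NULL, pick_idle };
static const struct demo_class fair_class = { &idle_class, pick_fair };
static const struct demo_class rt_class = { &fair_class, pick_rt };
static const struct demo_class dl_class = { &rt_class, pick_dl };

/* Walk classes from highest priority; start over whenever one asks for a retry. */
static struct task *demo_pick_next_task(struct rq *rq, struct task *prev)
{
	const struct demo_class *class;
	struct task *p;

again:
	for (class = &dl_class; class; class = class->next) {
		p = class->pick_next_task(rq, prev);
		if (p) {
			if (p == RETRY_TASK)
				goto again;
			return p;
		}
	}
	return NULL;	/* not reached: idle_class never returns NULL */
}
```

Using a sentinel pointer keeps the method signature unchanged: a class can return a task, NULL for "nothing here", or the sentinel for "look again from the top".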
			`struct sched_class {`
			`const struct sched_class *next;`

			`void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);`
			`void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);`
			`void (*yield_task) (struct rq *rq);`
			`bool (*yield_to_task) (struct rq *rq, struct task_struct *p, bool preempt);`

			`void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int flags);`

sched: Push put_prev_task() into pick_next_task() In order to avoid having to do put/set on a whole cgroup hierarchy when we context switch, push the put into pick_next_task() so that both operations are in the same function. Further changes then allow us to possibly optimize away redundant work. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1328936700.2476.17.camel@laptop Signed-off-by: Ingo Molnar <mingo@kernel.org> 2012-02-11 06:05:00 +01:00			`/*`
			`* It is the responsibility of the pick_next_task() method that will`
			`* return the next task to call put_prev_task() on the @prev task or`
			`* something equivalent.`
			`*`
			`* May return RETRY_TASK when it finds a higher prio class has runnable`
			`* tasks.`
			`*/`
			`struct task_struct * (*pick_next_task) (struct rq *rq,`
			`struct task_struct *prev);`
			`void (*put_prev_task) (struct rq *rq, struct task_struct *p);`

			`#ifdef CONFIG_SMP`
FROMLIST: sched/fair: Use wake_q length as a hint for wake_wide (from https://patchwork.kernel.org/patch/9895261/) This patch adds a parameter to select_task_rq, sibling_count_hint allowing the caller, where it has this information, to inform the sched_class the number of tasks that are being woken up as part of the same event. The wake_q mechanism is one case where this information is available. select_task_rq_fair can then use the information to detect that it needs to widen the search space for task placement in order to avoid overloading the last-level cache domain's CPUs. * * * The reason I am investigating this change is the following use case on ARM big.LITTLE (asymmetrical CPU capacity): 1 task per CPU, which all repeatedly do X amount of work then pthread_barrier_wait (i.e. sleep until the last task finishes its X and hits the barrier). On big.LITTLE, the tasks which get a "big" CPU finish faster, and then those CPUs pull over the tasks that are still running: v CPU v ->time-> ------------- 0 (big) 11111 /333 ------------- 1 (big) 22222 /444\| ------------- 2 (LITTLE) 333333/ ------------- 3 (LITTLE) 444444/ ------------- Now when task 4 hits the barrier (at \|) and wakes the others up, there are 4 tasks with prev_cpu=<big> and 0 tasks with prev_cpu=<little>. want_affine therefore means that we'll only look in CPUs 0 and 1 (sd_llc), so tasks will be unnecessarily coscheduled on the bigs until the next load balance, something like this: v CPU v ->time-> ------------------------ 0 (big) 11111 /333 31313\33333 ------------------------ 1 (big) 22222 /444\|424\4444444 ------------------------ 2 (LITTLE) 333333/ \222222 ------------------------ 3 (LITTLE) 444444/ \1111 ------------------------ ^^^ underutilization So, I'm trying to get want_affine = 0 for these tasks. I don't _think_ any incarnation of the wakee_flips mechanism can help us here because which task is waker and which tasks are wakees generally changes with each iteration. 
However pthread_barrier_wait (or more accurately FUTEX_WAKE) has the nice property that we know exactly how many tasks are being woken, so we can cheat. It might be a disadvantage that we "widen" _every_ task that's woken in an event, while select_idle_sibling would work fine for the first sd_llc_size - 1 tasks. IIUC, if wake_affine() behaves correctly this trick wouldn't be necessary on SMP systems, so it might be best guarded by the presence of SD_ASYM_CPUCAPACITY? * * * Final note.. In order to observe "perfect" behaviour for this use case, I also had to disable the TTWU_QUEUE sched feature. Suppose during the wakeup above we are working through the work queue and have placed tasks 3 and 2, and are about to place task 1: v CPU v ->time-> -------------- 0 (big) 11111 /333 3 -------------- 1 (big) 22222 /444\|4 -------------- 2 (LITTLE) 333333/ 2 -------------- 3 (LITTLE) 444444/ <- Task 1 should go here -------------- If TTWU_QUEUE is enabled, we will not yet have enqueued task 2 (having instead sent a reschedule IPI) or attached its load to CPU 2. So we are likely to also place task 1 on cpu 2. Disabling TTWU_QUEUE means that we enqueue task 2 before placing task 1, solving this issue. TTWU_QUEUE is there to minimise rq lock contention, and I guess that this contention is less of an issue on big.LITTLE systems since they have relatively few CPUs, which suggests the trade-off makes sense here. Change-Id: I2080302839a263e0841a89efea8589ea53bbda9c Signed-off-by: Brendan Jackman <brendan.jackman@arm.com> Signed-off-by: Chris Redpath <chris.redpath@arm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Joel Fernandes <joelaf@google.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Matt Fleming <matt@codeblueprint.co.uk> 2017-08-07 15:46:13 +01:00			`int (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags,`
			`int sibling_count_hint);`
sched/core: Remove a parameter in the migrate_task_rq() function The parameter "int next_cpu" in the following function is unused: migrate_task_rq(struct task_struct *p, int next_cpu) Remove it. Signed-off-by: xiaofeng.yan <yanxiaofeng@inspur.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/1442991360-31945-1-git-send-email-yanxiaofeng@inspur.com Signed-off-by: Ingo Molnar <mingo@kernel.org> 2015-09-23 14:55:59 +08:00			`void (*migrate_task_rq)(struct task_struct *p);`

			`void (*task_waking) (struct task_struct *task);`
			`void (*task_woken) (struct rq *this_rq, struct task_struct *task);`

			`void (*set_cpus_allowed)(struct task_struct *p,`
			`const struct cpumask *newmask);`

			`void (*rq_online)(struct rq *rq);`
			`void (*rq_offline)(struct rq *rq);`
			`#endif`

			`void (*set_curr_task) (struct rq *rq);`
			`void (*task_tick) (struct rq *rq, struct task_struct *p, int queued);`
			`void (*task_fork) (struct task_struct *p);`
sched: Add sched_class->task_dead() method Add a new function to the scheduling class interface. It is called at the end of a context switch, if the prev task is in TASK_DEAD state. It will be useful for the scheduling classes that want to be notified when one of their tasks dies, e.g. to perform some cleanup actions, such as SCHED_DEADLINE. Signed-off-by: Dario Faggioli <raistlin@linux.it> Reviewed-by: Paul Turner <pjt@google.com> Signed-off-by: Juri Lelli <juri.lelli@gmail.com> Cc: bruce.ashfield@windriver.com Cc: claudio@evidence.eu.com Cc: darren@dvhart.com Cc: dhaval.giani@gmail.com Cc: fchecconi@gmail.com Cc: fweisbec@gmail.com Cc: harald.gustafsson@ericsson.com Cc: hgu1972@gmail.com Cc: insop.song@gmail.com Cc: jkacur@redhat.com Cc: johan.eker@ericsson.com Cc: liming.wang@windriver.com Cc: luca.abeni@unitn.it Cc: michael@amarulasolutions.com Cc: nicola.manica@disi.unitn.it Cc: oleg@redhat.com Cc: paulmck@linux.vnet.ibm.com Cc: p.faure@akatech.ch Cc: rostedt@goodmis.org Cc: tommaso.cucinotta@sssup.it Cc: vincent.guittot@linaro.org Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1383831828-15501-2-git-send-email-juri.lelli@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> 2013-11-07 14:43:35 +01:00			`void (*task_dead) (struct task_struct *p);`

sched/deadline: Implement cancel_dl_timer() to use in switched_from_dl() Currently used hrtimer_try_to_cancel() is racy: raw_spin_lock(&rq->lock) ... dl_task_timer raw_spin_lock(&rq->lock) ... raw_spin_lock(&rq->lock) ... switched_from_dl() ... ... hrtimer_try_to_cancel() ... ... switched_to_fair() ... ... ... ... ... ... ... ... raw_spin_unlock(&rq->lock) ... (asquired) ... ... ... ... ... ... do_exit() ... ... schedule() ... ... raw_spin_lock(&rq->lock) ... raw_spin_unlock(&rq->lock) ... ... ... raw_spin_unlock(&rq->lock) ... raw_spin_lock(&rq->lock) ... ... (asquired) put_task_struct() ... ... free_task_struct() ... ... ... ... raw_spin_unlock(&rq->lock) ... (asquired) ... ... ... ... ... (use after free) ... So, let's implement 100% guaranteed way to cancel the timer and let's be sure we are safe even in very unlikely situations. rq unlocking does not limit the area of switched_from_dl() use, because this has already been possible in pull_dl_task() below. Let's consider the safety of of this unlocking. New code in the patch is working when hrtimer_try_to_cancel() fails. This means the callback is running. In this case hrtimer_cancel() is just waiting till the callback is finished. Two 1) Since we are in switched_from_dl(), new class is not dl_sched_class and new prio is not less MAX_DL_PRIO. So, the callback returns early; it's right after !dl_task() check. After that hrtimer_cancel() returns back too. The above is: raw_spin_lock(rq->lock); ... ... dl_task_timer() ... raw_spin_lock(rq->lock); switched_from_dl() ... hrtimer_try_to_cancel() ... raw_spin_unlock(rq->lock); ... hrtimer_cancel() ... ... raw_spin_unlock(rq->lock); ... return HRTIMER_NORESTART; ... ... raw_spin_lock(rq->lock); ... 2) But the below is also possible: dl_task_timer() raw_spin_lock(rq->lock); ... raw_spin_unlock(rq->lock); raw_spin_lock(rq->lock); ... switched_from_dl() ... hrtimer_try_to_cancel() ... ... return HRTIMER_NORESTART; raw_spin_unlock(rq->lock); ... hrtimer_cancel(); ... 
raw_spin_lock(rq->lock); ... In this case hrtimer_cancel() returns immediately. Very unlikely case, just to mention. Nobody can manipulate the task, because check_class_changed() is always called with pi_lock locked. Nobody can force the task to participate in (concurrent) priority inheritance schemes (the same reason). All concurrent task operations require pi_lock, which is held by us. No deadlocks with dl_task_timer() are possible, because it returns right after !dl_task() check (it does nothing). If we receive a new dl_task during the time of unlocked rq, we just don't have to do pull_dl_task() in switched_from_dl() further. Signed-off-by: Kirill Tkhai <ktkhai@parallels.com> [ Added comments] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Juri Lelli <juri.lelli@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1414420852.19914.186.camel@tkhai Signed-off-by: Ingo Molnar <mingo@kernel.org> 2014-10-27 17:40:52 +03:00			`/*`
			`* The switched_from() call is allowed to drop rq->lock, therefore we`
			`* cannot assume the switched_from/switched_to pair is serialized by`
			`* rq->lock. They are however serialized by p->pi_lock.`
			`*/`
			`void (*switched_from) (struct rq *this_rq, struct task_struct *task);`
			`void (*switched_to) (struct rq *this_rq, struct task_struct *task);`
			`void (*prio_changed) (struct rq *this_rq, struct task_struct *task,`
			`int oldprio);`

			`unsigned int (*get_rr_interval) (struct rq *rq,`
			`struct task_struct *task);`

sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency Commit d670ec13178d0 "posix-cpu-timers: Cure SMP wobbles" fixes one glibc test case at the cost of breaking another one. After that commit, calling clock_nanosleep(TIMER_ABSTIME, X) and then clock_gettime(&Y) can result in Y time being smaller than X time. Reproducer/tester can be found further below, it can be compiled and run by: gcc -o tst-cpuclock2 tst-cpuclock2.c -pthread while ./tst-cpuclock2 ; do : ; done This reproducer, when running on a buggy kernel, will complain about "clock_gettime difference too small". Issue happens because on start in thread_group_cputimer() we initialize sum_exec_runtime of cputimer with threads runtime not yet accounted and then add the threads runtime to running cputimer again on scheduler tick, making its sum_exec_runtime bigger than actual threads runtime. KOSAKI Motohiro posted a fix for this problem, but that patch was never applied: https://lkml.org/lkml/2013/5/26/191 . This patch takes a different approach to cure the problem. It calls update_curr() when cputimer starts, which assures we will have updated stats of running threads and on the next schedule tick we will account only the runtime that elapsed from cputimer start. That also assures we have consistent state between cpu times of individual threads and cpu time of the process consisting of those threads. Full reproducer (tst-cpuclock2.c): #define _GNU_SOURCE #include <unistd.h> #include <sys/syscall.h> #include <stdio.h> #include <time.h> #include <pthread.h> #include <stdint.h> #include <inttypes.h> /* Parameters for the Linux kernel ABI for CPU clocks. */ #define CPUCLOCK_SCHED 2 #define MAKE_PROCESS_CPUCLOCK(pid, clock) \ ((~(clockid_t) (pid) << 3) | (clockid_t) (clock)) static pthread_barrier_t barrier; /* Help advance the clock. */ static void *chew_cpu(void *arg) { pthread_barrier_wait(&barrier); while (1) ; return NULL; } /* Don't use the glibc wrapper. 
*/ static int do_nanosleep(int flags, const struct timespec *req) { clockid_t clock_id = MAKE_PROCESS_CPUCLOCK(0, CPUCLOCK_SCHED); return syscall(SYS_clock_nanosleep, clock_id, flags, req, NULL); } static int64_t tsdiff(const struct timespec *before, const struct timespec *after) { int64_t before_i = before->tv_sec * 1000000000ULL + before->tv_nsec; int64_t after_i = after->tv_sec * 1000000000ULL + after->tv_nsec; return after_i - before_i; } int main(void) { int result = 0; pthread_t th; pthread_barrier_init(&barrier, NULL, 2); if (pthread_create(&th, NULL, chew_cpu, NULL) != 0) { perror("pthread_create"); return 1; } pthread_barrier_wait(&barrier); /* The test. */ struct timespec before, after, sleeptimeabs; int64_t sleepdiff, diffabs; const struct timespec sleeptime = {.tv_sec = 0,.tv_nsec = 100000000 }; /* The relative nanosleep. Not sure why this is needed, but its presence seems to make it easier to reproduce the problem. */ if (do_nanosleep(0, &sleeptime) != 0) { perror("clock_nanosleep"); return 1; } /* Get the current time. */ if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &before) < 0) { perror("clock_gettime[2]"); return 1; } /* Compute the absolute sleep time based on the current time. */ uint64_t nsec = before.tv_nsec + sleeptime.tv_nsec; sleeptimeabs.tv_sec = before.tv_sec + nsec / 1000000000; sleeptimeabs.tv_nsec = nsec % 1000000000; /* Sleep for the computed time. */ if (do_nanosleep(TIMER_ABSTIME, &sleeptimeabs) != 0) { perror("absolute clock_nanosleep"); return 1; } /* Get the time after the sleep. */ if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &after) < 0) { perror("clock_gettime[3]"); return 1; } /* The time after sleep should always be equal to or after the absolute sleep time passed to clock_nanosleep. 
*/ sleepdiff = tsdiff(&sleeptimeabs, &after); if (sleepdiff < 0) { printf("absolute clock_nanosleep woke too early: %" PRId64 "\n", sleepdiff); result = 1; printf("Before %llu.%09llu\n", before.tv_sec, before.tv_nsec); printf("After %llu.%09llu\n", after.tv_sec, after.tv_nsec); printf("Sleep %llu.%09llu\n", sleeptimeabs.tv_sec, sleeptimeabs.tv_nsec); } /* The difference between the timestamps taken before and after the clock_nanosleep call should be equal to or more than the duration of the sleep. */ diffabs = tsdiff(&before, &after); if (diffabs < sleeptime.tv_nsec) { printf("clock_gettime difference too small: %" PRId64 "\n", diffabs); result = 1; } pthread_cancel(th); return result; } Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20141112155843.GA24803@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org> 2014-11-12 16:58:44 +01:00			`void (*update_curr) (struct rq *rq);`

BACKPORT: sched/cgroup: Fix cpu_cgroup_fork() handling A new fair task is detached and attached from/to task_group with: cgroup_post_fork() ss->fork(child) := cpu_cgroup_fork() sched_move_task() task_move_group_fair() Which is wrong, because at this point in fork() the task isn't fully initialized and it cannot 'move' to another group, because its not attached to any group as yet. In fact, cpu_cgroup_fork() needs a small part of sched_move_task() so we can just call this small part directly instead sched_move_task(). And the task doesn't really migrate because it is not yet attached so we need the following sequence: do_fork() sched_fork() __set_task_cpu() cgroup_post_fork() set_task_rq() # set task group and runqueue wake_up_new_task() select_task_rq() can select a new cpu __set_task_cpu post_init_entity_util_avg attach_task_cfs_rq() activate_task enqueue_task This patch makes that happen. BACKPORT: Difference from original commit: - Removed use of DEQUEUE_MOVE (which isn't defined in 4.4) in dequeue_task flags - Replaced "struct rq_flags rf" with "unsigned long flags". Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> [ Added TASK_SET_GROUP to set depth properly. ] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> (cherry picked from commit ea86cb4b7621e1298a37197005bf0abcc86348d4) Change-Id: I8126fd923288acf961218431ffd29d6bf6fd8d72 Signed-off-by: Brendan Jackman <brendan.jackman@arm.com> Signed-off-by: Chris Redpath <chris.redpath@arm.com> 2016-06-17 13:38:55 +02:00			`#define TASK_SET_GROUP 0`
			`#define TASK_MOVE_GROUP 1`

			`#ifdef CONFIG_FAIR_GROUP_SCHED`
			`void (*task_change_group)(struct task_struct *p, int type);`
			`#endif`
sched: Consolidate hmp stats into their own struct Key hmp stats (nr_big_tasks, nr_small_tasks and cumulative_runnable_average) are currently maintained per-cpu in 'struct rq'. Merge those stats in their own structure (struct hmp_sched_stats) and modify impacted functions to deal with the newly introduced structure. This cleanup is required for a subsequent patch which fixes various issues with use of CFS_BANDWIDTH feature in HMP scheduler. Change-Id: Ieffc10a3b82a102f561331bc385d042c15a33998 Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org> [rameezmustafa@codeaurora.org: Port to msm-3.18] Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> [joonwoop@codeaurora.org: fixed conflict in __update_load_avg().] Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> 2015-01-16 11:27:31 +05:30			`#ifdef CONFIG_SCHED_HMP`
			`void (*inc_hmp_sched_stats)(struct rq *rq, struct task_struct *p);`
			`void (*dec_hmp_sched_stats)(struct rq *rq, struct task_struct *p);`
sched: avoid stale cumulative_runnable_avg HMP statistics When a new window starts for a task and the task is on a rq, scheduler decreases rq's cumulative_runnable_avg momentarily, re-account task's demand and increases rq's cumulative_runnable_avg with newly accounted task's demand. Therefore there is short time period that rq's cumulative_runnable_avg is less than what it's supposed to be. Meanwhile, there is chance that other CPU is in search of best CPU to place a task and makes suboptimal decision with momentarily stale cumulative_runnable_avg. Fix such issue by adding or subtracting of delta between task's old and new demand instead of decrementing and incrementing of entire task's load. Change-Id: I3c9329961e6f96e269fa13359e7d1c39c4973ff2 Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> 2015-07-13 21:04:18 -07:00			`void (*fixup_hmp_sched_stats)(struct rq *rq, struct task_struct *p,`
sched: Add separate load tracking histogram to predict loads Current window based load tracking only saves history for five windows. A historically heavy task's heavy load will be completely forgotten after five windows of light load. Even before the five window expires, a heavy task wakes up on same CPU it used to run won't trigger any frequency change until end of the window. It would starve for the entire window. It also adds one "small" load window to history because it's accumulating load at a low frequency, further reducing the tracked load for this heavy task. Ideally, scheduler should be able to identify such tasks and notify governor to increase frequency immediately after it wakes up. Add a histogram for each task to track a much longer load history. A prediction will be made based on runtime of previous or current window, histogram data and load tracked in recent windows. Prediction of all tasks that is currently running or runnable on a CPU is aggregated and reported to CPUFreq governor in sched_get_cpus_busy(). sched_get_cpus_busy() now returns predicted busy time in addition to previous window busy time and new task busy time, scaled to the CPU maximum possible frequency. Tunables: - /proc/sys/kernel/sched_gov_alert_freq (KHz) This tunable can be used to further filter the notifications. Frequency alert notification is sent only when the predicted load exceeds previous window load by sched_gov_alert_freq converted to load. Change-Id: If29098cd2c5499163ceaff18668639db76ee8504 Suggested-by: Saravana Kannan <skannan@codeaurora.org> Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org> Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> Signed-off-by: Junjie Wu <junjiew@codeaurora.org> [joonwoop@codeaurora.org: fixed merge conflicts around __migrate_task() and removed changes for CONFIG_SCHED_QHMP.] 2015-06-08 09:08:47 +05:30			`u32 new_task_load, u32 new_pred_demand);`
sched: Consolidate hmp stats into their own struct Key hmp stats (nr_big_tasks, nr_small_tasks and cumulative_runnable_average) are currently maintained per-cpu in 'struct rq'. Merge those stats in their own structure (struct hmp_sched_stats) and modify impacted functions to deal with the newly introduced structure. This cleanup is required for a subsequent patch which fixes various issues with use of CFS_BANDWIDTH feature in HMP scheduler. Change-Id: Ieffc10a3b82a102f561331bc385d042c15a33998 Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org> [rameezmustafa@codeaurora.org: Port to msm-3.18] Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> [joonwoop@codeaurora.org: fixed conflict in __update_load_avg().] Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> 2015-01-16 11:27:31 +05:30			`#endif`
sched: Move struct sched_class to kernel/sched/sched.h It's used internally only. Signed-off-by: Li Zefan <lizefan@huawei.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/5135A79F.8090502@huawei.com Signed-off-by: Ingo Molnar <mingo@kernel.org> 2013-03-05 16:06:55 +08:00			`};`
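The commit message attached to `fixup_hmp_sched_stats` above describes replacing a "subtract the whole load, then add it back" update with a single delta update. A minimal userspace sketch of that idea follows. The struct and function names here (`hmp_stats_sketch`, `fixup_cumulative_runnable_avg`) are hypothetical and simplified, and are not the kernel's actual definitions.

```c
#include <stdint.h>

/* Hypothetical, simplified stand-in for the per-runqueue HMP statistics. */
struct hmp_stats_sketch {
	uint64_t cumulative_runnable_avg;
};

/*
 * Apply only the difference between the task's new and old demand, so the
 * aggregate never passes through an intermediate value that excludes the
 * task entirely (the "momentarily stale" state the commit describes).
 */
static void fixup_cumulative_runnable_avg(struct hmp_stats_sketch *stats,
					  uint32_t old_demand,
					  uint32_t new_demand)
{
	int64_t delta = (int64_t)new_demand - (int64_t)old_demand;
	int64_t updated = (int64_t)stats->cumulative_runnable_avg + delta;

	/* Assumption: clamp at zero instead of wrapping on bad accounting. */
	stats->cumulative_runnable_avg = updated < 0 ? 0 : (uint64_t)updated;
}
```

With a starting aggregate of 1000, moving a task's demand from 300 to 450 yields 1150 in one step. An observer on another CPU would see either 1000 or 1150, never the 700 that a decrement-then-increment sequence would briefly expose.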
sched: Make separate sched.c translation units Since once needs to do something at conferences and fixing compile warnings doesn't actually require much if any attention I decided to break up the sched.c #include ".c" fest. This further modularizes the scheduler code. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-x0fcd3mnp8f9c99grcpewmhi@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu> 2011-10-25 10:00:11 +02:00
sched: Fix hotplug task migration Dan Carpenter reported: > kernel/sched/rt.c:1347 pick_next_task_rt() warn: variable dereferenced before check 'prev' (see line 1338) > kernel/sched/deadline.c:1011 pick_next_task_dl() warn: variable dereferenced before check 'prev' (see line 1005) Kirill also spotted that migrate_tasks() will have an instant NULL deref because pick_next_task() will immediately deref prev. Instead of fixing all the corner cases because migrate_tasks() can pass in a NULL prev task in the unlikely case of hot-un-plug, provide a fake task such that we can remove all the NULL checks from the far more common paths. A further problem; not previously spotted; is that because we pushed pre_schedule() and idle_balance() into pick_next_task() we now need to avoid those getting called and pulling more tasks on our dying CPU. We avoid pull_{dl,rt}_task() by setting fake_task.prio to MAX_PRIO+1. We also note that since we call pick_next_task() exactly the amount of times we have runnable tasks present, we should never land in idle_balance(). Fixes: 38033c37faab ("sched: Push down pre_schedule() and idle_balance()") Cc: Juri Lelli <juri.lelli@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Reported-by: Kirill Tkhai <tkhai@yandex.ru> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20140212094930.GB3545@laptop.programming.kicks-ass.net Signed-off-by: Thomas Gleixner <tglx@linutronix.de> 2014-02-12 10:49:30 +01:00			`static inline void put_prev_task(struct rq *rq, struct task_struct *prev)`
			`{`
			`prev->sched_class->put_prev_task(rq, prev);`
			`}`

sched: Make separate sched.c translation units Since once needs to do something at conferences and fixing compile warnings doesn't actually require much if any attention I decided to break up the sched.c #include ".c" fest. This further modularizes the scheduler code. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-x0fcd3mnp8f9c99grcpewmhi@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu> 2011-10-25 10:00:11 +02:00			`#define sched_class_highest (&stop_sched_class)`
			`#define for_each_class(class) \`
			`for (class = sched_class_highest; class; class = class->next)`

			`extern const struct sched_class stop_sched_class;`
sched/deadline: Add SCHED_DEADLINE structures & implementation Introduces the data structures, constants and symbols needed for SCHED_DEADLINE implementation. Core data structure of SCHED_DEADLINE are defined, along with their initializers. Hooks for checking if a task belong to the new policy are also added where they are needed. Adds a scheduling class, in sched/dl.c and a new policy called SCHED_DEADLINE. It is an implementation of the Earliest Deadline First (EDF) scheduling algorithm, augmented with a mechanism (called Constant Bandwidth Server, CBS) that makes it possible to isolate the behaviour of tasks between each other. The typical -deadline task will be made up of a computation phase (instance) which is activated on a periodic or sporadic fashion. The expected (maximum) duration of such computation is called the task's runtime; the time interval by which each instance need to be completed is called the task's relative deadline. The task's absolute deadline is dynamically calculated as the time instant a task (better, an instance) activates plus the relative deadline. The EDF algorithms selects the task with the smallest absolute deadline as the one to be executed first, while the CBS ensures each task to run for at most its runtime every (relative) deadline length time interval, avoiding any interference between different tasks (bandwidth isolation). Thanks to this feature, also tasks that do not strictly comply with the computational model sketched above can effectively use the new policy. To summarize, this patch: - introduces the data structures, constants and symbols needed; - implements the core logic of the scheduling algorithm in the new scheduling class file; - provides all the glue code between the new scheduling class and the core scheduler and refines the interactions between sched/dl and the other existing scheduling classes. 
Signed-off-by: Dario Faggioli <raistlin@linux.it> Signed-off-by: Michael Trimarchi <michael@amarulasolutions.com> Signed-off-by: Fabio Checconi <fchecconi@gmail.com> Signed-off-by: Juri Lelli <juri.lelli@gmail.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1383831828-15501-4-git-send-email-juri.lelli@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> 2013-11-28 11:14:43 +01:00			`extern const struct sched_class dl_sched_class;`
sched: Make separate sched.c translation units Since once needs to do something at conferences and fixing compile warnings doesn't actually require much if any attention I decided to break up the sched.c #include ".c" fest. This further modularizes the scheduler code. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-x0fcd3mnp8f9c99grcpewmhi@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu> 2011-10-25 10:00:11 +02:00			`extern const struct sched_class rt_sched_class;`
			`extern const struct sched_class fair_sched_class;`
			`extern const struct sched_class idle_sched_class;`
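The `for_each_class` macro above walks a singly linked list of scheduling classes, starting at `sched_class_highest` and following `next` until it reaches NULL. The sketch below reproduces that traversal in plain C. The `sched_class_sketch` type, the `*_class` objects, and the stop, dl, rt, fair, idle chaining are illustrative assumptions modelled on the declarations above, not the kernel's real objects.

```c
#include <stddef.h>

/* Hypothetical, trimmed-down class descriptor: a list link and a name. */
struct sched_class_sketch {
	const struct sched_class_sketch *next;
	const char *name;
};

/* Defined lowest priority first so each class can point at the one below. */
static const struct sched_class_sketch idle_class = { NULL, "idle" };
static const struct sched_class_sketch fair_class = { &idle_class, "fair" };
static const struct sched_class_sketch rt_class   = { &fair_class, "rt" };
static const struct sched_class_sketch dl_class   = { &rt_class, "dl" };
static const struct sched_class_sketch stop_class = { &dl_class, "stop" };

#define sketch_class_highest (&stop_class)
#define for_each_class_sketch(class) \
	for (class = sketch_class_highest; class; class = class->next)

/* Count the classes by walking the list from highest to lowest priority. */
static size_t count_classes(void)
{
	const struct sched_class_sketch *class;
	size_t n = 0;

	for_each_class_sketch(class)
		n++;
	return n;
}
```

Because the priority order is encoded in the `next` pointers rather than in an array index, a class can be inserted into the chain without renumbering the others.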


			`#ifdef CONFIG_SMP`

FIXUP: sched: fix build for non-SMP target Currently the build for a single-core (e.g. user-mode) Linux is broken and this configuration is required (at least) to run some network tests. The main issues for the current code support on single-core systems are: 1. {se,rq}::sched_avg is not available nor maintained for !SMP systems This means that load and utilisation signals are NOT available in single core systems. All the EAS code depends on these signals. 2. sched_group_energy is also SMP dependant. Again this means that all the EAS setup and preparation code (energyn model initialization) has to be properly guarded/disabled for !SMP systems. 3. SchedFreq depends on utilization signal, which is not available on !SMP systems. 4. SchedTune is useless on unicore systems if SchedFreq is not available. 5. WALT machinery is not required on single-core systems. This patch addresses all these issues by enforcing some constraints for single-core systems: a) WALT, SchedTune and SchedTune are now dependant on SMP b) The default governor for !SMP systems is INTERACTIVE c) The energy model initialisation/build functions are d) Other minor code re-arrangements and CONFIG_SMP guarding to enable single core builds. Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com> 2016-07-22 11:35:59 +01:00			`extern void init_max_cpu_capacity(struct max_cpu_capacity *mcc);`
sched: Let 'struct sched_group_power' care about CPU capacity It is better not to think about compute capacity as being equivalent to "CPU power". The upcoming "power aware" scheduler work may create confusion with the notion of energy consumption if "power" is used too liberally. Since struct sched_group_power is really about compute capacity of sched groups, let's rename it to struct sched_group_capacity. Similarly sgp becomes sgc. Related variables and functions dealing with groups are also adjusted accordingly. Signed-off-by: Nicolas Pitre <nico@linaro.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: linaro-kernel@lists.linaro.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/n/tip-5yeix833vvgf2uyj5o36hpu9@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> 2014-05-26 18:19:37 -04:00			`extern void update_group_capacity(struct sched_domain *sd, int cpu);`
sched: Fix update_group_power() prototype placement to fix build warning when !CONFIG_SMP All warnings: In file included from kernel/sched/core.c:85:0: kernel/sched/sched.h:1036:39: warning: 'struct sched_domain' declared inside parameter list kernel/sched/sched.h:1036:39: warning: its scope is only this definition or declaration, which is probably not what you want It's because struct sched_domain is defined inside #if CONFIG_SMP, while update_group_power() is declared unconditionally. Fix this warning by declaring update_group_power() only if CONFIG_SMP=n. Build tested with CONFIG_SMP enabled and then disabled. Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Li Zefan <lizefan@huawei.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/5137F4BA.2060101@huawei.com Signed-off-by: Ingo Molnar <mingo@kernel.org> 2013-03-07 10:00:26 +08:00
sched: Reduce trigger_load_balance() parameters The cpu information is already stored in the struct rq, so no need to pass it as parameter to the trigger_load_balance function. Cc: linaro-kernel@lists.linaro.org Cc: preeti.lkml@gmail.com Cc: mingo@redhat.com Cc: peterz@infradead.org Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1389008085-9069-2-git-send-email-daniel.lezcano@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org> 2014-01-06 12:34:38 +01:00			`extern void trigger_load_balance(struct rq *rq);`
sched: add cpu isolation support This adds cpu isolation APIs to the scheduler to isolate and unisolate CPUs. Isolating and unisolating a CPU can be used in place of hotplug. Isolating and unisolating a CPU is faster than hotplug and can thus be used to optimize the performance and power of multi-core CPUs. Isolating works by migrating non-pinned IRQs and tasks to other CPUS and marking the CPU as not available to the scheduler and load balancer. Pinned tasks and IRQs are still allowed to run but it is expected that this would be minimal. Unisolation works by just marking the CPU available for scheduler and load balancer. Change-Id: I0bbddb56238c2958c5987877c5bfc3e79afa67cc Signed-off-by: Olav Haugan <ohaugan@codeaurora.org> 2016-05-31 14:34:46 -07:00			`extern void nohz_balance_clear_nohz_mask(int cpu);`
sched: Make separate sched.c translation units Since once needs to do something at conferences and fixing compile warnings doesn't actually require much if any attention I decided to break up the sched.c #include ".c" fest. This further modularizes the scheduler code. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-x0fcd3mnp8f9c99grcpewmhi@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu> 2011-10-25 10:00:11 +02:00
sched: Fix wrong rq's runnable_avg update with rt tasks The current update of the rq's load can be erroneous when RT tasks are involved. The update of the load of a rq that becomes idle, is done only if the avg_idle is less than sysctl_sched_migration_cost. If RT tasks and short idle duration alternate, the runnable_avg will not be updated correctly and the time will be accounted as idle time when a CFS task wakes up. A new idle_enter function is called when the next task is the idle function so the elapsed time will be accounted as run time in the load of the rq, whatever the average idle time is. The function update_rq_runnable_avg is removed from idle_balance. When a RT task is scheduled on an idle CPU, the update of the rq's load is not done when the rq exit idle state because CFS's functions are not called. Then, the idle_balance, which is called just before entering the idle function, updates the rq's load and makes the assumption that the elapsed time since the last update, was only running time. As a consequence, the rq's load of a CPU that only runs a periodic RT task, is close to LOAD_AVG_MAX whatever the running duration of the RT task is. A new idle_exit function is called when the prev task is the idle function so the elapsed time will be accounted as idle time in the rq's load. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: linaro-kernel@lists.linaro.org Cc: peterz@infradead.org Cc: pjt@google.com Cc: fweisbec@gmail.com Cc: efault@gmx.de Link: http://lkml.kernel.org/r/1366302867-5055-1-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org> 2013-04-18 18:34:26 +02:00			`extern void idle_enter_fair(struct rq *this_rq);`
			`extern void idle_exit_fair(struct rq *this_rq);`

sched: Make sched_class::set_cpus_allowed() unconditional Give every class a set_cpus_allowed() method, this enables some small optimization in the RT,DL implementation by avoiding a double cpumask_weight() call. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dedekind1@gmail.com Cc: juri.lelli@arm.com Cc: mgorman@suse.de Cc: riel@redhat.com Cc: rostedt@goodmis.org Link: http://lkml.kernel.org/r/20150515154833.614517487@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org> 2015-05-15 17:43:35 +02:00			`extern void set_cpus_allowed_common(struct task_struct *p, const struct cpumask *new_mask);`

sched: Remove some #ifdeffery Remove a few gratuitous #ifdefs in pick_next_task*(). Cc: Ingo Molnar <mingo@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Juri Lelli <juri.lelli@gmail.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-nnzddp5c4fijyzzxxrwlxghf@git.kernel.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de> 2014-02-12 15:47:29 +01:00			`#else`

			`static inline void idle_enter_fair(struct rq *rq) { }`
			`static inline void idle_exit_fair(struct rq *rq) { }`

sched: Make separate sched.c translation units Since once needs to do something at conferences and fixing compile warnings doesn't actually require much if any attention I decided to break up the sched.c #include ".c" fest. This further modularizes the scheduler code. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-x0fcd3mnp8f9c99grcpewmhi@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu> 2011-10-25 10:00:11 +02:00			`#endif`

sched: Let the scheduler see CPU idle states When the cpu enters idle, it stores the cpuidle state pointer in its struct rq instance which in turn could be used to make a better decision when balancing tasks. As soon as the cpu exits its idle state, the struct rq reference is cleared. There are a couple of situations where the idle state pointer could be changed while it is being consulted: 1. For x86/acpi with dynamic c-states, when a laptop switches from battery to AC that could result on removing the deeper idle state. The acpi driver triggers: 'acpi_processor_cst_has_changed' 'cpuidle_pause_and_lock' 'cpuidle_uninstall_idle_handler' 'kick_all_cpus_sync'. All cpus will exit their idle state and the pointed object will be set to NULL. 2. The cpuidle driver is unloaded. Logically that could happen but not in practice because the drivers are always compiled in and 95% of them are not coded to unregister themselves. In any case, the unloading code must call 'cpuidle_unregister_device', that calls 'cpuidle_pause_and_lock' leading to 'kick_all_cpus_sync' as mentioned above. A race can happen if we use the pointer and then one of these two scenarios occurs at the same moment. In order to be safe, the idle state pointer stored in the rq must be used inside a rcu_read_lock section where we are protected with the 'rcu_barrier' in the 'cpuidle_uninstall_idle_handler' function. The idle_get_state() and idle_put_state() accessors should be used to that effect. Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Signed-off-by: Nicolas Pitre <nico@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: linux-pm@vger.kernel.org Cc: linaro-kernel@lists.linaro.org Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/n/tip-@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> 2014-09-04 11:32:09 -04:00			`#ifdef CONFIG_CPU_IDLE`
			`static inline void idle_set_state(struct rq *rq,`
			`struct cpuidle_state *idle_state)`
			`{`
			`rq->idle_state = idle_state;`
			`}`

			`static inline struct cpuidle_state *idle_get_state(struct rq *rq)`
			`{`
			`WARN_ON(!rcu_read_lock_held());`
			`return rq->idle_state;`
			`}`
sched, cpuidle: Track cpuidle state index in the scheduler The idle-state of each cpu is currently pointed to by rq->idle_state but there isn't any information in the struct cpuidle_state that can used to look up the idle-state energy model data stored in struct sched_group_energy. For this purpose is necessary to store the idle state index as well. Ideally, the idle-state data should be unified. cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> 2015-01-27 13:48:07 +00:00
			`static inline void idle_set_state_idx(struct rq *rq, int idle_state_idx)`
			`{`
			`rq->idle_state_idx = idle_state_idx;`
			`}`

			`static inline int idle_get_state_idx(struct rq *rq)`
			`{`
			`WARN_ON(!rcu_read_lock_held());`
			`return rq->idle_state_idx;`
			`}`
sched: Let the scheduler see CPU idle states When the cpu enters idle, it stores the cpuidle state pointer in its struct rq instance which in turn could be used to make a better decision when balancing tasks. As soon as the cpu exits its idle state, the struct rq reference is cleared. There are a couple of situations where the idle state pointer could be changed while it is being consulted: 1. For x86/acpi with dynamic c-states, when a laptop switches from battery to AC that could result on removing the deeper idle state. The acpi driver triggers: 'acpi_processor_cst_has_changed' 'cpuidle_pause_and_lock' 'cpuidle_uninstall_idle_handler' 'kick_all_cpus_sync'. All cpus will exit their idle state and the pointed object will be set to NULL. 2. The cpuidle driver is unloaded. Logically that could happen but not in practice because the drivers are always compiled in and 95% of them are not coded to unregister themselves. In any case, the unloading code must call 'cpuidle_unregister_device', that calls 'cpuidle_pause_and_lock' leading to 'kick_all_cpus_sync' as mentioned above. A race can happen if we use the pointer and then one of these two scenarios occurs at the same moment. In order to be safe, the idle state pointer stored in the rq must be used inside a rcu_read_lock section where we are protected with the 'rcu_barrier' in the 'cpuidle_uninstall_idle_handler' function. The idle_get_state() and idle_put_state() accessors should be used to that effect. Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Signed-off-by: Nicolas Pitre <nico@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: linux-pm@vger.kernel.org Cc: linaro-kernel@lists.linaro.org Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/n/tip-@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> 2014-09-04 11:32:09 -04:00			`#else`
			`static inline void idle_set_state(struct rq *rq,`
			`struct cpuidle_state *idle_state)`
			`{`
			`}`

			`static inline struct cpuidle_state *idle_get_state(struct rq *rq)`
			`{`
			`return NULL;`
			`}`
sched, cpuidle: Track cpuidle state index in the scheduler The idle-state of each cpu is currently pointed to by rq->idle_state but there isn't any information in the struct cpuidle_state that can used to look up the idle-state energy model data stored in struct sched_group_energy. For this purpose is necessary to store the idle state index as well. Ideally, the idle-state data should be unified. cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> 2015-01-27 13:48:07 +00:00
			`static inline void idle_set_state_idx(struct rq *rq, int idle_state_idx)`
			`{`
			`}`

			`static inline int idle_get_state_idx(struct rq *rq)`
			`{`
			`return -1;`
			`}`
sched: Let the scheduler see CPU idle states When the cpu enters idle, it stores the cpuidle state pointer in its struct rq instance which in turn could be used to make a better decision when balancing tasks. As soon as the cpu exits its idle state, the struct rq reference is cleared. There are a couple of situations where the idle state pointer could be changed while it is being consulted: 1. For x86/acpi with dynamic c-states, when a laptop switches from battery to AC that could result on removing the deeper idle state. The acpi driver triggers: 'acpi_processor_cst_has_changed' 'cpuidle_pause_and_lock' 'cpuidle_uninstall_idle_handler' 'kick_all_cpus_sync'. All cpus will exit their idle state and the pointed object will be set to NULL. 2. The cpuidle driver is unloaded. Logically that could happen but not in practice because the drivers are always compiled in and 95% of them are not coded to unregister themselves. In any case, the unloading code must call 'cpuidle_unregister_device', that calls 'cpuidle_pause_and_lock' leading to 'kick_all_cpus_sync' as mentioned above. A race can happen if we use the pointer and then one of these two scenarios occurs at the same moment. In order to be safe, the idle state pointer stored in the rq must be used inside a rcu_read_lock section where we are protected with the 'rcu_barrier' in the 'cpuidle_uninstall_idle_handler' function. The idle_get_state() and idle_put_state() accessors should be used to that effect. Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Signed-off-by: Nicolas Pitre <nico@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: linux-pm@vger.kernel.org Cc: linaro-kernel@lists.linaro.org Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/n/tip-@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> 2014-09-04 11:32:09 -04:00			`#endif`

sched/debug: Make sysrq prints of sched debug data optional Calls to sysrq_sched_debug_show() can yield rather verbose output which contributes to log spew and, under heavy load, may increase the chances of a watchdog bark. Make printing of this data optional with the introduction of a new Kconfig, CONFIG_SYSRQ_SCHED_DEBUG. Change-Id: I5f54d901d0dea403109f7ac33b8881d967a899ed Signed-off-by: Matt Wagantall <mattw@codeaurora.org> 2013-12-05 20:01:32 -08:00			`#ifdef CONFIG_SYSRQ_SCHED_DEBUG`
sched: Make separate sched.c translation units Since once needs to do something at conferences and fixing compile warnings doesn't actually require much if any attention I decided to break up the sched.c #include ".c" fest. This further modularizes the scheduler code. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-x0fcd3mnp8f9c99grcpewmhi@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu> 2011-10-25 10:00:11 +02:00			`extern void sysrq_sched_debug_show(void);`
sched/debug: Make sysrq prints of sched debug data optional Calls to sysrq_sched_debug_show() can yield rather verbose output which contributes to log spew and, under heavy load, may increase the chances of a watchdog bark. Make printing of this data optional with the introduction of a new Kconfig, CONFIG_SYSRQ_SCHED_DEBUG. Change-Id: I5f54d901d0dea403109f7ac33b8881d967a899ed Signed-off-by: Matt Wagantall <mattw@codeaurora.org> 2013-12-05 20:01:32 -08:00			`#endif`
sched: Make separate sched.c translation units Since once needs to do something at conferences and fixing compile warnings doesn't actually require much if any attention I decided to break up the sched.c #include ".c" fest. This further modularizes the scheduler code. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-x0fcd3mnp8f9c99grcpewmhi@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu> 2011-10-25 10:00:11 +02:00			`extern void sched_init_granularity(void);`
			`extern void update_max_interval(void);`
sched/deadline: Add SCHED_DEADLINE SMP-related data structures & logic Introduces data structures relevant for implementing dynamic migration of -deadline tasks and the logic for checking if runqueues are overloaded with -deadline tasks and for choosing where a task should migrate, when it is the case. Adds also dynamic migrations to SCHED_DEADLINE, so that tasks can be moved among CPUs when necessary. It is also possible to bind a task to a (set of) CPU(s), thus restricting its capability of migrating, or forbidding migrations at all. The very same approach used in sched_rt is utilised: - -deadline tasks are kept into CPU-specific runqueues, - -deadline tasks are migrated among runqueues to achieve the following: * on an M-CPU system the M earliest deadline ready tasks are always running; * affinity/cpusets settings of all the -deadline tasks is always respected. Therefore, this very special form of "load balancing" is done with an active method, i.e., the scheduler pushes or pulls tasks between runqueues when they are woken up and/or (de)scheduled. IOW, every time a preemption occurs, the descheduled task might be sent to some other CPU (depending on its deadline) to continue executing (push). On the other hand, every time a CPU becomes idle, it might pull the second earliest deadline ready task from some other CPU. To enforce this, a pull operation is always attempted before taking any scheduling decision (pre_schedule()), as well as a push one after each scheduling decision (post_schedule()). In addition, when a task arrives or wakes up, the best CPU where to resume it is selected taking into account its affinity mask, the system topology, but also its deadline. E.g., from the scheduling point of view, the best CPU where to wake up (and also where to push) a task is the one which is running the task with the latest deadline among the M executing ones. 
In order to facilitate these decisions, per-runqueue "caching" of the deadlines of the currently running and of the first ready task is used. Queued but not running tasks are also parked in another rb-tree to speed-up pushes. Signed-off-by: Juri Lelli <juri.lelli@gmail.com> Signed-off-by: Dario Faggioli <raistlin@linux.it> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1383831828-15501-5-git-send-email-juri.lelli@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> 2013-11-07 14:43:38 +01:00
			`extern void init_sched_dl_class(void);`
sched: Make separate sched.c translation units Since once needs to do something at conferences and fixing compile warnings doesn't actually require much if any attention I decided to break up the sched.c #include ".c" fest. This further modularizes the scheduler code. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-x0fcd3mnp8f9c99grcpewmhi@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu> 2011-10-25 10:00:11 +02:00			`extern void init_sched_rt_class(void);`
			`extern void init_sched_fair_class(void);`

sched: Transform resched_task() into resched_curr() We always use resched_task() with rq->curr argument. It's not possible to reschedule any task but rq's current. The patch introduces resched_curr(struct rq *) to replace all of the repeating patterns. The main aim is cleanup, but there is a little size profit too: (before) $ size kernel/sched/built-in.o text data bss dec hex filename 155274 16445 7042 178761 2ba49 kernel/sched/built-in.o $ size vmlinux text data bss dec hex filename 7411490 1178376 991232 9581098 92322a vmlinux (after) $ size kernel/sched/built-in.o text data bss dec hex filename 155130 16445 7042 178617 2b9b9 kernel/sched/built-in.o $ size vmlinux text data bss dec hex filename 7411362 1178376 991232 9580970 9231aa vmlinux I was choosing between resched_curr() and resched_rq(), and the first name looks better for me. A little lie in Documentation/trace/ftrace.txt. I have not actually collected the tracing again. With a hope the patch won't make execution times much worse :) Signed-off-by: Kirill Tkhai <tkhai@yandex.ru> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/20140628200219.1778.18735.stgit@localhost Signed-off-by: Ingo Molnar <mingo@kernel.org> 2014-06-29 00:03:57 +04:00			`extern void resched_curr(struct rq *rq);`
sched: Make separate sched.c translation units Since once needs to do something at conferences and fixing compile warnings doesn't actually require much if any attention I decided to break up the sched.c #include ".c" fest. This further modularizes the scheduler code. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-x0fcd3mnp8f9c99grcpewmhi@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu> 2011-10-25 10:00:11 +02:00			`extern void resched_cpu(int cpu);`

			`extern struct rt_bandwidth def_rt_bandwidth;`
			`extern void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime);`
ANDROID: sched/rt: schedtune: Add boost retention to RT Boosted RT tasks can be deboosted quickly, this makes boost usless for RT tasks and causes lots of glitching. Use timers to prevent de-boost too soon and wait for long enough such that next enqueue happens after a threshold. While this can be solved in the governor, there are following advantages: - The approach used is governor-independent - Reduces boost group lock contention for frequently sleepers/wakers Note: Fixed build breakage due to schedfreq dependency which isn't used for RT anymore. Bug: 30210506 Change-Id: I428a2695cac06cc3458cdde0dea72315e4e66c00 Signed-off-by: Joel Fernandes <joelaf@google.com> 2017-09-11 17:10:37 -07:00			`extern void init_rt_schedtune_timer(struct sched_rt_entity *rt_se);`
sched: Make separate sched.c translation units Since once needs to do something at conferences and fixing compile warnings doesn't actually require much if any attention I decided to break up the sched.c #include ".c" fest. This further modularizes the scheduler code. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-x0fcd3mnp8f9c99grcpewmhi@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu> 2011-10-25 10:00:11 +02:00
sched/deadline: Add bandwidth management for SCHED_DEADLINE tasks For deadline scheduling to be effective and useful, it is important to have some method of keeping the allocation of the available CPU bandwidth to tasks and task groups under control. This is usually called "admission control", and if it is not performed at all, no guarantee can be given on the actual scheduling of the -deadline tasks. Since RT-throttling was introduced, each task group has had a bandwidth associated with it, calculated as a certain amount of runtime over a period. Moreover, to make it possible to manipulate such bandwidth, readable/writable controls have been added to both procfs (for system wide settings) and cgroupfs (for per-group settings). Therefore, the same interface is being used for controlling the bandwidth distribution to -deadline tasks and task groups, i.e., new controls but with similar names, equivalent meaning and with the same usage paradigm are added. However, more discussion is needed in order to figure out how we want to manage SCHED_DEADLINE bandwidth at the task group level. Therefore, this patch adds a less sophisticated, but actually very sensible, mechanism to ensure that a certain utilization cap is not exceeded in each root_domain (the single rq for !SMP configurations). Another main difference between deadline bandwidth management and RT-throttling is that -deadline tasks have bandwidth on their own (while -rt ones don't!), and thus we don't need a higher-level throttling mechanism to enforce the desired bandwidth. 
This patch, therefore: - adds system wide deadline bandwidth management by means of: * /proc/sys/kernel/sched_dl_runtime_us, * /proc/sys/kernel/sched_dl_period_us, that determine (i.e., runtime / period) the total bandwidth available on each CPU of each root_domain for -deadline tasks; - couples the RT and deadline bandwidth management, i.e., enforces that the sum of the bandwidth devoted to -rt and -deadline tasks stays below 100%. This means that, for a root_domain comprising M CPUs, -deadline tasks can be created as long as the sum of their bandwidths stays below: M * (sched_dl_runtime_us / sched_dl_period_us) It is also possible to disable this bandwidth management logic, and thus be free to oversubscribe the system up to any arbitrary level. Signed-off-by: Dario Faggioli <raistlin@linux.it> Signed-off-by: Juri Lelli <juri.lelli@gmail.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1383831828-15501-12-git-send-email-juri.lelli@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> 2013-11-07 14:43:45 +01:00			`extern struct dl_bandwidth def_dl_bandwidth;`
			`extern void init_dl_bandwidth(struct dl_bandwidth *dl_b, u64 period, u64 runtime);`
sched/deadline: Add SCHED_DEADLINE structures & implementation Introduces the data structures, constants and symbols needed for SCHED_DEADLINE implementation. Core data structure of SCHED_DEADLINE are defined, along with their initializers. Hooks for checking if a task belong to the new policy are also added where they are needed. Adds a scheduling class, in sched/dl.c and a new policy called SCHED_DEADLINE. It is an implementation of the Earliest Deadline First (EDF) scheduling algorithm, augmented with a mechanism (called Constant Bandwidth Server, CBS) that makes it possible to isolate the behaviour of tasks between each other. The typical -deadline task will be made up of a computation phase (instance) which is activated on a periodic or sporadic fashion. The expected (maximum) duration of such computation is called the task's runtime; the time interval by which each instance need to be completed is called the task's relative deadline. The task's absolute deadline is dynamically calculated as the time instant a task (better, an instance) activates plus the relative deadline. The EDF algorithms selects the task with the smallest absolute deadline as the one to be executed first, while the CBS ensures each task to run for at most its runtime every (relative) deadline length time interval, avoiding any interference between different tasks (bandwidth isolation). Thanks to this feature, also tasks that do not strictly comply with the computational model sketched above can effectively use the new policy. To summarize, this patch: - introduces the data structures, constants and symbols needed; - implements the core logic of the scheduling algorithm in the new scheduling class file; - provides all the glue code between the new scheduling class and the core scheduler and refines the interactions between sched/dl and the other existing scheduling classes. 
Signed-off-by: Dario Faggioli <raistlin@linux.it> Signed-off-by: Michael Trimarchi <michael@amarulasolutions.com> Signed-off-by: Fabio Checconi <fchecconi@gmail.com> Signed-off-by: Juri Lelli <juri.lelli@gmail.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1383831828-15501-4-git-send-email-juri.lelli@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> 2013-11-28 11:14:43 +01:00			`extern void init_dl_task_timer(struct sched_dl_entity *dl_se);`

			`unsigned long to_ratio(u64 period, u64 runtime);`
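The admission rule from the bandwidth-management commit message (the summed -deadline bandwidth of a root_domain with M CPUs must not exceed M * runtime / period) can be sketched in user space as below. This is a minimal sketch, not the kernel's code: `to_ratio_sketch()`, `dl_admit_sketch()` and `BW_SHIFT_SKETCH` are hypothetical names, and the 20-bit fixed-point scale is an assumption made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: ratios are 20-bit fixed point, so 1.0 is 1 << 20. */
#define BW_SHIFT_SKETCH 20

/* Hypothetical stand-in for to_ratio(): runtime / period in fixed point. */
static uint64_t to_ratio_sketch(uint64_t period, uint64_t runtime)
{
	if (period == 0)
		return 0;	/* avoid dividing by zero */
	return (runtime << BW_SHIFT_SKETCH) / period;
}

/*
 * Hypothetical admission test for a root domain with `cpus` CPUs:
 * admit the new task only if the summed bandwidth does not exceed
 * cpus * (dl_runtime / dl_period).
 */
static bool dl_admit_sketch(unsigned int cpus, uint64_t dl_period,
			    uint64_t dl_runtime, uint64_t total_bw,
			    uint64_t new_bw)
{
	uint64_t cap;

	assert(cpus > 0);	/* a root domain spans at least one CPU */
	cap = cpus * to_ratio_sketch(dl_period, dl_runtime);

	return total_bw + new_bw <= cap;
}
```

With a 950000 us runtime per 1000000 us period on one CPU, a task asking for half a CPU is admitted when nothing else is running, and rejected when another half-CPU task already holds bandwidth.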

sched/fair: Init cfs_rq's sched_entity load average The runnable load and utilization averages of cfs_rq's sched_entity were not initialized. As is done for a task, give a new cfs_rq's sched_entity start values that weight its load heavily in its early life. Signed-off-by: Yuyang Du <yuyang.du@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: arjan@linux.intel.com Cc: bsegall@google.com Cc: dietmar.eggemann@arm.com Cc: fengguang.wu@intel.com Cc: len.brown@intel.com Cc: morten.rasmussen@arm.com Cc: pjt@google.com Cc: rafael.j.wysocki@intel.com Cc: umgwanakikbuti@gmail.com Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/1436918682-4971-5-git-send-email-yuyang.du@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org> 2015-07-15 08:04:39 +08:00			`extern void init_entity_runnable_average(struct sched_entity *se);`
BACKPORT: sched/fair: Initiate a new task's util avg to a bounded value A new task's util_avg is set to full utilization of a CPU (100% time running). This accelerates a new task's utilization ramp-up, useful to boost its execution in early time. However, it may result in (insanely) high utilization for a transient time period when a flood of tasks are spawned. Importantly, it violates the "fundamentally bounded" CPU utilization, and its side effect is negative if we don't take any measure to bound it. This patch proposes an algorithm to address this issue. It has two methods to approach a sensible initial util_avg: (1) An expected (or average) util_avg based on its cfs_rq's util_avg: util_avg = cfs_rq->util_avg / (cfs_rq->load_avg + 1) * se.load.weight (2) A trajectory of how successive new tasks' util develops, which gives 1/2 of the left utilization budget to a new task such that the additional util is noticeably large (when overall util is low) or unnoticeably small (when overall util is high enough). In the meantime, the aggregate utilization is well bounded: util_avg_cap = (1024 - cfs_rq->avg.util_avg) / 2^n where n denotes the nth task. If util_avg is larger than util_avg_cap, then the effective util is clamped to the util_avg_cap. 
Change-Id: Idafe989b24d9e70911666f09800bf1d5a011e1f4 Reported-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Signed-off-by: Yuyang Du <yuyang.du@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: bsegall@google.com Cc: morten.rasmussen@arm.com Cc: pjt@google.com Cc: steve.muckle@linaro.org Link: http://lkml.kernel.org/r/1459283456-21682-1-git-send-email-yuyang.du@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org> (cherry picked from commit 2b8c41daba327c633228169e8bd8ec067ab443f8) [integrate with schedfreq - schedfreq has a tuneable for init task util but this commit removes the use of the tuneable since we have a new algorithm for calculating an initial utilisation. I've left the tuneable in place, but it is no longer used even when schedfreq is the CPUFreq governor] Signed-off-by: Chris Redpath <chris.redpath@arm.com> 2016-03-30 04:30:56 +08:00			`extern void post_init_entity_util_avg(struct sched_entity *se);`
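The bounding rule described in the commit message for `post_init_entity_util_avg()` can be sketched as follows. This is a simplified user-space illustration under stated assumptions: `struct cfs_avg_sketch`, `init_util_avg_sketch()` and `CAPACITY_SCALE_SKETCH` are hypothetical names, the value 1024 is taken from the formula in the message, and the multiplication is done before the division only to avoid integer truncation.

```c
#include <assert.h>

#define CAPACITY_SCALE_SKETCH 1024L	/* full utilization of one CPU */

/* Hypothetical, simplified view of the run queue averages used here. */
struct cfs_avg_sketch {
	long util_avg;	/* current utilization of the run queue */
	long load_avg;	/* current load of the run queue */
};

/* Initial util_avg for a new task with the given load weight. */
static long init_util_avg_sketch(const struct cfs_avg_sketch *cfs, long weight)
{
	/* Half of the utilization budget that is still unused. */
	long cap = (CAPACITY_SCALE_SKETCH - cfs->util_avg) / 2;
	long util;

	assert(cfs->util_avg >= 0 && cfs->load_avg >= 0 && weight >= 0);

	if (cap <= 0)
		return 0;	/* no budget left for the new task */

	if (cfs->util_avg == 0)
		return cap;	/* nothing to extrapolate from */

	/* Method (1): expected share based on the queue's averages. */
	util = cfs->util_avg * weight / (cfs->load_avg + 1);

	/* Method (2): clamp to half of the remaining budget. */
	return util > cap ? cap : util;
}
```

Because each new task can take at most half of what is left, the nth task spawned in a burst is bounded by the original budget divided by 2^n, which is what keeps the aggregate below 1024.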
sched: Set an initial value of runnable avg for new forked task We need to initialize the se.avg.{decay_count, load_avg_contrib} for a new forked task. Otherwise random values of the above variables cause a mess when a new task is enqueued: enqueue_task_fair enqueue_entity enqueue_entity_load_avg and unbalance fork balancing because of an incorrect load_avg_contrib. Furthermore, Morten Rasmussen noticed that some tasks were not launched at once after being created. So Paul and Peter suggested giving the new task's runnable avg time a start value equal to sched_slice(). PeterZ said: > So the 'problem' is that our running avg is a 'floating' average; ie. it > decays with time. Now we have to guess about the future of our newly > spawned task -- something that is nigh impossible seeing these CPU > vendors keep refusing to implement the crystal ball instruction. > > So there's two asymptotic cases we want to deal well with; 1) the case > where the newly spawned program will be 'nearly' idle for its lifetime; > and 2) the case where its cpu-bound. > > Since we have to guess, we'll go for worst case and assume its > cpu-bound; now we don't want to make the avg so heavy adjusting to the > near-idle case takes forever. We want to be able to quickly adjust and > lower our running avg. > > Now we also don't want to make our avg too light, such that it gets > decremented just for the new task not having had a chance to run yet -- > even if when it would run, it would be more cpu-bound than not. > > So what we do is we make the initial avg of the same duration as that we > guess it takes to run each task on the system at least once -- aka > sched_slice(). > > Of course we can defeat this with wakeup/fork bombs, but in the 'normal' > case it should be good enough. Paul also contributed most of the code comments in this commit. 
Signed-off-by: Alex Shi <alex.shi@intel.com> Reviewed-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Reviewed-by: Paul Turner <pjt@google.com> [peterz; added explanation of sched_slice() usage] Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1371694737-29336-4-git-send-email-alex.shi@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org> 2013-06-20 10:18:47 +08:00
CHROMIUM: sched: update the average of nr_running Doing an exponential moving average per nr_running++/-- does not guarantee a fixed sample rate, which induces errors if there are lots of threads being enqueued/dequeued from the rq (Linpack mt). Instead of keeping track of the avg, the scheduler now keeps track of the integral of nr_running and allows the readers to perform filtering on top. Original-author: Sai Charan Gurrappadi <sgurrappadi@nvidia.com> Change-Id: Id946654f32fa8be0eaf9d8fa7c9a8039b5ef9fab Signed-off-by: Joseph Lo <josephl@nvidia.com> Signed-off-by: Andrew Bresticker <abrestic@chromium.org> Reviewed-on: https://chromium-review.googlesource.com/174694 Reviewed-on: https://chromium-review.googlesource.com/272853 [jstultz: fwdported to 4.4] Signed-off-by: John Stultz <john.stultz@linaro.org> 2013-04-22 14:39:18 +08:00			`static inline void __add_nr_running(struct rq *rq, unsigned count)`
			`{`
sched, nohz: Change rq->nr_running to always use wrappers Sometimes ->nr_running may cross 2 but no interrupt is sent to the rq's cpu. In this case we don't reenable the timer. It looks like this may be the reason for rare unexpected effects when nohz is enabled. The patch replaces all places that change nr_running directly and makes add_nr_running() take care of the border crossing. Signed-off-by: Kirill Tkhai <tkhai@yandex.ru> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20140508225830.2469.97461.stgit@localhost Signed-off-by: Ingo Molnar <mingo@kernel.org> 2014-05-09 03:00:14 +04:00			`unsigned prev_nr = rq->nr_running;`

sched: Fix bug in average nr_running and nr_iowait calculation sched_get_nr_running_avg() returns average nr_running and nr_iowait task count since it was last invoked. Fix several bugs in their calculation. * sched_update_nr_prod() needs to consider that nr_running count can change by more than 1 when CFS_BANDWIDTH feature is used * sched_get_nr_running_avg() needs to sum up nr_iowait count across all cpus, rather than just one * sched_get_nr_running_avg() could race with sched_update_nr_prod(), as a result of which it could use curr_time which is behind a cpu's 'last_time' value. That would lead to erroneous calculation of average nr_running or nr_iowait. While at it, fix also a bug in BUG_ON() check in sched_update_nr_prod() function and remove unnecessary nr_running argument to sched_update_nr_prod() function. Change-Id: I46737614737292fae0d7204c4648fb9b862f65b2 Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org> [rameezmustafa@codeaurora.org: Port to msm-3.18] Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org> 2015-02-16 16:42:59 +05:30			`sched_update_nr_prod(cpu_of(rq), count, true);`
			`rq->nr_running = prev_nr + count;`
sched: Kick full dynticks CPU that have more than one task enqueued. Kick the tick on full dynticks CPUs when they get more than one task running on their queue. This makes sure that local fairness is maintained by the tick on the destination. This is done regardless of these tasks' class. We should be able to be more clever in the future depending on these. eg: a CPU that runs a SCHED_FIFO task doesn't need to maintain fairness against local pending tasks of the fair class. But keep things simple for now. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Christoph Lameter <cl@linux.com> Cc: Geoff Levand <geoff@infradead.org> Cc: Gilad Ben Yossef <gilad@benyossef.com> Cc: Hakan Akkan <hakanakkan@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Kevin Hilman <khilman@linaro.org> Cc: Li Zhong <zhong@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> 2013-04-20 14:35:09 +02:00
			`if (prev_nr < 2 && rq->nr_running >= 2) {`
sched/fair: Implement fast idling of CPUs when the system is partially loaded When a system is lightly loaded (i.e. no more than 1 job per cpu), attempting to pull a job to a cpu before putting it to idle is unnecessary and can be skipped. This patch adds an indicator so the scheduler can know when no more than 1 active job is on any CPU in the system, to skip needless job pulls. On a 4 socket machine with a request/response kind of workload from clients, we saw about 0.13 msec delay when we go through a full load balance to try to pull a job from all the other cpus. While 0.1 msec was spent on processing the request and generating a response, the 0.13 msec load balance overhead was actually more than the actual work being done. This overhead can be skipped much of the time for lightly loaded systems. With this patch, we tested with a netperf request/response workload that has the server busy with half the cpus in a 4 socket system. We found the patch eliminated 75% of the load balance attempts before idling a cpu. The overhead of setting/clearing the indicator is low as we already gather the necessary info while we call add_nr_running() and update_sd_lb_stats(). We switch to full load balance immediately if any cpu gets more than one job on its run queue in add_nr_running(). We'll clear the indicator to avoid load balance when we detect that no cpus have more than one job when we scan the work queues in update_sg_lb_stats(). We are aggressive in turning on the load balance and opportunistic in skipping the load balance. 
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Jason Low <jason.low2@hp.com> Cc: "Paul E.McKenney" <paulmck@linux.vnet.ibm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Davidlohr Bueso <davidlohr@hp.com> Cc: Alex Shi <alex.shi@linaro.org> Cc: Michel Lespinasse <walken@google.com> Cc: Peter Hurley <peter@hurleysoftware.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1403551009.2970.613.camel@schen9-DESK Signed-off-by: Ingo Molnar <mingo@kernel.org> 2014-06-23 12:16:49 -07:00			`#ifdef CONFIG_SMP`
			`if (!rq->rd->overload)`
			`rq->rd->overload = true;`
			`#endif`

			`#ifdef CONFIG_NO_HZ_FULL`
			`if (tick_nohz_full_cpu(rq->cpu)) {`
nohz: Use IPI implicit full barrier against rq->nr_running r/w A full dynticks CPU is allowed to stop its tick when a single task runs. Meanwhile when a new task gets enqueued, the CPU must be notified so that it can restart its tick to maintain local fairness and other accounting details. This notification is performed by way of an IPI. Then when the target receives the IPI, we expect it to see the new value of rq->nr_running. Hence the following ordering scenario: CPU 0 CPU 1 write rq->running get IPI smp_wmb() smp_rmb() send IPI read rq->nr_running But Paul Mckenney says that nowadays IPIs imply a full barrier on all architectures. So we can safely remove this pair and rely on the implicit barriers that come along IPI send/receive. Lets just comment on this new assumption. Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Kevin Hilman <khilman@linaro.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> 2014-03-18 22:54:04 +01:00			`/*`
			`* Tick is needed if more than one task runs on a CPU.`
			`* Send the target an IPI to kick it out of nohz mode.`
			`*`
			`* We assume that IPI implies full memory barrier and the`
			`* new value of rq->nr_running is visible on reception`
			`* from the target.`
			`*/`
nohz: Use nohz own full kick on 2nd task enqueue Now that we have a nohz full remote kick based on irq work, let's use it to notify a CPU that it's exiting single task mode. This unbloats a bit the scheduler IPI that the nohz code was abusing for its cool "callable anywhere/anytime" properties. Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Kevin Hilman <khilman@linaro.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> 2014-03-18 21:12:53 +01:00			`tick_nohz_full_kick_cpu(rq->cpu);`
			`}`
			`#endif`
			`}`
			`}`
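The `prev_nr < 2 && rq->nr_running >= 2` test in `__add_nr_running()` above compares the count before and after the update, so it fires exactly once per crossing even when `count` is larger than 1. The user-space sketch below illustrates that behaviour only; `struct rq_sketch`, its `kicks` counter and `add_nr_running_sketch()` are hypothetical stand-ins, not kernel structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced run queue used only for this illustration. */
struct rq_sketch {
	unsigned int nr_running;
	bool overload;		/* stands in for rq->rd->overload */
	unsigned int kicks;	/* counts simulated nohz tick kicks */
};

static void add_nr_running_sketch(struct rq_sketch *rq, unsigned int count)
{
	unsigned int prev_nr;

	assert(rq != NULL);
	prev_nr = rq->nr_running;
	rq->nr_running = prev_nr + count;

	/* Act only on the transition from "fewer than 2" to "2 or more". */
	if (prev_nr < 2 && rq->nr_running >= 2) {
		rq->overload = true;
		rq->kicks++;	/* the header kicks the tick at this point */
	}
}
```

Adding tasks one at a time triggers the branch on the second task only, and adding three tasks at once to an empty queue also triggers it exactly once.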

			`static inline void __sub_nr_running(struct rq *rq, unsigned count)`
			`{`
			`sched_update_nr_prod(cpu_of(rq), count, false);`
			`rq->nr_running -= count;`
sched: Make separate sched.c translation units Since once needs to do something at conferences and fixing compile warnings doesn't actually require much if any attention I decided to break up the sched.c #include ".c" fest. This further modularizes the scheduler code. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-x0fcd3mnp8f9c99grcpewmhi@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu> 2011-10-25 10:00:11 +02:00			`}`

CHROMIUM: sched: update the average of nr_running Doing a Exponential moving average per nr_running++/-- does not guarantee a fixed sample rate which induces errors if there are lots of threads being enqueued/dequeued from the rq (Linpack mt). Instead of keeping track of the avg, the scheduler now keeps track of the integral of nr_running and allows the readers to perform filtering on top. Original-author: Sai Charan Gurrappadi <sgurrappadi@nvidia.com> Change-Id: Id946654f32fa8be0eaf9d8fa7c9a8039b5ef9fab Signed-off-by: Joseph Lo <josephl@nvidia.com> Signed-off-by: Andrew Bresticker <abrestic@chromium.org> Reviewed-on: https://chromium-review.googlesource.com/174694 Reviewed-on: https://chromium-review.googlesource.com/272853 [jstultz: fwdported to 4.4] Signed-off-by: John Stultz <john.stultz@linaro.org> 2013-04-22 14:39:18 +08:00			`#ifdef CONFIG_CPU_QUIET`
			`#define NR_AVE_SCALE(x) ((x) << FSHIFT)`
			`static inline u64 do_nr_running_integral(struct rq *rq)`
			`{`
			`s64 nr, deltax;`
			`u64 nr_running_integral = rq->nr_running_integral;`

			`deltax = rq->clock_task - rq->nr_last_stamp;`
			`nr = NR_AVE_SCALE(rq->nr_running);`

			`nr_running_integral += nr * deltax;`

			`return nr_running_integral;`
			`}`

			`static inline void add_nr_running(struct rq *rq, unsigned count)`
			`{`
			`write_seqcount_begin(&rq->ave_seqcnt);`
			`rq->nr_running_integral = do_nr_running_integral(rq);`
			`rq->nr_last_stamp = rq->clock_task;`
			`__add_nr_running(rq, count);`
			`write_seqcount_end(&rq->ave_seqcnt);`
			`}`

			`static inline void sub_nr_running(struct rq *rq, unsigned count)`
			`{`
			`write_seqcount_begin(&rq->ave_seqcnt);`
			`rq->nr_running_integral = do_nr_running_integral(rq);`
			`rq->nr_last_stamp = rq->clock_task;`
			`__sub_nr_running(rq, count);`
			`write_seqcount_end(&rq->ave_seqcnt);`
			`}`
			`#else`
			`#define add_nr_running __add_nr_running`
			`#define sub_nr_running __sub_nr_running`
			`#endif`
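
The CHROMIUM commit above keeps a running integral of `nr_running` and leaves the filtering to readers. A minimal user-space sketch of that idea, under stated assumptions: `struct toy_rq`, `toy_integral()`, `toy_set_nr_running()` and `toy_average()` are hypothetical names, not kernel APIs. `FSHIFT` is assumed to be 11, as in the kernel's load-average code. The seqcount protection used by the real `add_nr_running()`/`sub_nr_running()` is left out.

```c
#include <stdint.h>

#define FSHIFT 11 /* fixed-point shift; assumed to match the kernel's FSHIFT */
#define NR_AVE_SCALE(x) ((uint64_t)(x) << FSHIFT)

/* Hypothetical stand-in for the few struct rq fields used by the integral. */
struct toy_rq {
	unsigned int nr_running;
	uint64_t clock_task;          /* current time, ns */
	uint64_t nr_last_stamp;       /* time of the last integral update, ns */
	uint64_t nr_running_integral; /* sum of scaled nr_running * elapsed ns */
};

/* Mirrors do_nr_running_integral(): extend the integral up to clock_task. */
static uint64_t toy_integral(const struct toy_rq *rq)
{
	uint64_t deltax = rq->clock_task - rq->nr_last_stamp;

	return rq->nr_running_integral + NR_AVE_SCALE(rq->nr_running) * deltax;
}

/* Fold the elapsed interval into the integral, then change nr_running. */
static void toy_set_nr_running(struct toy_rq *rq, uint64_t now, unsigned int nr)
{
	rq->clock_task = now;
	rq->nr_running_integral = toy_integral(rq);
	rq->nr_last_stamp = now;
	rq->nr_running = nr;
}

/* Reader side: average nr_running (scaled by FSHIFT) between two snapshots. */
static uint64_t toy_average(uint64_t i0, uint64_t t0, uint64_t i1, uint64_t t1)
{
	return (t1 == t0) ? 0 : (i1 - i0) / (t1 - t0);
}
```

With two tasks runnable for 100 ns and then four for another 100 ns, the reader sees an average of three tasks in `FSHIFT` fixed point, regardless of how many enqueue/dequeue events happened in between.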

sched: Keep at least 1 tick per second for active dynticks tasks The scheduler doesn't yet fully support environments with a single task running without a periodic tick. In order to ensure we still maintain the duties of scheduler_tick(), keep at least 1 tick per second. This makes sure that we keep the progression of various scheduler accounting and background maintainance even with a very low granularity. Examples include cpu load, sched average, CFS entity vruntime, avenrun and events such as load balancing, amongst other details handled in sched_class::task_tick(). This limitation will be removed in the future once we get these individual items to work in full dynticks CPUs. Suggested-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Hakan Akkan <hakanakkan@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Kevin Hilman <khilman@linaro.org> Cc: Li Zhong <zhong@linux.vnet.ibm.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> 2013-05-03 03:39:05 +02:00			`static inline void rq_last_tick_reset(struct rq *rq)`
			`{`
			`#ifdef CONFIG_NO_HZ_FULL`
			`rq->last_sched_tick = jiffies;`
			`#endif`
			`}`

sched: Make separate sched.c translation units Since once needs to do something at conferences and fixing compile warnings doesn't actually require much if any attention I decided to break up the sched.c #include ".c" fest. This further modularizes the scheduler code. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-x0fcd3mnp8f9c99grcpewmhi@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu> 2011-10-25 10:00:11 +02:00			`extern void update_rq_clock(struct rq *rq);`

			`extern void activate_task(struct rq *rq, struct task_struct *p, int flags);`
			`extern void deactivate_task(struct rq *rq, struct task_struct *p, int flags);`

			`extern void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags);`

			`extern const_debug unsigned int sysctl_sched_time_avg;`
			`extern const_debug unsigned int sysctl_sched_nr_migrate;`
			`extern const_debug unsigned int sysctl_sched_migration_cost;`

			`static inline u64 sched_avg_period(void)`
			`{`
			`return (u64)sysctl_sched_time_avg * NSEC_PER_MSEC / 2;`
			`}`

			`#ifdef CONFIG_SCHED_HRTICK`

			`/*`
			`* Use hrtick when:`
			`* - enabled by features`
			`* - hrtimer is actually high res`
			`*/`
			`static inline int hrtick_enabled(struct rq *rq)`
			`{`
			`if (!sched_feat(HRTICK))`
			`return 0;`
			`if (!cpu_active(cpu_of(rq)))`
			`return 0;`
			`return hrtimer_is_hres_active(&rq->hrtick_timer);`
			`}`

			`void hrtick_start(struct rq *rq, u64 delay);`

sched: Save some hrtick_start_fair cycles hrtick_start_fair() shows up in profiles even when disabled. v3.0.6 taskset -c 3 pipe-test PerfTop: 997 irqs/sec kernel:89.5% exact: 0.0% [1000Hz cycles], (all, CPU: 3)

  Virgin                                         Patched
  samples   pcnt  function                       samples   pcnt  function
  2880.00  10.2%  __schedule                     3136.00  11.3%  __schedule
  1634.00   5.8%  pipe_read                      1615.00   5.8%  pipe_read
  1458.00   5.2%  system_call                    1534.00   5.5%  system_call
  1382.00   4.9%  _raw_spin_lock_irqsave         1412.00   5.1%  _raw_spin_lock_irqsave
  1202.00   4.3%  pipe_write                     1255.00   4.5%  copy_user_generic_string
  1164.00   4.1%  copy_user_generic_string       1241.00   4.5%  __switch_to
  1097.00   3.9%  __switch_to                     929.00   3.3%  mutex_lock
   872.00   3.1%  mutex_lock                      846.00   3.0%  mutex_unlock
   687.00   2.4%  mutex_unlock                    804.00   2.9%  pipe_write
   682.00   2.4%  native_sched_clock              713.00   2.6%  native_sched_clock
   643.00   2.3%  system_call_after_swapgs        653.00   2.3%  _raw_spin_unlock_irqrestore
   617.00   2.2%  sched_clock_local               633.00   2.3%  fsnotify
   612.00   2.2%  fsnotify                        605.00   2.2%  sched_clock_local
   596.00   2.1%  _raw_spin_unlock_irqrestore     593.00   2.1%  system_call_after_swapgs
   542.00   1.9%  sysret_check                    559.00   2.0%  sysret_check
   467.00   1.7%  fget_light                      472.00   1.7%  fget_light
   462.00   1.6%  finish_task_switch              461.00   1.7%  finish_task_switch
   437.00   1.5%  vfs_write                       442.00   1.6%  vfs_write
   431.00   1.5%  do_sync_write                   428.00   1.5%  do_sync_write
   413.00   1.5%  select_task_rq_fair             404.00   1.5%  _raw_spin_lock_irq
   386.00   1.4%  update_curr                     402.00   1.4%  update_curr
   385.00   1.4%  rw_verify_area                  389.00   1.4%  do_sync_read
   377.00   1.3%  _raw_spin_lock_irq              378.00   1.4%  vfs_read
   369.00   1.3%  do_sync_read                    340.00   1.2%  pipe_iov_copy_from_user
   360.00   1.3%  vfs_read                        316.00   1.1%  __wake_up_sync_key
*  342.00   1.2%  hrtick_start_fair               313.00   1.1%  __wake_up_common

Signed-off-by: Mike Galbraith <efault@gmx.de> [ fixed !CONFIG_SCHED_HRTICK borkage ] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1321971607.6855.17.camel@marge.simson.net Signed-off-by: Ingo Molnar <mingo@elte.hu> 2011-11-22 15:20:07 +01:00			`#else`

			`static inline int hrtick_enabled(struct rq *rq)`
			`{`
			`return 0;`
			`}`

sched: Make separate sched.c translation units Since once needs to do something at conferences and fixing compile warnings doesn't actually require much if any attention I decided to break up the sched.c #include ".c" fest. This further modularizes the scheduler code. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-x0fcd3mnp8f9c99grcpewmhi@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu> 2011-10-25 10:00:11 +02:00			`#endif /* CONFIG_SCHED_HRTICK */`

			`#ifdef CONFIG_SMP`
			`extern void sched_avg_update(struct rq *rq);`
sched: Optimize freq invariant accounting Currently the freq invariant accounting (in __update_entity_runnable_avg() and sched_rt_avg_update()) get the scale factor from a weak function call, this means that even for archs that default on their implementation the compiler cannot see into this function and optimize the extra scaling math away. This is sad, esp. since its a 64-bit multiplication which can be quite costly on some platforms. So replace the weak function with #ifdef and __always_inline goo. This is not quite as nice from an arch support PoV but should at least result in compile time errors if done wrong. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Ben Segall <bsegall@google.com> Cc: Morten.Rasmussen@arm.com Cc: Paul Turner <pjt@google.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: dietmar.eggemann@arm.com Cc: efault@gmx.de Cc: kamalesh@linux.vnet.ibm.com Cc: nicolas.pitre@linaro.org Cc: preeti@linux.vnet.ibm.com Cc: riel@redhat.com Link: http://lkml.kernel.org/r/20150323131905.GF23123@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org> 2015-03-23 14:19:05 +01:00
			`#ifndef arch_scale_freq_capacity`
			`static __always_inline`
			`unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)`
			`{`
			`return SCHED_CAPACITY_SCALE;`
			`}`
			`#endif`
sched: Make scale_rt invariant with frequency The average running time of RT tasks is used to estimate the remaining compute capacity for CFS tasks. This remaining capacity is the original capacity scaled down by a factor (aka scale_rt_capacity). This estimation of available capacity must also be invariant with frequency scaling. A frequency scaling factor is applied on the running time of the RT tasks for computing scale_rt_capacity. In sched_rt_avg_update(), we now scale the RT execution time like below: rq->rt_avg += rt_delta * arch_scale_freq_capacity() >> SCHED_CAPACITY_SHIFT Then, scale_rt_capacity can be summarized by: scale_rt_capacity = SCHED_CAPACITY_SCALE * available / total with available = total - rq->rt_avg This has been been optimized in current code by: scale_rt_capacity = available / (total >> SCHED_CAPACITY_SHIFT) But we can also developed the equation like below: scale_rt_capacity = SCHED_CAPACITY_SCALE - ((rq->rt_avg << SCHED_CAPACITY_SHIFT) / total) and we can optimize the equation by removing SCHED_CAPACITY_SHIFT shift in the computation of rq->rt_avg and scale_rt_capacity(). so rq->rt_avg += rt_delta * arch_scale_freq_capacity() and scale_rt_capacity = SCHED_CAPACITY_SCALE - (rq->rt_avg / total) arch_scale_frequency_capacity() will be called in the hot path of the scheduler which implies to have a short and efficient function. As an example, arch_scale_frequency_capacity() should return a cached value that is updated periodically outside of the hot path. 
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Morten.Rasmussen@arm.com Cc: dietmar.eggemann@arm.com Cc: efault@gmx.de Cc: kamalesh@linux.vnet.ibm.com Cc: linaro-kernel@lists.linaro.org Cc: nicolas.pitre@linaro.org Cc: preeti@linux.vnet.ibm.com Cc: riel@redhat.com Link: http://lkml.kernel.org/r/1425052454-25797-6-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org> 2015-02-27 16:54:08 +01:00
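
The commit message above reduces the RT scaling to two formulas: `rq->rt_avg += rt_delta * arch_scale_freq_capacity()` and `scale_rt_capacity = SCHED_CAPACITY_SCALE - (rq->rt_avg / total)`. A small user-space sketch of that arithmetic follows. The `toy_*` helpers are hypothetical, the decay of `rt_avg` over time is ignored, and the floor of 1 for a saturated CPU is an assumption of this sketch.

```c
#include <stdint.h>

#define SCHED_CAPACITY_SHIFT 10
#define SCHED_CAPACITY_SCALE (1UL << SCHED_CAPACITY_SHIFT)

/* rt_avg += rt_delta * freq_capacity (no >> SCHED_CAPACITY_SHIFT here). */
static uint64_t toy_rt_avg_update(uint64_t rt_avg, uint64_t rt_delta_ns,
				  unsigned long freq_capacity)
{
	return rt_avg + rt_delta_ns * freq_capacity;
}

/* scale_rt_capacity = SCHED_CAPACITY_SCALE - (rt_avg / total). */
static unsigned long toy_scale_rt_capacity(uint64_t rt_avg, uint64_t total_ns)
{
	uint64_t used;

	if (total_ns == 0)
		return SCHED_CAPACITY_SCALE; /* nothing measured yet */

	used = rt_avg / total_ns;
	if (used >= SCHED_CAPACITY_SCALE)
		return 1; /* assumed floor when RT consumed the whole period */

	return SCHED_CAPACITY_SCALE - (unsigned long)used;
}
```

For example, 250000 ns of RT time in a 1000000 ns period leaves 768 of 1024 for CFS at full frequency, and 896 if the CPU ran at half frequency (`freq_capacity` of 512), because the same wall-clock time represents half the work.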
sched/fair: Convert arch_scale_cpu_capacity() from weak function to #define Bring arch_scale_cpu_capacity() in line with the recent change of its arch_scale_freq_capacity() sibling in commit dfbca41f3479 ("sched: Optimize freq invariant accounting") from weak function to #define to allow inlining of the function. While at it, remove the ARCH_CAPACITY sched_feature as well. With the change to #define there isn't a straightforward way to allow runtime switch between an arch implementation and the default implementation of arch_scale_cpu_capacity() using sched_feature. The default was to use the arch-specific implementation, but only the arm architecture provides one and that is essentially equivalent to the default implementation. Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Dietmar Eggemann <Dietmar.Eggemann@arm.com> Cc: Juri Lelli <Juri.Lelli@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: daniel.lezcano@linaro.org Cc: mturquette@baylibre.com Cc: pang.xunlei@zte.com.cn Cc: rjw@rjwysocki.net Cc: sgurrappadi@nvidia.com Cc: vincent.guittot@linaro.org Cc: yuyang.du@intel.com Link: http://lkml.kernel.org/r/1439569394-11974-3-git-send-email-morten.rasmussen@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> 2015-08-14 17:23:10 +01:00			`#ifndef arch_scale_cpu_capacity`
			`static __always_inline`
			`unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)`
			`{`
sched/fair: Make utilization tracking CPU scale-invariant Besides the existing frequency scale-invariance correction factor, apply CPU scale-invariance correction factor to utilization tracking to compensate for any differences in compute capacity. This could be due to micro-architectural differences (i.e. instructions per seconds) between cpus in HMP systems (e.g. big.LITTLE), and/or differences in the current maximum frequency supported by individual cpus in SMP systems. In the existing implementation utilization isn't comparable between cpus as it is relative to the capacity of each individual CPU. Each segment of the sched_avg.util_sum geometric series is now scaled by the CPU performance factor too so the sched_avg.util_avg of each sched entity will be invariant from the particular CPU of the HMP/SMP system on which the sched entity is scheduled. With this patch, the utilization of a CPU stays relative to the max CPU performance of the fastest CPU in the system. In contrast to utilization (sched_avg.util_sum), load (sched_avg.load_sum) should not be scaled by compute capacity. The utilization metric is based on running time which only makes sense when cpus are _not_ fully utilized (utilization cannot go beyond 100% even if more tasks are added), where load is runnable time which isn't limited by the capacity of the CPU and therefore is a better metric for overloaded scenarios. If we run two nice-0 busy loops on two cpus with different compute capacity their load should be similar since their compute demands are the same. We have to assume that the compute demand of any task running on a fully utilized CPU (no spare cycles = 100% utilization) is high and the same no matter of the compute capacity of its current CPU, hence we shouldn't scale load by CPU capacity. 
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/55CE7409.1000700@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> 2015-08-15 00:04:41 +01:00			`if (sd && (sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1))`
sched/fair: Convert arch_scale_cpu_capacity() from weak function to #define Bring arch_scale_cpu_capacity() in line with the recent change of its arch_scale_freq_capacity() sibling in commit dfbca41f3479 ("sched: Optimize freq invariant accounting") from weak function to #define to allow inlining of the function. While at it, remove the ARCH_CAPACITY sched_feature as well. With the change to #define there isn't a straightforward way to allow runtime switch between an arch implementation and the default implementation of arch_scale_cpu_capacity() using sched_feature. The default was to use the arch-specific implementation, but only the arm architecture provides one and that is essentially equivalent to the default implementation. Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Dietmar Eggemann <Dietmar.Eggemann@arm.com> Cc: Juri Lelli <Juri.Lelli@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: daniel.lezcano@linaro.org Cc: mturquette@baylibre.com Cc: pang.xunlei@zte.com.cn Cc: rjw@rjwysocki.net Cc: sgurrappadi@nvidia.com Cc: vincent.guittot@linaro.org Cc: yuyang.du@intel.com Link: http://lkml.kernel.org/r/1439569394-11974-3-git-send-email-morten.rasmussen@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> 2015-08-14 17:23:10 +01:00			`return sd->smt_gain / sd->span_weight;`

			`return SCHED_CAPACITY_SCALE;`
			`}`
			`#endif`
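
To make the default `arch_scale_cpu_capacity()` concrete, here is a user-space sketch of the same branch. `struct toy_sched_domain` and `toy_scale_cpu_capacity()` are hypothetical stand-ins, the flag value is illustrative only, and the `smt_gain` of 1178 used in the usage note is an assumed typical default rather than something stated on this page.

```c
#define SCHED_CAPACITY_SHIFT 10
#define SCHED_CAPACITY_SCALE (1UL << SCHED_CAPACITY_SHIFT)
#define SD_SHARE_CPUCAPACITY 0x0080 /* illustrative flag value */

/* Hypothetical subset of struct sched_domain. */
struct toy_sched_domain {
	int flags;
	unsigned int smt_gain;    /* combined capacity of all SMT siblings */
	unsigned int span_weight; /* number of CPUs spanned by the domain */
};

/* SMT siblings split smt_gain evenly; any other CPU gets the full scale. */
static unsigned long toy_scale_cpu_capacity(const struct toy_sched_domain *sd)
{
	if (sd && (sd->flags & SD_SHARE_CPUCAPACITY) && sd->span_weight > 1)
		return sd->smt_gain / sd->span_weight;

	return SCHED_CAPACITY_SCALE;
}
```

With `smt_gain = 1178` and two hardware threads, each sibling is rated at 589, so the pair together counts as slightly more than one full CPU (1024). A NULL domain or a single-CPU span falls back to `SCHED_CAPACITY_SCALE`.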

sched/fair: jump to max OPP when crossing UP threshold Since the true utilization of a long running task is not detectable while it is running and might be bigger than the current cpu capacity, create the maximum cpu capacity head room by requesting the maximum cpu capacity once the cpu usage plus the capacity margin exceeds the current capacity. This is also done to try to harm the performance of a task the least. Original fair-class only version authored by Juri Lelli <juri.lelli@arm.com>. cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Juri Lelli <juri.lelli@arm.com> Signed-off-by: Steve Muckle <smuckle@linaro.org> 2015-06-25 14:12:33 +01:00			`#ifdef CONFIG_SMP`
			`static inline unsigned long capacity_of(int cpu)`
			`{`
			`return cpu_rq(cpu)->cpu_capacity;`
			`}`

			`static inline unsigned long capacity_orig_of(int cpu)`
			`{`
			`return cpu_rq(cpu)->cpu_capacity_orig;`
			`}`

sched: Introduce Window Assisted Load Tracking (WALT) use a window based view of time in order to track task demand and CPU utilization in the scheduler. Window Assisted Load Tracking (WALT) implementation credits: Srivatsa Vaddagiri, Steve Muckle, Syed Rameez Mustafa, Joonwoo Park, Pavan Kumar Kondeti, Olav Haugan 2016-03-06: Integration with EAS/refactoring by Vikram Mulukutla and Todd Kjos Change-Id: I21408236836625d4e7d7de1843d20ed5ff36c708 Includes fixes for issues: eas/walt: Use walt_ktime_clock() instead of ktime_get_ns() to avoid a race resulting in watchdog resets BUG: 29353986 Change-Id: Ic1820e22a136f7c7ebd6f42e15f14d470f6bbbdb Handle walt accounting anomoly during resume During resume, there is a corner case where on wakeup, a task's prev_runnable_sum can go negative. This is a workaround that fixes the condition and warns (instead of crashing). BUG: 29464099 Change-Id: I173e7874324b31a3584435530281708145773508 Signed-off-by: Todd Kjos <tkjos@google.com> Signed-off-by: Srinath Sridharan <srinathsr@google.com> Signed-off-by: Juri Lelli <juri.lelli@arm.com> [jstultz: fwdported to 4.4] Signed-off-by: John Stultz <john.stultz@linaro.org> 2016-05-31 09:08:38 -07:00			`extern unsigned int sysctl_sched_use_walt_cpu_util;`
			`extern unsigned int walt_ravg_window;`
sched: walt: Correct WALT window size initialization It is preferable that WALT window rollover occurs just before a tick, since the tick is an opportune moment to record a complete window's statistics, as well as report those stats to the cpu frequency governor. When CONFIG_HZ results in a TICK_NSEC that isn't a integral number, this requirement may be violated. Account for this by reducing the WALT window size to the nearest multiple of TICK_NSEC. Commit d368c6faa19b ("sched: walt: fix window misalignment when HZ=300") attempted to do this but WALT isn't using MIN_SCHED_RAVG_WINDOW as the window size and the patch was doing nothing. Also, change the type of 'walt_disabled' to bool and warn if an invalid window size causes WALT to be disabled. Change-Id: Ie3dcfc21a3df4408254ca1165a355bbe391ed5c7 Signed-off-by: Vikram Mulukutla <markivx@codeaurora.org> 2017-08-10 17:26:20 -07:00			`extern bool walt_disabled;`
sched/fair: jump to max OPP when crossing UP threshold Since the true utilization of a long running task is not detectable while it is running and might be bigger than the current cpu capacity, create the maximum cpu capacity head room by requesting the maximum cpu capacity once the cpu usage plus the capacity margin exceeds the current capacity. This is also done to try to harm the performance of a task the least. Original fair-class only version authored by Juri Lelli <juri.lelli@arm.com>. cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Juri Lelli <juri.lelli@arm.com> Signed-off-by: Steve Muckle <smuckle@linaro.org> 2015-06-25 14:12:33 +01:00			`/*`
			`* cpu_util returns the amount of capacity of a CPU that is used by CFS`
			`* tasks. The unit of the return value must be the one of capacity so we can`
			`* compare the utilization with the capacity of the CPU that is available for`
			`* CFS tasks (i.e. cpu_capacity).`
			`*`
			`* cfs_rq.avg.util_avg is the sum of running time of runnable tasks plus the`
			`* recent utilization of currently non-runnable tasks on a CPU. It represents`
			`* the amount of utilization of a CPU in the range [0..capacity_orig] where`
			`* capacity_orig is the cpu_capacity available at the highest frequency`
			`* (arch_scale_freq_capacity()).`
			`* The utilization of a CPU converges towards a sum equal to or less than the`
			`* current capacity (capacity_curr <= capacity_orig) of the CPU because it is`
			`* the running time on this CPU scaled by capacity_curr.`
			`*`
			`* Nevertheless, cfs_rq.avg.util_avg can be higher than capacity_curr or even`
			`* higher than capacity_orig because of unfortunate rounding in`
			`* cfs.avg.util_avg or just after migrating tasks and new task wakeups until`
			`* the average stabilizes with the new running time. We need to check that the`
			`* utilization stays within the range of [0..capacity_orig] and cap it if`
			`* necessary. Without utilization capping, a group could be seen as overloaded`
			`* (CPU0 utilization at 121% + CPU1 utilization at 80%) whereas CPU1 has 20% of`
			`* available capacity. We allow utilization to overshoot capacity_curr (but not`
			`* capacity_orig) as it is useful for predicting the capacity required after task`
			`* migrations (scheduler-driven DVFS).`
			`*/`
			`static inline unsigned long __cpu_util(int cpu, int delta)`
			`{`
			`unsigned long util = cpu_rq(cpu)->cfs.avg.util_avg;`
			`unsigned long capacity = capacity_orig_of(cpu);`

sched: Introduce Window Assisted Load Tracking (WALT) use a window based view of time in order to track task demand and CPU utilization in the scheduler. Window Assisted Load Tracking (WALT) implementation credits: Srivatsa Vaddagiri, Steve Muckle, Syed Rameez Mustafa, Joonwoo Park, Pavan Kumar Kondeti, Olav Haugan 2016-03-06: Integration with EAS/refactoring by Vikram Mulukutla and Todd Kjos Change-Id: I21408236836625d4e7d7de1843d20ed5ff36c708 Includes fixes for issues: eas/walt: Use walt_ktime_clock() instead of ktime_get_ns() to avoid a race resulting in watchdog resets BUG: 29353986 Change-Id: Ic1820e22a136f7c7ebd6f42e15f14d470f6bbbdb Handle walt accounting anomoly during resume During resume, there is a corner case where on wakeup, a task's prev_runnable_sum can go negative. This is a workaround that fixes the condition and warns (instead of crashing). BUG: 29464099 Change-Id: I173e7874324b31a3584435530281708145773508 Signed-off-by: Todd Kjos <tkjos@google.com> Signed-off-by: Srinath Sridharan <srinathsr@google.com> Signed-off-by: Juri Lelli <juri.lelli@arm.com> [jstultz: fwdported to 4.4] Signed-off-by: John Stultz <john.stultz@linaro.org> 2016-05-31 09:08:38 -07:00			`#ifdef CONFIG_SCHED_WALT`
sched: WALT: fix potential overflow Task demand and CPU util are in u64. Change-Id: If7ec1623e723026d3346201122aab0303a6d2ba2 Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> 2017-01-20 11:10:15 -08:00			`if (!walt_disabled && sysctl_sched_use_walt_cpu_util)`
			`util = div64_u64(cpu_rq(cpu)->cumulative_runnable_avg,`
			`walt_ravg_window >> SCHED_LOAD_SHIFT);`
sched: Introduce Window Assisted Load Tracking (WALT) use a window based view of time in order to track task demand and CPU utilization in the scheduler. Window Assisted Load Tracking (WALT) implementation credits: Srivatsa Vaddagiri, Steve Muckle, Syed Rameez Mustafa, Joonwoo Park, Pavan Kumar Kondeti, Olav Haugan 2016-03-06: Integration with EAS/refactoring by Vikram Mulukutla and Todd Kjos Change-Id: I21408236836625d4e7d7de1843d20ed5ff36c708 Includes fixes for issues: eas/walt: Use walt_ktime_clock() instead of ktime_get_ns() to avoid a race resulting in watchdog resets BUG: 29353986 Change-Id: Ic1820e22a136f7c7ebd6f42e15f14d470f6bbbdb Handle walt accounting anomoly during resume During resume, there is a corner case where on wakeup, a task's prev_runnable_sum can go negative. This is a workaround that fixes the condition and warns (instead of crashing). BUG: 29464099 Change-Id: I173e7874324b31a3584435530281708145773508 Signed-off-by: Todd Kjos <tkjos@google.com> Signed-off-by: Srinath Sridharan <srinathsr@google.com> Signed-off-by: Juri Lelli <juri.lelli@arm.com> [jstultz: fwdported to 4.4] Signed-off-by: John Stultz <john.stultz@linaro.org> 2016-05-31 09:08:38 -07:00			`#endif`
sched/fair: jump to max OPP when crossing UP threshold Since the true utilization of a long running task is not detectable while it is running and might be bigger than the current cpu capacity, create the maximum cpu capacity head room by requesting the maximum cpu capacity once the cpu usage plus the capacity margin exceeds the current capacity. This is also done to try to harm the performance of a task the least. Original fair-class only version authored by Juri Lelli <juri.lelli@arm.com>. cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Juri Lelli <juri.lelli@arm.com> Signed-off-by: Steve Muckle <smuckle@linaro.org> 2015-06-25 14:12:33 +01:00			`delta += util;`
			`if (delta < 0)`
			`return 0;`

			`return (delta >= capacity) ? capacity : delta;`
			`}`

			`static inline unsigned long cpu_util(int cpu)`
			`{`
			`return __cpu_util(cpu, 0);`
			`}`
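
The capping described in the comment above `__cpu_util()` can be tried in isolation. The sketch below is a user-space rendition under stated assumptions: `toy_cpu_util()` is a hypothetical helper that takes the utilization and original capacity as plain arguments instead of reading them from the runqueue, and it uses a signed `long` for the sum so the negative check does not rely on mixed signed/unsigned comparison.

```c
/*
 * Clamp (util + delta) into [0, capacity_orig], as __cpu_util() does.
 * util and capacity_orig are in capacity units (1024 == one full CPU).
 */
static unsigned long toy_cpu_util(unsigned long util,
				  unsigned long capacity_orig, long delta)
{
	long sum = (long)util + delta;

	if (sum < 0)
		return 0; /* removing more than is there yields zero */

	return ((unsigned long)sum >= capacity_orig) ? capacity_orig
						     : (unsigned long)sum;
}
```

A CPU whose tracked utilization has overshot to 121% of a 1024 capacity (1239) is reported as 1024, so it no longer hides the 20% spare capacity of a neighbor sitting at 80% when the group is summed.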

sched: EAS/WALT: use cr_avg instead of prev_runnable_sum WALT accounts two major statistics; CPU load and cumulative tasks demand. The CPU load which is account of accumulated each CPU's absolute execution time is for CPU frequency guidance. Whereas cumulative tasks demand which is each CPU's instantaneous load to reflect CPU's load at given time is for task placement decision. Use cumulative tasks demand for cpu_util() for task placement and introduce cpu_util_freq() for frequency guidance. Change-Id: Id928f01dbc8cb2a617cdadc584c1f658022565c5 Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> 2016-12-08 16:12:12 -08:00			`static inline unsigned long cpu_util_freq(int cpu)`
			`{`
			`unsigned long util = cpu_rq(cpu)->cfs.avg.util_avg;`
			`unsigned long capacity = capacity_orig_of(cpu);`

			`#ifdef CONFIG_SCHED_WALT`
sched: WALT: fix potential overflow Task demand and CPU util are in u64. Change-Id: If7ec1623e723026d3346201122aab0303a6d2ba2 Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> 2017-01-20 11:10:15 -08:00			`	if (!walt_disabled && sysctl_sched_use_walt_cpu_util)`
			`		util = div64_u64(cpu_rq(cpu)->prev_runnable_sum,`
			`				 walt_ravg_window >> SCHED_LOAD_SHIFT);`
sched: EAS/WALT: use cr_avg instead of prev_runnable_sum WALT accounts two major statistics; CPU load and cumulative tasks demand. The CPU load which is account of accumulated each CPU's absolute execution time is for CPU frequency guidance. Whereas cumulative tasks demand which is each CPU's instantaneous load to reflect CPU's load at given time is for task placement decision. Use cumulative tasks demand for cpu_util() for task placement and introduce cpu_util_freq() for frequency guidance. Change-Id: Id928f01dbc8cb2a617cdadc584c1f658022565c5 Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> 2016-12-08 16:12:12 -08:00			`#endif`
			`	return (util >= capacity) ? capacity : util;`
			`}`

sched/fair: jump to max OPP when crossing UP threshold Since the true utilization of a long running task is not detectable while it is running and might be bigger than the current cpu capacity, create the maximum cpu capacity head room by requesting the maximum cpu capacity once the cpu usage plus the capacity margin exceeds the current capacity. This is also done to try to harm the performance of a task the least. Original fair-class only version authored by Juri Lelli <juri.lelli@arm.com>. cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Juri Lelli <juri.lelli@arm.com> Signed-off-by: Steve Muckle <smuckle@linaro.org> 2015-06-25 14:12:33 +01:00			`#endif`

Merge branch 'v4.4-16.09-android-tmp' into lsk-v4.4-16.09-android * v4.4-16.09-android-tmp: unsafe_[get\|put]_user: change interface to use a error target label usercopy: remove page-spanning test for now usercopy: fix overlap check for kernel text mm/slub: support left redzone Linux 4.4.21 lib/mpi: mpi_write_sgl(): fix skipping of leading zero limbs regulator: anatop: allow regulator to be in bypass mode hwrng: exynos - Disable runtime PM on probe failure cpufreq: Fix GOV_LIMITS handling for the userspace governor metag: Fix atomic__return inline asm constraints scsi: fix upper bounds check of sense key in scsi_sense_key_string() ALSA: timer: fix NULL pointer dereference on memory allocation failure ALSA: timer: fix division by zero after SNDRV_TIMER_IOCTL_CONTINUE ALSA: timer: fix NULL pointer dereference in read()/ioctl() race ALSA: hda - Enable subwoofer on Dell Inspiron 7559 ALSA: hda - Add headset mic quirk for Dell Inspiron 5468 ALSA: rawmidi: Fix possible deadlock with virmidi registration ALSA: fireworks: accessing to user space outside spinlock ALSA: firewire-tascam: accessing to user space outside spinlock ALSA: usb-audio: Add sample rate inquiry quirk for B850V3 CP2114 crypto: caam - fix IV loading for authenc (giv)decryption uprobes: Fix the memcg accounting x86/apic: Do not init irq remapping if ioapic is disabled vhost/scsi: fix reuse of &vq->iov[out] in response bcache: RESERVE_PRIO is too small by one when prio_buckets() is a power of two. 
ubifs: Fix assertion in layout_in_gaps() ovl: fix workdir creation ovl: listxattr: use strnlen() ovl: remove posix_acl_default from workdir ovl: don't copy up opaqueness wrappers for ->i_mutex access lustre: remove unused declaration timekeeping: Avoid taking lock in NMI path with CONFIG_DEBUG_TIMEKEEPING timekeeping: Cap array access in timekeeping_debug xfs: fix superblock inprogress check ASoC: atmel_ssc_dai: Don't unconditionally reset SSC on stream startup drm/msm: fix use of copy_from_user() while holding spinlock drm: Reject page_flip for !DRIVER_MODESET drm/radeon: fix radeon_move_blit on 32bit systems s390/sclp_ctl: fix potential information leak with /dev/sclp rds: fix an infoleak in rds_inc_info_copy powerpc/tm: Avoid SLB faults in treclaim/trecheckpoint when RI=0 nvme: Call pci_disable_device on the error path. cgroup: reduce read locked section of cgroup_threadgroup_rwsem during fork block: make sure a big bio is split into at most 256 bvecs block: Fix race triggered by blk_set_queue_dying() ext4: avoid modifying checksum fields directly during checksum verification ext4: avoid deadlock when expanding inode size ext4: properly align shifted xattrs when expanding inodes ext4: fix xattr shifting when expanding inodes part 2 ext4: fix xattr shifting when expanding inodes ext4: validate that metadata blocks do not overlap superblock net: Use ns_capable_noaudit() when determining net sysctl permissions kernel: Add noaudit variant of ns_capable() KEYS: Fix ASN.1 indefinite length object parsing drivers:hv: Lock access to hyperv_mmio resource tree cxlflash: Move to exponential back-off when cmd_room is not available netfilter: x_tables: check for size overflow drm/amdgpu/cz: enable/disable vce dpm even if vce pg is disabled cred: Reject inodes with invalid ids in set_create_file_as() fs: Check for invalid i_uid in may_follow_link() IB/IPoIB: Do not set skb truesize since using one linearskb udp: properly support MSG_PEEK with truncated buffers crypto: nx-842 
- Mask XERS0 bit in return value cxlflash: Fix to avoid virtual LUN failover failure cxlflash: Fix to escalate LINK_RESET also on port 1 tipc: fix nl compat regression for link statistics tipc: fix an infoleak in tipc_nl_compat_link_dump netfilter: x_tables: check for size overflow Bluetooth: Add support for Intel Bluetooth device 8265 [8087:0a2b] drm/i915: Check VBT for port presence in addition to the strap on VLV/CHV drm/i915: Only ignore eDP ports that are connected Input: xpad - move pending clear to the correct location net: thunderx: Fix link status reporting x86/hyperv: Avoid reporting bogus NMI status for Gen2 instances crypto: vmx - IV size failing on skcipher API tda10071: Fix dependency to REGMAP_I2C crypto: vmx - Fix ABI detection crypto: vmx - comply with ABIs that specify vrsave as reserved. HID: core: prevent out-of-bound readings lpfc: Fix DMA faults observed upon plugging loopback connector block: fix blk_rq_get_max_sectors for driver private requests irqchip/gicv3-its: numa: Enable workaround for Cavium thunderx erratum 23144 clocksource: Allow unregistering the watchdog btrfs: Continue write in case of can_not_nocow blk-mq: End unstarted requests on dying queue cxlflash: Fix to resolve dead-lock during EEH recovery drm/radeon/mst: fix regression in lane/link handling. ecryptfs: fix handling of directory opening ALSA: hda: add AMD Polaris-10/11 AZ PCI IDs with proper driver caps drm: Balance error path for GEM handle allocation ntp: Fix ADJ_SETOFFSET being used w/ ADJ_NANO time: Verify time values in adjtimex ADJ_SETOFFSET to avoid overflow Input: xpad - correctly handle concurrent LED and FF requests net: thunderx: Fix receive packet stats net: thunderx: Fix for multiqset not configured upon interface toggle perf/x86/cqm: Fix CQM memory leak and notifier leak perf/x86/cqm: Fix CQM handling of grouping events into a cache_group s390/crypto: provide correct file mode at device register. 
proc: revert /proc/<pid>/maps [stack:TID] annotation intel_idle: Support for Intel Xeon Phi Processor x200 Product Family cxlflash: Fix to avoid unnecessary scan with internal LUNs Drivers: hv: vmbus: don't manipulate with clocksources on crash Drivers: hv: vmbus: avoid scheduling in interrupt context in vmbus_initiate_unload() Drivers: hv: vmbus: avoid infinite loop in init_vp_index() arcmsr: fixes not release allocated resource arcmsr: fixed getting wrong configuration data s390/pci_dma: fix DMA table corruption with > 4 TB main memory net/mlx5e: Don't modify CQ before it was created net/mlx5e: Don't try to modify CQ moderation if it is not supported mmc: sdhci: Do not BUG on invalid vdd UVC: Add support for R200 depth camera sched/numa: Fix use-after-free bug in the task_numa_compare ALSA: hda - add codec support for Kabylake display audio codec drm/i915: Fix hpd live status bits for g4x tipc: fix nullptr crash during subscription cancel arm64: Add workaround for Cavium erratum 27456 net: thunderx: Fix for Qset error due to CQ full drm/radeon: fix dp link rate selection (v2) drm/amdgpu: fix dp link rate selection (v2) qla2xxx: Use ATIO type to send correct tmr response mmc: sdhci: 64-bit DMA actually has 4-byte alignment drm/atomic: Do not unset crtc when an encoder is stolen drm/i915/skl: Add missing SKL ids drm/i915/bxt: update list of PCIIDs hrtimer: Catch illegal clockids i40e/i40evf: Fix RSS rx-flow-hash configuration through ethtool mpt3sas: Fix for Asynchronous completion of timedout IO and task abort of timedout IO. 
mpt3sas: A correction in unmap_resources net: cavium: liquidio: fix check for in progress flag arm64: KVM: Configure TCR_EL2.PS at runtime irqchip/gic-v3: Make sure read from ICC_IAR1_EL1 is visible on redestributor pwm: lpc32xx: fix and simplify duty cycle and period calculations pwm: lpc32xx: correct number of PWM channels from 2 to 1 pwm: fsl-ftm: Fix clock enable/disable when using PM megaraid_sas: Add an i/o barrier megaraid_sas: Fix SMAP issue megaraid_sas: Do not allow PCI access during OCR s390/cio: update measurement characteristics s390/cio: ensure consistent measurement state s390/cio: fix measurement characteristics memleak qeth: initialize net_device with carrier off lpfc: Fix external loopback failure. lpfc: Fix mbox reuse in PLOGI completion lpfc: Fix RDP Speed reporting. lpfc: Fix crash in fcp command completion path. lpfc: Fix driver crash when module parameter lpfc_fcp_io_channel set to 16 lpfc: Fix RegLogin failed error seen on Lancer FC during port bounce lpfc: Fix the FLOGI discovery logic to comply with T11 standards lpfc: Fix FCF Infinite loop in lpfc_sli4_fcf_rr_next_index_get. 
cxl: Enable PCI device ID for future IBM CXL adapter cxl: fix build for GCC 4.6.x cxlflash: Enable device id for future IBM CXL adapter cxlflash: Resolve oops in wait_port_offline cxlflash: Fix to resolve cmd leak after host reset cxl: Fix DSI misses when the context owning task exits cxl: Fix possible idr warning when contexts are released Drivers: hv: vmbus: fix rescind-offer handling for device without a driver Drivers: hv: vmbus: serialize process_chn_event() and vmbus_close_internal() Drivers: hv: vss: run only on supported host versions drivers/hv: cleanup synic msrs if vmbus connect failed Drivers: hv: util: catch allocation errors tools: hv: report ENOSPC errors in hv_fcopy_daemon Drivers: hv: utils: run polling callback always in interrupt context Drivers: hv: util: Increase the timeout for util services lightnvm: fix missing grown bad block type lightnvm: fix locking and mempool in rrpc_lun_gc lightnvm: unlock rq and free ppa_list on submission fail lightnvm: add check after mempool allocation lightnvm: fix incorrect nr_free_blocks stat lightnvm: fix bio submission issue cxlflash: a couple off by one bugs fm10k: Cleanup exception handling for mailbox interrupt fm10k: Cleanup MSI-X interrupts in case of failure fm10k: reinitialize queuing scheme after calling init_hw fm10k: always check init_hw for errors fm10k: reset max_queues on init_hw_vf failure fm10k: Fix handling of NAPI budget when multiple queues are enabled per vector fm10k: Correct MTU for jumbo frames fm10k: do not assume VF always has 1 queue clk: xgene: Fix divider with non-zero shift value e1000e: fix division by zero on jumbo MTUs e1000: fix data race between tx_ring->next_to_clean ixgbe: Fix handling of NAPI budget when multiple queues are enabled per vector igb: fix NULL derefs due to skipped SR-IOV enabling igb: use the correct i210 register for EEMNGCTL igb: don't unmap NULL hw_addr i40e: Fix Rx hash reported to the stack by our driver i40e: clean whole mac filter list i40evf: check 
rings before freeing resources i40e: don't add zero MAC filter i40e: properly delete VF MAC filters i40e: Fix memory leaks, sideband filter programming i40e: fix: do not sleep in netdev_ops i40e/i40evf: Fix RS bit update in Tx path and disable force WB workaround i40evf: handle many MAC filters correctly i40e: Workaround fix for mss < 256 issue UPSTREAM: audit: fix a double fetch in audit_log_single_execve_arg() UPSTREAM: ARM: 8494/1: mm: Enable PXN when running non-LPAE kernel on LPAE processor FIXUP: sched/tune: update accouting before CPU capacity FIXUP: sched/tune: add fixes missing from a previous patch arm: Fix #if/#ifdef typo in topology.c arm: Fix build error "conflicting types for 'scale_cpu_capacity'" sched/walt: use do_div instead of division operator DEBUG: cpufreq: fix cpu_capacity tracing build for non-smp systems sched/walt: include missing header for arm_timer_read_counter() cpufreq: Kconfig: Fixup incorrect selection by CPU_FREQ_DEFAULT_GOV_SCHED sched/fair: Avoid redundant idle_cpu() call in update_sg_lb_stats() FIXUP: sched: scheduler-driven cpu frequency selection sched/rt: Add Kconfig option to enable panicking for RT throttling sched/rt: print RT tasks when RT throttling is activated UPSTREAM: sched: Fix a race between __kthread_bind() and sched_setaffinity() sched/fair: Favor higher cpus only for boosted tasks vmstat: make vmstat_updater deferrable again and shut down on idle sched/fair: call OPP update when going idle after migration sched/cpufreq_sched: fix thermal capping events sched/fair: Picking cpus with low OPPs for tasks that prefer idle CPUs FIXUP: sched/tune: do initialization as a postcore_initicall DEBUG: sched: add tracepoint for RD overutilized sched/tune: Introducing a new schedtune attribute prefer_idle sched: use util instead of capacity to select busy cpu arch_timer: add error handling when the MPM global timer is cleared FIXUP: sched: Fix double-release of spinlock in move_queued_task FIXUP: sched/fair: Fix hang during 
suspend in sched_group_energy FIXUP: sched: fix SchedFreq integration for both PELT and WALT sched: EAS: Avoid causing spikes to max-freq unnecessarily FIXUP: sched: fix set_cfs_cpu_capacity when WALT is in use sched/walt: Accounting for number of irqs pending on each core sched: Introduce Window Assisted Load Tracking (WALT) sched/tune: fix PB and PC cuts indexes definition sched/fair: optimize idle cpu selection for boosted tasks FIXUP: sched/tune: fix accounting for runnable tasks sched/tune: use a single initialisation function sched/{fair,tune}: simplify fair.c code FIXUP: sched/tune: fix payoff calculation for boost region sched/tune: Add support for negative boost values FIX: sched/tune: move schedtune_nornalize_energy into fair.c FIX: sched/tune: update usage of boosted task utilisation on CPU selection sched/fair: add tunable to set initial task load sched/fair: add tunable to force selection at cpu granularity sched: EAS: take cstate into account when selecting idle core sched/cpufreq_sched: Consolidated update FIXUP: sched: fix build for non-SMP target DEBUG: sched/tune: add tracepoint on P-E space filtering DEBUG: sched/tune: add tracepoint for energy_diff() values DEBUG: sched/tune: add tracepoint for task boost signal arm: topology: Define TC2 energy and provide it to the scheduler CHROMIUM: sched: update the average of nr_running DEBUG: schedtune: add tracepoint for schedtune_tasks_update() values DEBUG: schedtune: add tracepoint for CPU boost signal DEBUG: schedtune: add tracepoint for SchedTune configuration update DEBUG: sched: add energy procfs interface DEBUG: sched,cpufreq: add cpu_capacity change tracepoint DEBUG: sched: add tracepoint for CPU load/util signals DEBUG: sched: add tracepoint for task load/util signals DEBUG: sched: add tracepoint for cpu/freq scale invariance sched/fair: filter energy_diff() based on energy_payoff value sched/tune: add support to compute normalized energy sched/fair: keep track of energy/capacity variations 
sched/fair: add boosted task utilization sched/{fair,tune}: track RUNNABLE tasks impact on per CPU boost value sched/tune: compute and keep track of per CPU boost value sched/tune: add initial support for CGroups based boosting sched/fair: add boosted CPU usage sched/fair: add function to convert boost value into "margin" sched/tune: add sysctl interface to define a boost value sched/tune: add detailed documentation fixup! sched/fair: jump to max OPP when crossing UP threshold fixup! sched: scheduler-driven cpu frequency selection sched: rt scheduler sets capacity requirement sched: deadline: use deadline bandwidth in scale_rt_capacity sched: remove call of sched_avg_update from sched_rt_avg_update sched/cpufreq_sched: add trace events sched/fair: jump to max OPP when crossing UP threshold sched/fair: cpufreq_sched triggers for load balancing sched/{core,fair}: trigger OPP change request on fork() sched/fair: add triggers for OPP change requests sched: scheduler-driven cpu frequency selection cpufreq: introduce cpufreq_driver_is_slow sched: Consider misfit tasks when load-balancing sched: Add group_misfit_task load-balance type sched: Add per-cpu max capacity to sched_group_capacity sched: Do eas idle balance regardless of the rq avg idle value arm64: Enable max freq invariant scheduler load-tracking and capacity support arm: Enable max freq invariant scheduler load-tracking and capacity support sched: Update max cpu capacity in case of max frequency constraints cpufreq: Max freq invariant scheduler load-tracking and cpu capacity support arm64, topology: Updates to use DT bindings for EAS costing data sched: Support for extracting EAS energy costs from DT Documentation: DT bindings for energy model cost data required by EAS sched: Disable energy-unfriendly nohz kicks sched: Consider a not over-utilized energy-aware system as balanced sched: Energy-aware wake-up task placement sched: Determine the current sched_group idle-state sched, cpuidle: Track cpuidle state 
index in the scheduler sched: Add over-utilization/tipping point indicator sched: Estimate energy impact of scheduling decisions sched: Extend sched_group_energy to test load-balancing decisions sched: Calculate energy consumption of sched_group sched: Highest energy aware balancing sched_domain level pointer sched: Relocated cpu_util() and change return type sched: Compute cpu capacity available at current frequency arm64: Cpu invariant scheduler load-tracking and capacity support arm: Cpu invariant scheduler load-tracking and capacity support sched: Introduce SD_SHARE_CAP_STATES sched_domain flag sched: Initialize energy data structures sched: Introduce energy data structures sched: Make energy awareness a sched feature sched: Documentation for scheduler energy cost model sched: Prevent unnecessary active balance of single task in sched group sched: Enable idle balance to pull single task towards cpu with higher capacity sched: Consider spare cpu capacity at task wake-up sched: Add cpu capacity awareness to wakeup balancing sched: Store system-wide maximum cpu capacity in root domain arm: Update arch_scale_cpu_capacity() to reflect change to define arm64: Enable frequency invariant scheduler load-tracking support arm: Enable frequency invariant scheduler load-tracking support cpufreq: Frequency invariant scheduler load-tracking support sched/fair: Fix new task's load avg removed from source CPU in wake_up_new_task() FROMLIST: pstore: drop pmsg bounce buffer UPSTREAM: usercopy: remove page-spanning test for now UPSTREAM: usercopy: force check_object_size() inline BACKPORT: usercopy: fold builtin_const check into inline function UPSTREAM: x86/uaccess: force copy__user() to be inlined UPSTREAM: HID: core: prevent out-of-bound readings Android: Fix build breakages. 
UPSTREAM: tty: Prevent ldisc drivers from re-using stale tty fields UPSTREAM: netfilter: nfnetlink: correctly validate length of batch messages cpuset: Make cpusets restore on hotplug UPSTREAM: mm/slub: support left redzone UPSTREAM: Make the hardened user-copy code depend on having a hardened allocator Android: MMC/UFS IO Latency Histograms. UPSTREAM: usercopy: fix overlap check for kernel text UPSTREAM: usercopy: avoid potentially undefined behavior in pointer math UPSTREAM: unsafe_[get\|put]_user: change interface to use a error target label BACKPORT: arm64: mm: fix location of _etext BACKPORT: ARM: 8583/1: mm: fix location of _etext BACKPORT: Don't show empty tag stats for unprivileged uids UPSTREAM: tcp: fix use after free in tcp_xmit_retransmit_queue() ANDROID: base-cfg: drop SECCOMP_FILTER config UPSTREAM: [media] xc2028: unlock on error in xc2028_set_config() UPSTREAM: [media] xc2028: avoid use after free ANDROID: base-cfg: enable SECCOMP config ANDROID: rcu_sync: Export rcu_sync_lockdep_assert RFC: FROMLIST: cgroup: reduce read locked section of cgroup_threadgroup_rwsem during fork RFC: FROMLIST: cgroup: avoid synchronize_sched() in __cgroup_procs_write() RFC: FROMLIST: locking/percpu-rwsem: Optimize readers and reduce global impact net: ipv6: Fix ping to link-local addresses. ipv6: fix endianness error in icmpv6_err ANDROID: dm: android-verity: Allow android-verity to be compiled as an independent module backporting: a brief introduce of backported feautures on 4.4 Linux 4.4.20 sysfs: correctly handle read offset on PREALLOC attrs hwmon: (iio_hwmon) fix memory leak in name attribute ALSA: line6: Fix POD sysfs attributes segfault ALSA: line6: Give up on the lock while URBs are released. ALSA: line6: Remove double line6_pcm_release() after failed acquire. 
ACPI / SRAT: fix SRAT parsing order with both LAPIC and X2APIC present ACPI / sysfs: fix error code in get_status() ACPI / drivers: replace acpi_probe_lock spinlock with mutex ACPI / drivers: fix typo in ACPI_DECLARE_PROBE_ENTRY macro staging: comedi: ni_mio_common: fix wrong insn_write handler staging: comedi: ni_mio_common: fix AO inttrig backwards compatibility staging: comedi: comedi_test: fix timer race conditions staging: comedi: daqboard2000: bug fix board type matching code USB: serial: option: add WeTelecom 0x6802 and 0x6803 products USB: serial: option: add WeTelecom WM-D200 USB: serial: mos7840: fix non-atomic allocation in write path USB: serial: mos7720: fix non-atomic allocation in write path USB: fix typo in wMaxPacketSize validation usb: chipidea: udc: don't touch DP when controller is in host mode USB: avoid left shift by -1 dmaengine: usb-dmac: check CHCR.DE bit in usb_dmac_isr_channel() crypto: qat - fix aes-xts key sizes crypto: nx - off by one bug in nx_of_update_msc() Input: i8042 - set up shared ps2_cmd_mutex for AUX ports Input: i8042 - break load dependency between atkbd/psmouse and i8042 Input: tegra-kbc - fix inverted reset logic btrfs: properly track when rescan worker is running btrfs: waiting on qgroup rescan should not always be interruptible fs/seq_file: fix out-of-bounds read gpio: Fix OF build problem on UM usb: renesas_usbhs: gadget: fix return value check in usbhs_mod_gadget_probe() megaraid_sas: Fix probing cards without io port mpt3sas: Fix resume on WarpDrive flash cards cdc-acm: fix wrong pipe type on rx interrupt xfers i2c: cros-ec-tunnel: Fix usage of cros_ec_cmd_xfer() mfd: cros_ec: Add cros_ec_cmd_xfer_status() helper aacraid: Check size values after double-fetch from user ARC: Elide redundant setup of DMA callbacks ARC: Call trace_hardirqs_on() before enabling irqs ARC: use correct offset in pt_regs for saving/restoring user mode r25 ARC: build: Better way to detect ISA compatible toolchain drm/i915: fix aliasing_ppgtt 
leak drm/amdgpu: record error code when ring test failed drm/amd/amdgpu: sdma resume fail during S4 on CI drm/amdgpu: skip TV/CV in display parsing drm/amdgpu: avoid a possible array overflow drm/amdgpu: fix amdgpu_move_blit on 32bit systems drm/amdgpu: Change GART offset to 64-bit iio: fix sched WARNING "do not call blocking ops when !TASK_RUNNING" sched/nohz: Fix affine unpinned timers mess sched/cputime: Fix NO_HZ_FULL getrusage() monotonicity regression of: fix reference counting in of_graph_get_endpoint_by_regs arm64: dts: rockchip: add reset saradc node for rk3368 SoCs mac80211: fix purging multicast PS buffer queue s390/dasd: fix hanging device after clear subchannel EDAC: Increment correct counter in edac_inc_ue_error() pinctrl/amd: Remove the default de-bounce time iommu/arm-smmu: Don't BUG() if we find aborting STEs with disable_bypass iommu/arm-smmu: Fix CMDQ error handling iommu/dma: Don't put uninitialised IOVA domains xhci: Make sure xhci handles USB_SPEED_SUPER_PLUS devices. USB: serial: ftdi_sio: add PIDs for Ivium Technologies devices USB: serial: ftdi_sio: add device ID for WICED USB UART dev board USB: serial: option: add support for Telit LE920A4 USB: serial: option: add D-Link DWM-156/A3 USB: serial: fix memleak in driver-registration error path xhci: don't dereference a xhci member after removing xhci usb: xhci: Fix panic if disconnect xhci: always handle "Command Ring Stopped" events usb/gadget: fix gadgetfs aio support. 
usb: gadget: fsl_qe_udc: off by one in setup_received_handle() USB: validate wMaxPacketValue entries in endpoint descriptors usb: renesas_usbhs: Use dmac only if the pipe type is bulk usb: renesas_usbhs: clear the BRDYSTS in usbhsg_ep_enable() USB: hub: change the locking in hub_activate USB: hub: fix up early-exit pathway in hub_activate usb: hub: Fix unbalanced reference count/memory leak/deadlocks usb: define USB_SPEED_SUPER_PLUS speed for SuperSpeedPlus USB3.1 devices usb: dwc3: gadget: increment request->actual once usb: dwc3: pci: add Intel Kabylake PCI ID usb: misc: usbtest: add fix for driver hang usb: ehci: change order of register cleanup during shutdown crypto: caam - defer aead_set_sh_desc in case of zero authsize crypto: caam - fix echainiv(authenc) encrypt shared descriptor crypto: caam - fix non-hmac hashes genirq/msi: Make sure PCI MSIs are activated early genirq/msi: Remove unused MSI_FLAG_IDENTITY_MAP um: Don't discard .text.exit section ACPI / CPPC: Prevent cpc_desc_ptr points to the invalid data ACPI: CPPC: Return error if _CPC is invalid on a CPU mmc: sdhci-acpi: Reduce Baytrail eMMC/SD/SDIO hangs PCI: Limit config space size for Netronome NFP4000 PCI: Add Netronome NFP4000 PF device ID PCI: Limit config space size for Netronome NFP6000 family PCI: Add Netronome vendor and device IDs PCI: Support PCIe devices with short cfg_size NVMe: Don't unmap controller registers on reset ALSA: hda - Manage power well properly for resume libnvdimm, nd_blk: mask off reserved status bits perf intel-pt: Fix occasional decoding errors when tracing system-wide vfio/pci: Fix NULL pointer oops in error interrupt setup handling virtio: fix memory leak in virtqueue_add() parisc: Fix order of EREFUSED define in errno.h arm64: Define AT_VECTOR_SIZE_ARCH for ARCH_DLINFO ALSA: usb-audio: Add quirk for ELP HD USB Camera ALSA: usb-audio: Add a sample rate quirk for Creative Live! 
Cam Socialize HD (VF0610) powerpc/eeh: eeh_pci_enable(): fix checking of post-request state SUNRPC: allow for upcalls for same uid but different gss service SUNRPC: Handle EADDRNOTAVAIL on connection failures tools/testing/nvdimm: fix SIGTERM vs hotplug crash uprobes/x86: Fix RIP-relative handling of EVEX-encoded instructions x86/mm: Disable preemption during CR3 read+write hugetlb: fix nr_pmds accounting with shared page tables mm: SLUB hardened usercopy support mm: SLAB hardened usercopy support s390/uaccess: Enable hardened usercopy sparc/uaccess: Enable hardened usercopy powerpc/uaccess: Enable hardened usercopy ia64/uaccess: Enable hardened usercopy arm64/uaccess: Enable hardened usercopy ARM: uaccess: Enable hardened usercopy x86/uaccess: Enable hardened usercopy x86: remove more uaccess_32.h complexity x86: remove pointless uaccess_32.h complexity x86: fix SMAP in 32-bit environments Use the new batched user accesses in generic user string handling Add 'unsafe' user access functions for batched accesses x86: reorganize SMAP handling in user space accesses mm: Hardened usercopy mm: Implement stack frame object validation mm: Add is_migrate_cma_page Linux 4.4.19 Documentation/module-signing.txt: Note need for version info if reusing a key module: Invalidate signatures on force-loaded modules dm flakey: error READ bios during the down_interval rtc: s3c: Add s3c_rtc_{enable/disable}_clk in s3c_rtc_setfreq() lpfc: fix oops in lpfc_sli4_scmd_to_wqidx_distr() from lpfc_send_taskmgmt() ACPI / EC: Work around method reentrancy limit in ACPICA for _Qxx x86/platform/intel_mid_pci: Rework IRQ0 workaround PCI: Mark Atheros AR9485 and QCA9882 to avoid bus reset MIPS: hpet: Increase HPET_MIN_PROG_DELTA and decrease HPET_MIN_CYCLES MIPS: Don't register r4k sched clock when CPUFREQ enabled MIPS: mm: Fix definition of R6 cache instruction SUNRPC: Don't allocate a full sockaddr_storage for tracing Input: elan_i2c - properly wake up touchpad on ASUS laptops target: Fix ordered 
task CHECK_CONDITION early exception handling target: Fix max_unmap_lba_count calc overflow target: Fix race between iscsi-target connection shutdown + ABORT_TASK target: Fix missing complete during ABORT_TASK + CMD_T_FABRIC_STOP target: Fix ordered task target_setup_cmd_from_cdb exception hang iscsi-target: Fix panic when adding second TCP connection to iSCSI session ubi: Fix race condition between ubi device creation and udev ubi: Fix early logging ubi: Make volume resize power cut aware of: fix memory leak related to safe_name() IB/mlx4: Fix memory leak if QP creation failed IB/mlx4: Fix error flow when sending mads under SRIOV IB/mlx4: Fix the SQ size of an RC QP IB/IWPM: Fix a potential skb leak IB/IPoIB: Don't update neigh validity for unresolved entries IB/SA: Use correct free function IB/mlx5: Return PORT_ERR in Active to Initializing tranisition IB/mlx5: Fix post send fence logic IB/mlx5: Fix entries check in mlx5_ib_resize_cq IB/mlx5: Fix returned values of query QP IB/mlx5: Fix entries checks in mlx5_ib_create_cq IB/mlx5: Fix MODIFY_QP command input structure ALSA: hda - Fix headset mic detection problem for two dell machines ALSA: hda: add AMD Bonaire AZ PCI ID with proper driver caps ALSA: hda/realtek - Can't adjust speaker's volume on a Dell AIO ALSA: hda: Fix krealloc() with __GFP_ZERO usage mm/hugetlb: avoid soft lockup in set_max_huge_pages() mtd: nand: fix bug writing 1 byte less than page size block: fix bdi vs gendisk lifetime mismatch block: add missing group association in bio-cloning functions metag: Fix __cmpxchg_u32 asm constraint for CMP ftrace/recordmcount: Work around for addition of metag magic but not relocations balloon: check the number of available pages in leak balloon drm/i915/dp: Revert "drm/i915/dp: fall back to 18 bpp when sink capability is unknown" drm/i915: Never fully mask the the EI up rps interrupt on SNB/IVB drm/edid: Add 6 bpc quirk for display AEO model 0. 
drm: Restore double clflush on the last partial cacheline drm/nouveau/fbcon: fix font width not divisible by 8 drm/nouveau/gr/nv3x: fix instobj write offsets in gr setup drm/nouveau: check for supported chipset before booting fbdev off the hw drm/radeon: support backlight control for UNIPHY3 drm/radeon: fix firmware info version checks drm/radeon: Poll for both connect/disconnect on analog connectors drm/radeon: add a delay after ATPX dGPU power off drm/amdgpu/gmc7: add missing mullins case drm/amdgpu: fix firmware info version checks drm/amdgpu: Disable RPM helpers while reprobing connectors on resume drm/amdgpu: support backlight control for UNIPHY3 drm/amdgpu: Poll for both connect/disconnect on analog connectors drm/amdgpu: add a delay after ATPX dGPU power off w1:omap_hdq: fix regression netlabel: add address family checks to netlbl_{sock,req}_delattr() ARM: dts: sunxi: Add a startup delay for fixed regulator enabled phys audit: fix a double fetch in audit_log_single_execve_arg() iommu/amd: Update Alias-DTE in update_device_table() iommu/amd: Init unity mappings only for dma_ops domains iommu/amd: Handle IOMMU_DOMAIN_DMA in ops->domain_free call-back iommu/vt-d: Return error code in domain_context_mapping_one() iommu/exynos: Suppress unbinding to prevent system failure drm/i915: Don't complain about lack of ACPI video bios nfsd: don't return an unhashed lock stateid after taking mutex nfsd: Fix race between FREE_STATEID and LOCK nfs: don't create zero-length requests MIPS: KVM: Propagate kseg0/mapped tlb fault errors MIPS: KVM: Fix gfn range check in kseg0 tlb faults MIPS: KVM: Add missing gfn range check MIPS: KVM: Fix mapped fault broken commpage handling random: add interrupt callback to VMBus IRQ handler random: print a warning for the first ten uninitialized random users random: initialize the non-blocking pool via add_hwgenerator_randomness() CIFS: Fix a possible invalid memory access in smb2_query_symlink() cifs: fix crash due to race in hmac(md5) 
handling cifs: Check for existing directory when opening file with O_CREAT fs/cifs: make share unaccessible at root level mountable jbd2: make journal y2038 safe ARC: mm: don't loose PTE_SPECIAL in pte_modify() remoteproc: Fix potential race condition in rproc_add ovl: disallow overlayfs as upperdir HID: uhid: fix timeout when probe races with IO EDAC: Correct channel count limit Bluetooth: Fix l2cap_sock_setsockopt() with optname BT_RCVMTU spi: pxa2xx: Clear all RFT bits in reset_sccr1() on Intel Quark i2c: efm32: fix a failure path in efm32_i2c_probe() s5p-mfc: Add release callback for memory region devs s5p-mfc: Set device name for reserved memory region devs hp-wmi: Fix wifi cannot be hard-unblocked dm: set DMF_SUSPENDED* _before_ clearing DMF_NOFLUSH_SUSPENDING sur40: fix occasional oopses on device close sur40: lower poll interval to fix occasional FPS drops to ~56 FPS Fix RC5 decoding with Fintek CIR chipset vb2: core: Skip planes array verification if pb is NULL videobuf2-v4l2: Verify planes array in buffer dequeueing media: dvb_ringbuffer: Add memory barriers media: usbtv: prevent access to free'd resources mfd: qcom_rpm: Parametrize also ack selector size mfd: qcom_rpm: Fix offset error for msm8660 intel_pstate: Fix MSR_CONFIG_TDP_x addressing in core_get_max_pstate() s390/cio: allow to reset channel measurement block KVM: nVMX: Fix memory corruption when using VMCS shadowing KVM: VMX: handle PML full VMEXIT that occurs during event delivery KVM: MTRR: fix kvm_mtrr_check_gfn_range_consistency page fault KVM: PPC: Book3S HV: Save/restore TM state in H_CEDE KVM: PPC: Book3S HV: Pull out TM state save/restore into separate procedures arm64: mm: avoid fdt_check_header() before the FDT is fully mapped arm64: dts: rockchip: fixes the gic400 2nd region size for rk3368 pinctrl: cherryview: prevent concurrent access to GPIO controllers Bluetooth: hci_intel: Fix null gpio desc pointer dereference gpio: intel-mid: Remove potentially harmful code gpio: pca953x: Fix 
NBANK calculation for PCA9536 tty/serial: atmel: fix RS485 half duplex with DMA serial: samsung: Fix ERR pointer dereference on deferred probe tty: serial: msm: Don't read off end of tx fifo arm64: Fix incorrect per-cpu usage for boot CPU arm64: debug: unmask PSTATE.D earlier arm64: kernel: Save and restore UAO and addr_limit on exception entry USB: usbfs: fix potential infoleak in devio usb: renesas_usbhs: fix NULL pointer dereference in xfer_work() USB: serial: option: add support for Telit LE910 PID 0x1206 usb: dwc3: fix for the isoc transfer EP_BUSY flag usb: quirks: Add no-lpm quirk for Elan usb: renesas_usbhs: protect the CFIFOSEL setting in usbhsg_ep_enable() usb: f_fs: off by one bug in _ffs_func_bind() usb: gadget: avoid exposing kernel stack UPSTREAM: usb: gadget: configfs: add mutex lock before unregister gadget ANDROID: dm-verity: adopt changes made to dm callbacks UPSTREAM: ecryptfs: fix handling of directory opening ANDROID: net: core: fix UID-based routing ANDROID: net: fib: remove duplicate assignment FROMLIST: proc: Fix timerslack_ns CAP_SYS_NICE check when adjusting self ANDROID: dm verity fec: pack the fec_header structure ANDROID: dm: android-verity: Verify header before fetching table ANDROID: dm: allow adb disable-verity only in userdebug ANDROID: dm: mount as linear target if eng build ANDROID: dm: use default verity public key ANDROID: dm: fix signature verification flag ANDROID: dm: use name_to_dev_t ANDROID: dm: rename dm-linear methods for dm-android-verity ANDROID: dm: Minor cleanup ANDROID: dm: Mounting root as linear device when verity disabled ANDROID: dm-android-verity: Rebase on top of 4.1 ANDROID: dm: Add android verity target ANDROID: dm: fix dm_substitute_devices() ANDROID: dm: Rebase on top of 4.1 CHROMIUM: dm: boot time specification of dm= Implement memory_state_time, used by qcom,cpubw Revert "panic: Add board ID to panic output" usb: gadget: f_accessory: remove duplicate endpoint alloc BACKPORT: brcmfmac: defer DPC 
processing during probe FROMLIST: proc: Add LSM hook checks to /proc/<tid>/timerslack_ns FROMLIST: proc: Relax /proc/<tid>/timerslack_ns capability requirements UPSTREAM: ppp: defer netns reference release for ppp channel cpuset: Add allow_attach hook for cpusets on android. UPSTREAM: KEYS: Fix ASN.1 indefinite length object parsing ANDROID: sdcardfs: fix itnull.cocci warnings android-recommended.cfg: enable fstack-protector-strong Linux 4.4.18 mm: memcontrol: fix memcg id ref counter on swap charge move mm: memcontrol: fix swap counter leak on swapout from offline cgroup mm: memcontrol: fix cgroup creation failure after many small jobs ext4: fix reference counting bug on block allocation error ext4: short-cut orphan cleanup on error ext4: validate s_reserved_gdt_blocks on mount ext4: don't call ext4_should_journal_data() on the journal inode ext4: fix deadlock during page writeback ext4: check for extents that wrap around crypto: scatterwalk - Fix test in scatterwalk_done crypto: gcm - Filter out async ghash if necessary fs/dcache.c: avoid soft-lockup in dput() fuse: fix wrong assignment of ->flags in fuse_send_init() fuse: fuse_flush must check mapping->flags for errors fuse: fsync() did not return IO errors sysv, ipc: fix security-layer leaking block: fix use-after-free in seq file x86/syscalls/64: Add compat_sys_keyctl for 32-bit userspace drm/i915: Pretend cursor is always on for ILK-style WM calculations (v2) x86/mm/pat: Fix BUG_ON() in mmap_mem() on QEMU/i386 x86/pat: Document the PAT initialization sequence x86/xen, pat: Remove PAT table init code from Xen x86/mtrr: Fix PAT init handling when MTRR is disabled x86/mtrr: Fix Xorg crashes in Qemu sessions x86/mm/pat: Replace cpu_has_pat with boot_cpu_has() x86/mm/pat: Add pat_disable() interface x86/mm/pat: Add support of non-default PAT MSR setting devpts: clean up interface to pty drivers random: strengthen input validation for RNDADDTOENTCNT apparmor: fix ref count leak when profile sha1 hash is read Revert 
"s390/kdump: Clear subchannel ID to signal non-CCW/SCSI IPL" KEYS: 64-bit MIPS needs to use compat_sys_keyctl for 32-bit userspace arm: oabi compat: add missing access checks cdc_ncm: do not call usbnet_link_change from cdc_ncm_bind i2c: i801: Allow ACPI SystemIO OpRegion to conflict with PCI BAR x86/mm/32: Enable full randomization on i386 and X86_32 HID: sony: do not bail out when the sixaxis refuses the output report PNP: Add Broadwell to Intel MCH size workaround PNP: Add Haswell-ULT to Intel MCH size workaround scsi: ignore errors from scsi_dh_add_device() ipath: Restrict use of the write() interface tcp: consider recv buf for the initial window scale qed: Fix setting/clearing bit in completion bitmap net/irda: fix NULL pointer dereference on memory allocation failure net: bgmac: Fix infinite loop in bgmac_dma_tx_add() bonding: set carrier off for devices created through netlink ipv4: reject RTNH_F_DEAD and RTNH_F_LINKDOWN from user space tcp: enable per-socket rate limiting of all 'challenge acks' tcp: make challenge acks less predictable arm64: relocatable: suppress R_AARCH64_ABS64 relocations in vmlinux arm64: vmlinux.lds: make __rela_offset and __dynsym_offset ABSOLUTE Linux 4.4.17 vfs: fix deadlock in file_remove_privs() on overlayfs intel_th: Fix a deadlock in modprobing intel_th: pci: Add Kaby Lake PCH-H support net: mvneta: set real interrupt per packet for tx_done libceph: apply new_state before new_up_client on incrementals libata: LITE-ON CX1-JB256-HP needs lower max_sectors i2c: mux: reg: wrong condition checked for of_address_to_resource return value posix_cpu_timer: Exit early when process has been reaped media: fix airspy usb probe error path ipr: Clear interrupt on croc/crocodile when running with LSI SCSI: fix new bug in scsi_dev_info_list string matching RDS: fix rds_tcp_init() error path can: fix oops caused by wrong rtnl dellink usage can: fix handling of unmodifiable configuration options fix can: c_can: Update D_CAN TX and RX functions to 
32 bit - fix Altera Cyclone access can: at91_can: RX queue could get stuck at high bus load perf/x86: fix PEBS issues on Intel Atom/Core2 ovl: handle ATTR_KILL* sched/fair: Fix effective_load() to consistently use smoothed load mmc: block: fix packed command header endianness block: fix use-after-free in sys_ioprio_get() qeth: delete napi struct when removing a qeth device platform/chrome: cros_ec_dev - double fetch bug in ioctl clk: rockchip: initialize flags of clk_init_data in mmc-phase clock spi: sun4i: fix FIFO limit spi: sunxi: fix transfer timeout namespace: update event counter when umounting a deleted dentry 9p: use file_dentry() ext4: verify extent header depth ecryptfs: don't allow mmap when the lower fs doesn't support it Revert "ecryptfs: forbid opening files without mmap handler" locks: use file_inode() power_supply: power_supply_read_temp only if use_cnt > 0 cgroup: set css->id to -1 during init pinctrl: imx: Do not treat a PIN without MUX register as an error pinctrl: single: Fix missing flush of posted write for a wakeirq pvclock: Add CPU barriers to get correct version value Input: tsc200x - report proper input_dev name Input: xpad - validate USB endpoint count during probe Input: wacom_w8001 - w8001_MAX_LENGTH should be 13 Input: xpad - fix oops when attaching an unknown Xbox One gamepad Input: elantech - add more IC body types to the list Input: vmmouse - remove port reservation ALSA: timer: Fix leak in events via snd_timer_user_tinterrupt ALSA: timer: Fix leak in events via snd_timer_user_ccallback ALSA: timer: Fix leak in SNDRV_TIMER_IOCTL_PARAMS xenbus: don't bail early from xenbus_dev_request_and_reply() xenbus: don't BUG() on user mode induced condition xen/pciback: Fix conf_space read/write overlap check. ARC: unwind: ensure that .debug_frame is generated (vs. 
.eh_frame) arc: unwind: warn only once if DW2_UNWIND is disabled kernel/sysrq, watchdog, sched/core: Reset watchdog on all CPUs while processing sysrq-w pps: do not crash when failed to register vmlinux.lds: account for destructor sections mm, meminit: ensure node is online before checking whether pages are uninitialised mm, meminit: always return a valid node from early_pfn_to_nid mm, compaction: prevent VM_BUG_ON when terminating freeing scanner fs/nilfs2: fix potential underflow in call to crc32_le mm, compaction: abort free scanner if split fails mm, sl[au]b: add __GFP_ATOMIC to the GFP reclaim mask dmaengine: at_xdmac: double FIFO flush needed to compute residue dmaengine: at_xdmac: fix residue corruption dmaengine: at_xdmac: align descriptors on 64 bits x86/quirks: Add early quirk to reset Apple AirPort card x86/quirks: Reintroduce scanning of secondary buses x86/quirks: Apply nvidia_bugs quirk only on root bus USB: OHCI: Don't mark EDs as ED_OPER if scheduling fails Conflicts: arch/arm/kernel/topology.c arch/arm64/include/asm/arch_gicv3.h arch/arm64/kernel/topology.c block/bio.c drivers/cpufreq/Kconfig drivers/md/Makefile drivers/media/dvb-core/dvb_ringbuffer.c drivers/media/tuners/tuner-xc2028.c drivers/misc/Kconfig drivers/misc/Makefile drivers/mmc/core/host.c drivers/scsi/ufs/ufshcd.c drivers/scsi/ufs/ufshcd.h drivers/usb/dwc3/gadget.c drivers/usb/gadget/configfs.c fs/ecryptfs/file.c include/linux/mmc/core.h include/linux/mmc/host.h include/linux/mmzone.h include/linux/sched.h include/linux/sched/sysctl.h include/trace/events/power.h include/trace/events/sched.h init/Kconfig kernel/cpuset.c kernel/exit.c kernel/sched/Makefile kernel/sched/core.c kernel/sched/cputime.c kernel/sched/fair.c kernel/sched/features.h kernel/sched/rt.c kernel/sched/sched.h kernel/sched/stop_task.c kernel/sched/tune.c lib/Kconfig.debug mm/Makefile mm/vmstat.c Change-Id: I243a43231ca56a6362076fa6301827e1b0493be5 Signed-off-by: Runmin Wang <runminw@codeaurora.org> 2016-12-12 
15:32:39 -08:00			`#ifdef CONFIG_SCHED_HMP`
			`/*`
			`* HMP and EAS are orthogonal. Hopefully the compiler just elides out all code`
			`* with the energy_aware() check, so that we don't even pay the comparison`
			`* penalty at runtime.`
			`*/`
			`#define energy_aware() false`
			`#else`
			`static inline bool energy_aware(void)`
			`{`
			`return sched_feat(ENERGY_AWARE);`
			`}`
			`#endif`

sched: Make separate sched.c translation units Since once needs to do something at conferences and fixing compile warnings doesn't actually require much if any attention I decided to break up the sched.c #include ".c" fest. This further modularizes the scheduler code. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-x0fcd3mnp8f9c99grcpewmhi@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu> 2011-10-25 10:00:11 +02:00			`static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta)`
			`{`
sched: Make scale_rt invariant with frequency The average running time of RT tasks is used to estimate the remaining compute capacity for CFS tasks. This remaining capacity is the original capacity scaled down by a factor (aka scale_rt_capacity). This estimation of available capacity must also be invariant with frequency scaling. A frequency scaling factor is applied on the running time of the RT tasks for computing scale_rt_capacity. In sched_rt_avg_update(), we now scale the RT execution time like below: rq->rt_avg += rt_delta * arch_scale_freq_capacity() >> SCHED_CAPACITY_SHIFT Then, scale_rt_capacity can be summarized by: scale_rt_capacity = SCHED_CAPACITY_SCALE * available / total with available = total - rq->rt_avg This has been been optimized in current code by: scale_rt_capacity = available / (total >> SCHED_CAPACITY_SHIFT) But we can also developed the equation like below: scale_rt_capacity = SCHED_CAPACITY_SCALE - ((rq->rt_avg << SCHED_CAPACITY_SHIFT) / total) and we can optimize the equation by removing SCHED_CAPACITY_SHIFT shift in the computation of rq->rt_avg and scale_rt_capacity(). so rq->rt_avg += rt_delta * arch_scale_freq_capacity() and scale_rt_capacity = SCHED_CAPACITY_SCALE - (rq->rt_avg / total) arch_scale_frequency_capacity() will be called in the hot path of the scheduler which implies to have a short and efficient function. As an example, arch_scale_frequency_capacity() should return a cached value that is updated periodically outside of the hot path. 
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Morten.Rasmussen@arm.com Cc: dietmar.eggemann@arm.com Cc: efault@gmx.de Cc: kamalesh@linux.vnet.ibm.com Cc: linaro-kernel@lists.linaro.org Cc: nicolas.pitre@linaro.org Cc: preeti@linux.vnet.ibm.com Cc: riel@redhat.com Link: http://lkml.kernel.org/r/1425052454-25797-6-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org> 2015-02-27 16:54:08 +01:00			`rq->rt_avg += rt_delta * arch_scale_freq_capacity(NULL, cpu_of(rq));`
sched: Make separate sched.c translation units Since once needs to do something at conferences and fixing compile warnings doesn't actually require much if any attention I decided to break up the sched.c #include ".c" fest. This further modularizes the scheduler code. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-x0fcd3mnp8f9c99grcpewmhi@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu> 2011-10-25 10:00:11 +02:00			`}`
			`#else`
			`static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta) { }`
			`static inline void sched_avg_update(struct rq *rq) { }`
			`#endif`
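The commit message attached to `sched_rt_avg_update()` above gives the simplified arithmetic: `rq->rt_avg += rt_delta * arch_scale_freq_capacity()` and `scale_rt_capacity = SCHED_CAPACITY_SCALE - (rq->rt_avg / total)`. A minimal user-space sketch of that arithmetic follows. The `demo_*` names, the struct, and the explicit `freq_capacity` parameter are hypothetical stand-ins, not kernel API; the shift value of 10 and the clamp to 1 are assumptions of this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for this sketch: capacity is a 10-bit fixed-point scale. */
#define SCHED_CAPACITY_SHIFT 10
#define SCHED_CAPACITY_SCALE (1UL << SCHED_CAPACITY_SHIFT)

/* Hypothetical stand-in for the rt_avg field of struct rq. */
struct demo_rq {
	uint64_t rt_avg;
};

/*
 * Accumulate RT running time, weighted by the current frequency capacity
 * (0..SCHED_CAPACITY_SCALE). No right shift here, as in the commit message.
 */
static void demo_rt_avg_update(struct demo_rq *rq, uint64_t rt_delta,
			       uint64_t freq_capacity)
{
	rq->rt_avg += rt_delta * freq_capacity;
}

/*
 * Capacity left over for CFS: SCHED_CAPACITY_SCALE - (rt_avg / total).
 * The clamp to 1 is an assumption that keeps the result non-zero.
 */
static uint64_t demo_scale_rt_capacity(const struct demo_rq *rq, uint64_t total)
{
	uint64_t used;

	assert(total > 0);
	used = rq->rt_avg / total;
	if (used >= SCHED_CAPACITY_SCALE)
		return 1;
	return SCHED_CAPACITY_SCALE - used;
}
```

With 250 time units of RT execution at half frequency (512) over a window of 1000 units, `rt_avg` becomes 128000, `used` is 128, and the remaining capacity is 896 out of 1024.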

sched: Make dl_task_time() use task_rq_lock() Kirill reported that a dl task can be throttled and dequeued at the same time. This happens, when it becomes throttled in schedule(), which is called to go to sleep: current->state = TASK_INTERRUPTIBLE; schedule() deactivate_task() dequeue_task_dl() update_curr_dl() start_dl_timer() __dequeue_task_dl() prev->on_rq = 0; This invalidates the assumption from commit 0f397f2c90ce ("sched/dl: Fix race in dl_task_timer()"): "The only reason we don't strictly need ->pi_lock now is because we're guaranteed to have p->state == TASK_RUNNING here and are thus free of ttwu races". And therefore we have to use the full task_rq_lock() here. This further amends the fact that we forgot to update the rq lock loop for TASK_ON_RQ_MIGRATE, from commit cca26e8009d1 ("sched: Teach scheduler to understand TASK_ON_RQ_MIGRATING state"). Reported-by: Kirill Tkhai <ktkhai@parallels.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@arm.com> Link: http://lkml.kernel.org/r/20150217123139.GN5029@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org> 2015-02-17 13:22:25 +01:00			`/*`
			`* __task_rq_lock - lock the rq @p resides on.`
			`*/`
			`static inline struct rq *__task_rq_lock(struct task_struct *p)`
			`__acquires(rq->lock)`
			`{`
			`struct rq *rq;`

			`lockdep_assert_held(&p->pi_lock);`

			`for (;;) {`
			`rq = task_rq(p);`
			`raw_spin_lock(&rq->lock);`
sched,lockdep: Employ lock pinning Employ the new lockdep lock pinning annotation to ensure no 'accidental' lock-breaks happen with rq->lock. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: ktkhai@parallels.com Cc: rostedt@goodmis.org Cc: juri.lelli@gmail.com Cc: pang.xunlei@linaro.org Cc: oleg@redhat.com Cc: wanpeng.li@linux.intel.com Cc: umgwanakikbuti@gmail.com Link: http://lkml.kernel.org/r/20150611124744.003233193@infradead.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de> 2015-06-11 14:46:54 +02:00			`if (likely(rq == task_rq(p) && !task_on_rq_migrating(p))) {`
			`lockdep_pin_lock(&rq->lock);`
sched: Make dl_task_time() use task_rq_lock() Kirill reported that a dl task can be throttled and dequeued at the same time. This happens, when it becomes throttled in schedule(), which is called to go to sleep: current->state = TASK_INTERRUPTIBLE; schedule() deactivate_task() dequeue_task_dl() update_curr_dl() start_dl_timer() __dequeue_task_dl() prev->on_rq = 0; This invalidates the assumption from commit 0f397f2c90ce ("sched/dl: Fix race in dl_task_timer()"): "The only reason we don't strictly need ->pi_lock now is because we're guaranteed to have p->state == TASK_RUNNING here and are thus free of ttwu races". And therefore we have to use the full task_rq_lock() here. This further amends the fact that we forgot to update the rq lock loop for TASK_ON_RQ_MIGRATE, from commit cca26e8009d1 ("sched: Teach scheduler to understand TASK_ON_RQ_MIGRATING state"). Reported-by: Kirill Tkhai <ktkhai@parallels.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@arm.com> Link: http://lkml.kernel.org/r/20150217123139.GN5029@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org> 2015-02-17 13:22:25 +01:00			`return rq;`
sched,lockdep: Employ lock pinning Employ the new lockdep lock pinning annotation to ensure no 'accidental' lock-breaks happen with rq->lock. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: ktkhai@parallels.com Cc: rostedt@goodmis.org Cc: juri.lelli@gmail.com Cc: pang.xunlei@linaro.org Cc: oleg@redhat.com Cc: wanpeng.li@linux.intel.com Cc: umgwanakikbuti@gmail.com Link: http://lkml.kernel.org/r/20150611124744.003233193@infradead.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de> 2015-06-11 14:46:54 +02:00			`}`
sched: Make dl_task_time() use task_rq_lock() Kirill reported that a dl task can be throttled and dequeued at the same time. This happens, when it becomes throttled in schedule(), which is called to go to sleep: current->state = TASK_INTERRUPTIBLE; schedule() deactivate_task() dequeue_task_dl() update_curr_dl() start_dl_timer() __dequeue_task_dl() prev->on_rq = 0; This invalidates the assumption from commit 0f397f2c90ce ("sched/dl: Fix race in dl_task_timer()"): "The only reason we don't strictly need ->pi_lock now is because we're guaranteed to have p->state == TASK_RUNNING here and are thus free of ttwu races". And therefore we have to use the full task_rq_lock() here. This further amends the fact that we forgot to update the rq lock loop for TASK_ON_RQ_MIGRATE, from commit cca26e8009d1 ("sched: Teach scheduler to understand TASK_ON_RQ_MIGRATING state"). Reported-by: Kirill Tkhai <ktkhai@parallels.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@arm.com> Link: http://lkml.kernel.org/r/20150217123139.GN5029@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org> 2015-02-17 13:22:25 +01:00			`raw_spin_unlock(&rq->lock);`

			`while (unlikely(task_on_rq_migrating(p)))`
			`cpu_relax();`
			`}`
			`}`

			`/*`
			`* task_rq_lock - lock p->pi_lock and lock the rq @p resides on.`
			`*/`
			`static inline struct rq *task_rq_lock(struct task_struct *p, unsigned long *flags)`
			`__acquires(p->pi_lock)`
			`__acquires(rq->lock)`
			`{`
			`struct rq *rq;`

			`for (;;) {`
			`raw_spin_lock_irqsave(&p->pi_lock, *flags);`
			`rq = task_rq(p);`
			`raw_spin_lock(&rq->lock);`
			`/*`
			`* move_queued_task() task_rq_lock()`
			`*`
			`* ACQUIRE (rq->lock)`
			`* [S] ->on_rq = MIGRATING [L] rq = task_rq()`
			`* WMB (__set_task_cpu()) ACQUIRE (rq->lock);`
			`* [S] ->cpu = new_cpu [L] task_rq()`
			`* [L] ->on_rq`
			`* RELEASE (rq->lock)`
			`*`
			`* If we observe the old cpu in task_rq_lock, the acquire of`
			`* the old rq->lock will fully serialize against the stores.`
			`*`
			`* If we observe the new cpu in task_rq_lock, the acquire will`
			`* pair with the WMB to ensure we must then also see migrating.`
			`*/`
sched,lockdep: Employ lock pinning Employ the new lockdep lock pinning annotation to ensure no 'accidental' lock-breaks happen with rq->lock. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: ktkhai@parallels.com Cc: rostedt@goodmis.org Cc: juri.lelli@gmail.com Cc: pang.xunlei@linaro.org Cc: oleg@redhat.com Cc: wanpeng.li@linux.intel.com Cc: umgwanakikbuti@gmail.com Link: http://lkml.kernel.org/r/20150611124744.003233193@infradead.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de> 2015-06-11 14:46:54 +02:00			`if (likely(rq == task_rq(p) && !task_on_rq_migrating(p))) {`
			`lockdep_pin_lock(&rq->lock);`
sched: Make dl_task_time() use task_rq_lock() Kirill reported that a dl task can be throttled and dequeued at the same time. This happens, when it becomes throttled in schedule(), which is called to go to sleep: current->state = TASK_INTERRUPTIBLE; schedule() deactivate_task() dequeue_task_dl() update_curr_dl() start_dl_timer() __dequeue_task_dl() prev->on_rq = 0; This invalidates the assumption from commit 0f397f2c90ce ("sched/dl: Fix race in dl_task_timer()"): "The only reason we don't strictly need ->pi_lock now is because we're guaranteed to have p->state == TASK_RUNNING here and are thus free of ttwu races". And therefore we have to use the full task_rq_lock() here. This further amends the fact that we forgot to update the rq lock loop for TASK_ON_RQ_MIGRATE, from commit cca26e8009d1 ("sched: Teach scheduler to understand TASK_ON_RQ_MIGRATING state"). Reported-by: Kirill Tkhai <ktkhai@parallels.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@arm.com> Link: http://lkml.kernel.org/r/20150217123139.GN5029@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org> 2015-02-17 13:22:25 +01:00			`return rq;`
sched,lockdep: Employ lock pinning Employ the new lockdep lock pinning annotation to ensure no 'accidental' lock-breaks happen with rq->lock. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: ktkhai@parallels.com Cc: rostedt@goodmis.org Cc: juri.lelli@gmail.com Cc: pang.xunlei@linaro.org Cc: oleg@redhat.com Cc: wanpeng.li@linux.intel.com Cc: umgwanakikbuti@gmail.com Link: http://lkml.kernel.org/r/20150611124744.003233193@infradead.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de> 2015-06-11 14:46:54 +02:00			`}`
sched: Make dl_task_time() use task_rq_lock() Kirill reported that a dl task can be throttled and dequeued at the same time. This happens, when it becomes throttled in schedule(), which is called to go to sleep: current->state = TASK_INTERRUPTIBLE; schedule() deactivate_task() dequeue_task_dl() update_curr_dl() start_dl_timer() __dequeue_task_dl() prev->on_rq = 0; This invalidates the assumption from commit 0f397f2c90ce ("sched/dl: Fix race in dl_task_timer()"): "The only reason we don't strictly need ->pi_lock now is because we're guaranteed to have p->state == TASK_RUNNING here and are thus free of ttwu races". And therefore we have to use the full task_rq_lock() here. This further amends the fact that we forgot to update the rq lock loop for TASK_ON_RQ_MIGRATE, from commit cca26e8009d1 ("sched: Teach scheduler to understand TASK_ON_RQ_MIGRATING state"). Reported-by: Kirill Tkhai <ktkhai@parallels.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@arm.com> Link: http://lkml.kernel.org/r/20150217123139.GN5029@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org> 2015-02-17 13:22:25 +01:00			`raw_spin_unlock(&rq->lock);`
			`raw_spin_unlock_irqrestore(&p->pi_lock, *flags);`

			`while (unlikely(task_on_rq_migrating(p)))`
			`cpu_relax();`
			`}`
			`}`
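Both lock helpers above follow the same shape: read which runqueue the task is on, lock it, then re-check that the task is still there and is not migrating, retrying otherwise. A user-space sketch of that lock-and-recheck loop follows, using C11 atomics in place of the kernel primitives. Every `demo_*` name is hypothetical, the spinlock is a plain `atomic_flag`, and the default sequentially consistent atomics are stronger than the barriers described in the kernel comment, so this illustrates only the control flow.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define DEMO_NR_CPUS 2

/* Hypothetical stand-ins for struct rq and struct task_struct. */
struct demo_rq {
	atomic_flag lock;
};

struct demo_task {
	atomic_int cpu;		/* index of the runqueue the task sits on */
	atomic_bool migrating;	/* set while the task moves between runqueues */
};

static struct demo_rq demo_rqs[DEMO_NR_CPUS] = {
	{ ATOMIC_FLAG_INIT }, { ATOMIC_FLAG_INIT }
};

static void demo_lock(struct demo_rq *rq)
{
	while (atomic_flag_test_and_set_explicit(&rq->lock, memory_order_acquire))
		;	/* spin until the flag was previously clear */
}

static void demo_unlock(struct demo_rq *rq)
{
	atomic_flag_clear_explicit(&rq->lock, memory_order_release);
}

static struct demo_rq *demo_task_rq(struct demo_task *p)
{
	int cpu = atomic_load(&p->cpu);

	assert(cpu >= 0 && cpu < DEMO_NR_CPUS);
	return &demo_rqs[cpu];
}

/* Lock the runqueue the task is on, retrying if it moved in the meantime. */
static struct demo_rq *demo_task_rq_lock(struct demo_task *p)
{
	for (;;) {
		struct demo_rq *rq = demo_task_rq(p);

		demo_lock(rq);
		/* Re-check under the lock: same runqueue and not mid-migration. */
		if (rq == demo_task_rq(p) && !atomic_load(&p->migrating))
			return rq;
		demo_unlock(rq);

		/* Wait outside the lock until the migration finishes. */
		while (atomic_load(&p->migrating))
			;
	}
}
```

The re-check after taking the lock is what makes the first, unlocked read of the task's runqueue safe to act on: a stale read simply costs one more trip around the loop.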

			`static inline void __task_rq_unlock(struct rq *rq)`
			`__releases(rq->lock)`
			`{`
sched,lockdep: Employ lock pinning Employ the new lockdep lock pinning annotation to ensure no 'accidental' lock-breaks happen with rq->lock. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: ktkhai@parallels.com Cc: rostedt@goodmis.org Cc: juri.lelli@gmail.com Cc: pang.xunlei@linaro.org Cc: oleg@redhat.com Cc: wanpeng.li@linux.intel.com Cc: umgwanakikbuti@gmail.com Link: http://lkml.kernel.org/r/20150611124744.003233193@infradead.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de> 2015-06-11 14:46:54 +02:00			`lockdep_unpin_lock(&rq->lock);`
sched: Make dl_task_time() use task_rq_lock() Kirill reported that a dl task can be throttled and dequeued at the same time. This happens, when it becomes throttled in schedule(), which is called to go to sleep: current->state = TASK_INTERRUPTIBLE; schedule() deactivate_task() dequeue_task_dl() update_curr_dl() start_dl_timer() __dequeue_task_dl() prev->on_rq = 0; This invalidates the assumption from commit 0f397f2c90ce ("sched/dl: Fix race in dl_task_timer()"): "The only reason we don't strictly need ->pi_lock now is because we're guaranteed to have p->state == TASK_RUNNING here and are thus free of ttwu races". And therefore we have to use the full task_rq_lock() here. This further amends the fact that we forgot to update the rq lock loop for TASK_ON_RQ_MIGRATE, from commit cca26e8009d1 ("sched: Teach scheduler to understand TASK_ON_RQ_MIGRATING state"). Reported-by: Kirill Tkhai <ktkhai@parallels.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@arm.com> Link: http://lkml.kernel.org/r/20150217123139.GN5029@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org> 2015-02-17 13:22:25 +01:00			`raw_spin_unlock(&rq->lock);`
			`}`

			`static inline void`
			`task_rq_unlock(struct rq *rq, struct task_struct *p, unsigned long *flags)`
			`__releases(rq->lock)`
			`__releases(p->pi_lock)`
			`{`
sched,lockdep: Employ lock pinning Employ the new lockdep lock pinning annotation to ensure no 'accidental' lock-breaks happen with rq->lock. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: ktkhai@parallels.com Cc: rostedt@goodmis.org Cc: juri.lelli@gmail.com Cc: pang.xunlei@linaro.org Cc: oleg@redhat.com Cc: wanpeng.li@linux.intel.com Cc: umgwanakikbuti@gmail.com Link: http://lkml.kernel.org/r/20150611124744.003233193@infradead.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de> 2015-06-11 14:46:54 +02:00			`lockdep_unpin_lock(&rq->lock);`
sched: Make dl_task_time() use task_rq_lock() Kirill reported that a dl task can be throttled and dequeued at the same time. This happens, when it becomes throttled in schedule(), which is called to go to sleep: current->state = TASK_INTERRUPTIBLE; schedule() deactivate_task() dequeue_task_dl() update_curr_dl() start_dl_timer() __dequeue_task_dl() prev->on_rq = 0; This invalidates the assumption from commit 0f397f2c90ce ("sched/dl: Fix race in dl_task_timer()"): "The only reason we don't strictly need ->pi_lock now is because we're guaranteed to have p->state == TASK_RUNNING here and are thus free of ttwu races". And therefore we have to use the full task_rq_lock() here. This further amends the fact that we forgot to update the rq lock loop for TASK_ON_RQ_MIGRATE, from commit cca26e8009d1 ("sched: Teach scheduler to understand TASK_ON_RQ_MIGRATING state"). Reported-by: Kirill Tkhai <ktkhai@parallels.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@arm.com> Link: http://lkml.kernel.org/r/20150217123139.GN5029@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org> 2015-02-17 13:22:25 +01:00			`raw_spin_unlock(&rq->lock);`
			`raw_spin_unlock_irqrestore(&p->pi_lock, *flags);`
			`}`

FIXUP: sched/tune: fix accounting for runnable tasks Contains: sched/tune: fix accounting for runnable tasks (1/5) The accounting for tasks into boost groups of different CPUs is currently broken mainly because: a) we do not properly track the change of boost group of a RUNNABLE task b) there are race conditions between migration code and accounting code This patch provides a fixes to ensure enqueue/dequeue accounting also for throttled tasks. Without this patch is can happen that a task is enqueued into a throttled RQ thus not being accounted for the boosting of the corresponding RQ. We could argue that a throttled task should not boost a CPU, however: a) properly implementing CPU boosting considering throttled tasks will increase a lot the complexity of the solution b) it's not easy to quantify the benefits introduced by such a more complex solution Since task throttling requires the usage of the CFS bandwidth controller, which is not widely used on mobile systems (at least not by Android kernels so far), for the time being we go for the simple solution and boost also for throttled RQs. sched/tune: fix accounting for runnable tasks (2/5) This patch provides the code required to enforce proper locking. A per boost group spinlock has been added to grant atomic accounting of tasks as well as to serialise enqueue/dequeue operations, triggered by tasks migrations, with cgroups's attach/detach operations. sched/tune: fix accounting for runnable tasks (3/5) This patch adds cgroups {allow,can,cancel}_attach callbacks. Since a task can be migrated between boost groups while it's running, the CGroups's attach callbacks have been added to properly migrate boost contributions of RUNNABLE tasks. The RQ's lock is used to serialise enqueue/dequeue operations, triggered by tasks migrations, with cgroups's attach/detach operations. While the SchedTune's CPU lock is used to grant atrocity of the accounting within the CPU. 
NOTE: the current implementation does not allows a concurrent CPU migration and CGroups change. sched/tune: fix accounting for runnable tasks (4/5) This fixes accounting for exiting tasks by adding a dedicated call early in the do_exit() syscall, which disables SchedTune accounting as soon as a task is flagged PF_EXITING. This flag is set before the multiple dequeue/enqueue dance triggered by cgroup_exit() which is useful only to inject useless tasks movements thus increasing possibilities for race conditions with the migration code. The schedtune_exit_task() call does the last dequeue of a task from its current boost group. This is a solution more aligned with what happens in mainline kernels (>v4.4) where the exit_cgroup does not move anymore a dying task to the root control group. sched/tune: fix accounting for runnable tasks (5/5) To avoid accounting issues at startup, this patch disable the SchedTune accounting until the required data structures have been properly initialized. Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com> [jstultz: fwdported to 4.4] Signed-off-by: John Stultz <john.stultz@linaro.org> 2016-07-28 18:44:40 +01:00			`extern struct rq *lock_rq_of(struct task_struct *p, unsigned long *flags);`
			`extern void unlock_rq_of(struct rq *rq, struct task_struct *p, unsigned long *flags);`

			`#ifdef CONFIG_SMP`
			`#ifdef CONFIG_PREEMPT`

			`static inline void double_rq_lock(struct rq *rq1, struct rq *rq2);`

			`/*`
			`* fair double_lock_balance: Safely acquires both rq->locks in a fair`
			`* way at the expense of forcing extra atomic operations in all`
			`* invocations. This assures that the double_lock is acquired using the`
			`* same underlying policy as the spinlock_t on this architecture, which`
			`* reduces latency compared to the unfair variant below. However, it`
			`* also adds more overhead and therefore may reduce throughput.`
			`*/`
			`static inline int _double_lock_balance(struct rq *this_rq, struct rq *busiest)`
			`__releases(this_rq->lock)`
			`__acquires(busiest->lock)`
			`__acquires(this_rq->lock)`
			`{`
			`raw_spin_unlock(&this_rq->lock);`
			`double_rq_lock(this_rq, busiest);`

			`return 1;`
			`}`

			`#else`
			`/*`
			`* Unfair double_lock_balance: Optimizes throughput at the expense of`
			`* latency by eliminating extra atomic operations when the locks are`
			`* already in proper order on entry. This favors lower cpu-ids and will`
			`* grant the double lock to lower cpus over higher ids under contention,`
			`* regardless of entry order into the function.`
			`*/`
			`static inline int _double_lock_balance(struct rq *this_rq, struct rq *busiest)`
			`__releases(this_rq->lock)`
			`__acquires(busiest->lock)`
			`__acquires(this_rq->lock)`
			`{`
			`int ret = 0;`

			`if (unlikely(!raw_spin_trylock(&busiest->lock))) {`
			`if (busiest < this_rq) {`
			`raw_spin_unlock(&this_rq->lock);`
			`raw_spin_lock(&busiest->lock);`
			`raw_spin_lock_nested(&this_rq->lock,`
			`SINGLE_DEPTH_NESTING);`
			`ret = 1;`
			`} else`
			`raw_spin_lock_nested(&busiest->lock,`
			`SINGLE_DEPTH_NESTING);`
			`}`
			`return ret;`
			`}`

			`#endif /* CONFIG_PREEMPT */`

			`/*`
			`* double_lock_balance - lock the busiest runqueue, this_rq is locked already.`
			`*/`
			`static inline int double_lock_balance(struct rq *this_rq, struct rq *busiest)`
			`{`
			`if (unlikely(!irqs_disabled())) {`
			`/* printk() doesn't work well under rq->lock */`
			`raw_spin_unlock(&this_rq->lock);`
			`BUG_ON(1);`
			`}`

			`return _double_lock_balance(this_rq, busiest);`
			`}`
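
The unfair variant above tries the second lock opportunistically and only backs off when the two locks would be taken in the wrong order. A minimal userspace sketch of that idea follows, assuming POSIX mutexes in place of raw spinlocks; `lock_second` and its parameter names are hypothetical and not part of the kernel header.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/*
 * Hypothetical userspace analogue of the unfair _double_lock_balance().
 * Precondition: the caller already holds 'held' and also wants 'wanted'.
 * Returns 1 if 'held' was dropped and re-taken (anything it protects may
 * have changed in between), 0 if it stayed locked the whole time.
 */
int lock_second(pthread_mutex_t *held, pthread_mutex_t *wanted)
{
	assert(held && wanted && held != wanted);

	/* Fast path: no contention, no extra unlock/lock pair needed. */
	if (pthread_mutex_trylock(wanted) == 0)
		return 0;

	if ((uintptr_t)wanted < (uintptr_t)held) {
		/* Wrong order: back off, then lock in ascending address order. */
		pthread_mutex_unlock(held);
		pthread_mutex_lock(wanted);
		pthread_mutex_lock(held);
		return 1;
	}

	/* Already in ascending order: blocking on 'wanted' cannot deadlock. */
	pthread_mutex_lock(wanted);
	return 0;
}
```

A caller that sees a return value of 1 should re-validate whatever it read under `held` before the call, which is why the kernel function reports the same condition through its return value.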

			`static inline void double_unlock_balance(struct rq *this_rq, struct rq *busiest)`
			`__releases(busiest->lock)`
			`{`
FIXUP: sched: Fix double-release of spinlock in move_queued_task BUG: 29519455 Change-Id: I4d1c27a1b4bcbba03d4b175d170cfe1701a90ffd 2016-07-04 15:04:45 +01:00			`if (this_rq != busiest)`
			`raw_spin_unlock(&busiest->lock);`
			`lock_set_subclass(&this_rq->lock.dep_map, 0, _RET_IP_);`
			`}`

sched: Fix race in migrate_swap_stop() There is a subtle race in migrate_swap, when task P, on CPU A, decides to swap places with task T, on CPU B. Task P: - call migrate_swap Task T: - go to sleep, removing itself from the runqueue Task P: - double lock the runqueues on CPU A & B Task T: - get woken up, place itself on the runqueue of CPU C Task P: - see that task T is on a runqueue, and pretend to remove it from the runqueue on CPU B Now CPUs B & C both have corrupted scheduler data structures. This patch fixes it, by holding the pi_lock for both of the tasks involved in the migrate swap. This prevents task T from waking up, and placing itself onto another runqueue, until after migrate_swap has released all locks. This means that, when migrate_swap checks, task T will be either on the runqueue where it was originally seen, or not on any runqueue at all. Migrate_swap deals correctly with both of those cases. Tested-by: Joe Mario <jmario@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: hannes@cmpxchg.org Cc: aarcange@redhat.com Cc: srikar@linux.vnet.ibm.com Cc: tglx@linutronix.de Cc: hpa@zytor.com Link: http://lkml.kernel.org/r/20131010181722.GO13848@laptop.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org> 2013-10-10 20:17:22 +02:00			`static inline void double_lock(spinlock_t *l1, spinlock_t *l2)`
			`{`
			`if (l1 > l2)`
			`swap(l1, l2);`

			`spin_lock(l1);`
			`spin_lock_nested(l2, SINGLE_DEPTH_NESTING);`
			`}`
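
The `double_lock()` helper avoids ABBA deadlocks by always taking the lower-addressed lock first, so every caller agrees on one global order. The sketch below illustrates the same ordering rule in userspace, assuming POSIX mutexes; the function names are hypothetical, and the `a == b` case is an addition borrowed from how `double_rq_lock()` treats identical runqueues.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/*
 * Hypothetical userspace analogue of double_lock(): acquire two mutexes
 * in ascending address order so that two threads locking the same pair
 * in opposite argument order cannot deadlock each other.
 */
void ordered_double_lock(pthread_mutex_t *a, pthread_mutex_t *b)
{
	assert(a && b);

	if (a == b) {			/* same lock twice: take it once */
		pthread_mutex_lock(a);
		return;
	}
	if ((uintptr_t)a > (uintptr_t)b) {	/* make 'a' the lower address */
		pthread_mutex_t *tmp = a;
		a = b;
		b = tmp;
	}
	pthread_mutex_lock(a);
	pthread_mutex_lock(b);
}

/* Release order does not matter for correctness; only acquisition does. */
void ordered_double_unlock(pthread_mutex_t *a, pthread_mutex_t *b)
{
	assert(a && b);

	pthread_mutex_unlock(a);
	if (a != b)
		pthread_mutex_unlock(b);
}
```

The kernel versions additionally pass `SINGLE_DEPTH_NESTING` for the second lock so that lockdep accepts two locks of the same class being held at once; plain mutexes have no equivalent of that annotation.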

sched/numa: Fix task_numa_free() lockdep splat Sasha reported that lockdep claims that the following commit: 156654f491dd ("sched/numa: Move task_numa_free() to __put_task_struct()") made numa_group.lock interrupt unsafe. While I don't see how that could be, given the commit in question moved task_numa_free() from one irq enabled region to another, the below does make both gripes and lockups upon gripe with numa=fake=4 go away. Reported-by: Sasha Levin <sasha.levin@oracle.com> Fixes: 156654f491dd ("sched/numa: Move task_numa_free() to __put_task_struct()") Signed-off-by: Mike Galbraith <bitbucket@online.de> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: torvalds@linux-foundation.org Cc: mgorman@suse.com Cc: akpm@linux-foundation.org Cc: Dave Jones <davej@redhat.com> Link: http://lkml.kernel.org/r/1396860915.5170.5.camel@marge.simpson.net Signed-off-by: Ingo Molnar <mingo@kernel.org> 2014-04-07 10:55:15 +02:00			`static inline void double_lock_irq(spinlock_t *l1, spinlock_t *l2)`
			`{`
			`if (l1 > l2)`
			`swap(l1, l2);`

			`spin_lock_irq(l1);`
			`spin_lock_nested(l2, SINGLE_DEPTH_NESTING);`
			`}`

			`static inline void double_raw_lock(raw_spinlock_t *l1, raw_spinlock_t *l2)`
			`{`
			`if (l1 > l2)`
			`swap(l1, l2);`

			`raw_spin_lock(l1);`
			`raw_spin_lock_nested(l2, SINGLE_DEPTH_NESTING);`
			`}`

			`/*`
			`* double_rq_lock - safely lock two runqueues`
			`*`
			`* Note this does not disable interrupts like task_rq_lock,`
			`* you need to do so manually before calling.`
			`*/`
			`static inline void double_rq_lock(struct rq *rq1, struct rq *rq2)`
			`__acquires(rq1->lock)`
			`__acquires(rq2->lock)`
			`{`
			`BUG_ON(!irqs_disabled());`
			`if (rq1 == rq2) {`
			`raw_spin_lock(&rq1->lock);`
			`__acquire(rq2->lock); /* Fake it out ;) */`
			`} else {`
			`if (rq1 < rq2) {`
			`raw_spin_lock(&rq1->lock);`
			`raw_spin_lock_nested(&rq2->lock, SINGLE_DEPTH_NESTING);`
			`} else {`
			`raw_spin_lock(&rq2->lock);`
			`raw_spin_lock_nested(&rq1->lock, SINGLE_DEPTH_NESTING);`
			`}`
			`}`
			`}`

			`/*`
			`* double_rq_unlock - safely unlock two runqueues`
			`*`
			`* Note this does not restore interrupts like task_rq_unlock,`
			`* you need to do so manually after calling.`
			`*/`
			`static inline void double_rq_unlock(struct rq *rq1, struct rq *rq2)`
			`__releases(rq1->lock)`
			`__releases(rq2->lock)`
			`{`
			`raw_spin_unlock(&rq1->lock);`
			`if (rq1 != rq2)`
			`raw_spin_unlock(&rq2->lock);`
			`else`
			`__release(rq2->lock);`
			`}`

sched: avoid scheduling RT threads on cores currently handling softirqs Bug: 31501544 Change-Id: I99dd7aaa12c11270b28dbabea484bcc8fb8ba0c1 Git-commit: 080ea011fd9f47315e1fc53185872ef813b59d00 Git-repo: https://android.googlesource.com/kernel/msm [pkondeti@codeaurora.org: resolved minor merge conflicts and fixed checkpatch warnings] Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org> 2016-09-15 08:52:27 -07:00			`/*`
			`* task_may_not_preempt - check whether a task may not be preemptible soon`
			`*/`
			`extern bool task_may_not_preempt(struct task_struct *task, int cpu);`

			`#else /* CONFIG_SMP */`

			`/*`
			`* double_rq_lock - safely lock two runqueues`
			`*`
			`* Note this does not disable interrupts like task_rq_lock,`
			`* you need to do so manually before calling.`
			`*/`
			`static inline void double_rq_lock(struct rq *rq1, struct rq *rq2)`
			`__acquires(rq1->lock)`
			`__acquires(rq2->lock)`
			`{`
			`BUG_ON(!irqs_disabled());`
			`BUG_ON(rq1 != rq2);`
			`raw_spin_lock(&rq1->lock);`
			`__acquire(rq2->lock); /* Fake it out ;) */`
			`}`

			`/*`
			`* double_rq_unlock - safely unlock two runqueues`
			`*`
			`* Note this does not restore interrupts like task_rq_unlock,`
			`* you need to do so manually after calling.`
			`*/`
			`static inline void double_rq_unlock(struct rq *rq1, struct rq *rq2)`
			`__releases(rq1->lock)`
			`__releases(rq2->lock)`
			`{`
			`BUG_ON(rq1 != rq2);`
			`raw_spin_unlock(&rq1->lock);`
			`__release(rq2->lock);`
			`}`

			`#endif`

			`extern struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq);`
			`extern struct sched_entity *__pick_last_entity(struct cfs_rq *cfs_rq);`
sched/debug: Move print_cfs_rq() declaration to kernel/sched/sched.h Currently print_cfs_rq() is declared in include/linux/sched.h. However it's not used outside kernel/sched. Hence move the declaration to kernel/sched/sched.h Also some functions are only available for CONFIG_SCHED_DEBUG=y. Hence move the declarations to within the #ifdef. Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Iulia Manda <iulia.manda21@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1435252903-1081-2-git-send-email-srikar@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> 2015-06-25 22:51:41 +05:30
			`#ifdef CONFIG_SCHED_DEBUG`
			`extern void print_cfs_stats(struct seq_file *m, int cpu);`
			`extern void print_rt_stats(struct seq_file *m, int cpu);`
sched/deadline: Add deadline rq status print This patch adds a deadline rq status print. Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@arm.com> Cc: Kirill Tkhai <ktkhai@parallels.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1414708776-124078-3-git-send-email-wanpeng.li@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org> 2014-10-31 06:39:33 +08:00			`extern void print_dl_stats(struct seq_file *m, int cpu);`
			`extern void`
			`print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq);`
sched/numa: Fix numa balancing stats in /proc/pid/sched Commit 44dba3d5d6a1 ("sched: Refactor task_struct to use numa_faults instead of numa_* pointers") modified the way tsk->numa_faults stats are accounted. However that commit never touched show_numa_stats() that is displayed in /proc/pid/sched and thus the numbers displayed in /proc/pid/sched don't match the actual numbers. Fix it by making sure that /proc/pid/sched reflects the task fault numbers. Also add group fault stats too. Also couple of more modifications are added here: 1. Format changes: - Previously we would list two entries per node, one for private and one for shared. Also the home node info was listed in each entry. - Now preferred node, total_faults and current node are displayed separately. - Now there is one entry per node, that lists private,shared task and group faults. 2. Unit changes: - p->numa_pages_migrated was getting reset after every read of /proc/pid/sched. It's more useful to have absolute numbers since differential migrations between two accesses can be more easily calculated. Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Iulia Manda <iulia.manda21@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1435252903-1081-4-git-send-email-srikar@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> 2015-06-25 22:51:43 +05:30
			`#ifdef CONFIG_NUMA_BALANCING`
			`extern void`
			`show_numa_stats(struct task_struct *p, struct seq_file *m);`
			`extern void`
			`print_numa_stats(struct seq_file *m, int node, unsigned long tsf,`
			`unsigned long tpf, unsigned long gsf, unsigned long gpf);`
			`#endif /* CONFIG_NUMA_BALANCING */`
			`#endif /* CONFIG_SCHED_DEBUG */`
			`extern void init_cfs_rq(struct cfs_rq *cfs_rq);`
sched/core: Remove unused argument from init_[rt\|dl]_rq() Obviously, 'rq' is not used in these two functions, therefore, there is no reason for it to be passed as an argument. Signed-off-by: Abel Vesa <abelvesa@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/1425383427-26244-1-git-send-email-abelvesa@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> 2015-03-03 13:50:27 +02:00			`extern void init_rt_rq(struct rt_rq *rt_rq);`
			`extern void init_dl_rq(struct dl_rq *dl_rq);`
sched: Fix race on toggling cfs_bandwidth_used When we transition cfs_bandwidth_used to false, any currently throttled groups will incorrectly return false from cfs_rq_throttled. While tg_set_cfs_bandwidth will unthrottle them eventually, currently running code (including at least dequeue_task_fair and distribute_cfs_runtime) will cause errors. Fix this by turning off cfs_bandwidth_used only after unthrottling all cfs_rqs. Tested: toggle bandwidth back and forth on a loaded cgroup. Caused crashes in minutes without the patch, hasn't crashed with it. Signed-off-by: Ben Segall <bsegall@google.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: pjt@google.com Link: http://lkml.kernel.org/r/20131016181611.22647.80365.stgit@sword-of-the-dawn.mtv.corp.google.com Signed-off-by: Ingo Molnar <mingo@kernel.org> 2013-10-16 11:16:12 -07:00			`extern void cfs_bandwidth_usage_inc(void);`
			`extern void cfs_bandwidth_usage_dec(void);`
sched, nohz: Introduce nohz_flags in 'struct rq' Introduce nohz_flags in the struct rq, which will track these two flags for now. NOHZ_TICK_STOPPED keeps track of the tick stopped status that gets set when the tick is stopped. It will be used to update the nohz idle load balancer data structures during the first busy tick after the tick is restarted. At this first busy tick after tickless idle, NOHZ_TICK_STOPPED flag will be reset. This will minimize the nohz idle load balancer status updates that currently happen for every tickless exit, making it more scalable when there are many logical cpu's that enter and exit idle often. NOHZ_BALANCE_KICK will track the need for nohz idle load balance on this rq. This will replace the nohz_balance_kick in the rq, which was not being updated atomically. Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20111202010832.499438999@sbsiddha-desk.sc.intel.com Signed-off-by: Ingo Molnar <mingo@elte.hu> 2011-12-01 17:07:32 -08:00
nohz: Rename CONFIG_NO_HZ to CONFIG_NO_HZ_COMMON We are planning to convert the dynticks Kconfig options layout into a choice menu. The user must be able to easily pick any of the following implementations: constant periodic tick, idle dynticks, full dynticks. As this implies a mutual exclusion, the two dynticks implementions need to converge on the selection of a common Kconfig option in order to ease the sharing of a common infrastructure. It would thus seem pretty natural to reuse CONFIG_NO_HZ to that end. It already implements all the idle dynticks code and the full dynticks depends on all that code for now. So ideally the choice menu would propose CONFIG_NO_HZ_IDLE and CONFIG_NO_HZ_EXTENDED then both would select CONFIG_NO_HZ. On the other hand we want to stay backward compatible: if CONFIG_NO_HZ is set in an older config file, we want to enable CONFIG_NO_HZ_IDLE by default. But we can't afford both at the same time or we run into a circular dependency: 1) CONFIG_NO_HZ_IDLE and CONFIG_NO_HZ_EXTENDED both select CONFIG_NO_HZ 2) If CONFIG_NO_HZ is set, we default to CONFIG_NO_HZ_IDLE We might be able to support that from Kconfig/Kbuild but it may not be wise to introduce such a confusing behaviour. So to solve this, create a new CONFIG_NO_HZ_COMMON option which gathers the common code between idle and full dynticks (that common code for now is simply the idle dynticks code) and select it from their referring Kconfig. Then we'll later create CONFIG_NO_HZ_IDLE and map CONFIG_NO_HZ to it for backward compatibility. 
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Christoph Lameter <cl@linux.com> Cc: Geoff Levand <geoff@infradead.org> Cc: Gilad Ben Yossef <gilad@benyossef.com> Cc: Hakan Akkan <hakanakkan@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Kevin Hilman <khilman@linaro.org> Cc: Li Zhong <zhong@linux.vnet.ibm.com> Cc: Namhyung Kim <namhyung.kim@lge.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> 2011-08-10 23:21:01 +02:00			`#ifdef CONFIG_NO_HZ_COMMON`
			`enum rq_nohz_flag_bits {`
			`NOHZ_TICK_STOPPED,`
			`NOHZ_BALANCE_KICK,`
			`};`

sched: Revise the inter cluster load balance restrictions The frequency based inter cluster load balance restrictions are not reliable as frequency does not provide a good estimate of the CPU's current load. Replace them with the spill_load and spill_nr_run based checks. The higher capacity cluster is restricted from pulling the tasks from the lower capacity cluster unless all of the lower capacity CPUs are above spill. This behavior can be controlled by a sysctl tunable and it is disabled by default (i.e. no load balance restrictions). Change-Id: I45c09c8adcb61a8a7d4e08beadf2f97f1805fb42 Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org> [joonwoop@codeaurora.org: fixed merge conflicts due to omitted changes for CONFIG_SCHED_QHMP.] 2015-12-04 06:34:03 +05:30			`#define NOHZ_KICK_ANY 0`
			`#define NOHZ_KICK_RESTRICT 1`

			`#define nohz_flags(cpu) (&cpu_rq(cpu)->nohz_flags)`
			`#endif`
sched: Move cputime code to its own file Extract cputime code from the giant sched/core.c and put it in its own file. This make it easier to deal with this particular area and de-bloat a bit more core.c Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> 2012-06-16 15:57:37 +02:00
			`#ifdef CONFIG_IRQ_TIME_ACCOUNTING`

			`DECLARE_PER_CPU(u64, cpu_hardirq_time);`
			`DECLARE_PER_CPU(u64, cpu_softirq_time);`

			`#ifndef CONFIG_64BIT`
			`DECLARE_PER_CPU(seqcount_t, irq_time_seq);`

			`static inline void irq_time_write_begin(void)`
			`{`
			`__this_cpu_inc(irq_time_seq.sequence);`
			`smp_wmb();`
			`}`

			`static inline void irq_time_write_end(void)`
			`{`
			`smp_wmb();`
			`__this_cpu_inc(irq_time_seq.sequence);`
			`}`

			`static inline u64 irq_time_read(int cpu)`
			`{`
			`u64 irq_time;`
			`unsigned seq;`

			`do {`
			`seq = read_seqcount_begin(&per_cpu(irq_time_seq, cpu));`
			`irq_time = per_cpu(cpu_softirq_time, cpu) +`
			`per_cpu(cpu_hardirq_time, cpu);`
			`} while (read_seqcount_retry(&per_cpu(irq_time_seq, cpu), seq));`

			`return irq_time;`
			`}`
			`#else /* CONFIG_64BIT */`
			`static inline void irq_time_write_begin(void)`
			`{`
			`}`

			`static inline void irq_time_write_end(void)`
			`{`
			`}`

			`static inline u64 irq_time_read(int cpu)`
			`{`
			`return per_cpu(cpu_softirq_time, cpu) + per_cpu(cpu_hardirq_time, cpu);`
			`}`
			`#endif /* CONFIG_64BIT */`
			`#endif /* CONFIG_IRQ_TIME_ACCOUNTING */`
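
On 32-bit builds the two 64-bit counters cannot be read in one access, so `irq_time_read()` retries until it sees a stable sequence number. The following is a minimal userspace sketch of that sequence-counter protocol using C11 atomics, assuming a single writer; the struct and function names are hypothetical, and the counters are declared atomic here only to keep the example free of data races under the C memory model.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-CPU irq time state. */
struct irq_time_sketch {
	atomic_uint seq;		/* odd while an update is in flight */
	_Atomic uint64_t softirq;
	_Atomic uint64_t hardirq;
};

/* Single writer: bump seq to odd, update the data, bump seq to even. */
void irq_time_add(struct irq_time_sketch *t, uint64_t soft, uint64_t hard)
{
	unsigned int s = atomic_load_explicit(&t->seq, memory_order_relaxed);

	assert((s & 1) == 0);	/* no nested or concurrent writers */
	atomic_store_explicit(&t->seq, s + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);	/* plays the role of smp_wmb() */

	atomic_store_explicit(&t->softirq,
		atomic_load_explicit(&t->softirq, memory_order_relaxed) + soft,
		memory_order_relaxed);
	atomic_store_explicit(&t->hardirq,
		atomic_load_explicit(&t->hardirq, memory_order_relaxed) + hard,
		memory_order_relaxed);

	atomic_store_explicit(&t->seq, s + 2, memory_order_release);
}

/* Reader: retry while a write was in progress or completed in between. */
uint64_t irq_time_sum(struct irq_time_sketch *t)
{
	unsigned int s1, s2;
	uint64_t sum;

	do {
		s1 = atomic_load_explicit(&t->seq, memory_order_acquire);
		sum = atomic_load_explicit(&t->softirq, memory_order_relaxed) +
		      atomic_load_explicit(&t->hardirq, memory_order_relaxed);
		atomic_thread_fence(memory_order_acquire);
		s2 = atomic_load_explicit(&t->seq, memory_order_relaxed);
	} while ((s1 & 1) || s1 != s2);

	return sum;
}
```

Readers never block the writer, which suits a counter updated from interrupt context; the cost is that a reader may loop if updates are frequent.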
sched/cputime: Fix steal time accounting vs. CPU hotplug commit e9532e69b8d1d1284e8ecf8d2586de34aec61244 upstream. On CPU hotplug the steal time accounting can keep a stale rq->prev_steal_time value over CPU down and up. So after the CPU comes up again the delta calculation in steal_account_process_tick() wreckages itself due to the unsigned math: u64 steal = paravirt_steal_clock(smp_processor_id()); steal -= this_rq()->prev_steal_time; So if steal is smaller than rq->prev_steal_time we end up with an insane large value which then gets added to rq->prev_steal_time, resulting in a permanent wreckage of the accounting. As a consequence the per CPU stats in /proc/stat become stale. Nice trick to tell the world how idle the system is (100%) while the CPU is 100% busy running tasks. Though we prefer realistic numbers. None of the accounting values which use a previous value to account for fractions is reset at CPU hotplug time. update_rq_clock_task() has a sanity check for prev_irq_time and prev_steal_time_rq, but that sanity check solely deals with clock warps and limits the /proc/stat visible wreckage. The prev_time values are still wrong. Solution is simple: Reset rq->prev_*_time when the CPU is plugged in again. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Rik van Riel <riel@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Glauber Costa <glommer@parallels.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Fixes: commit 095c0aa83e52 "sched: adjust scheduler cpu power for stolen time" Fixes: commit aa483808516c "sched: Remove irq time from available CPU power" Fixes: commit e6e6685accfa "KVM guest: Steal time accounting" Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1603041539490.3686@nanos Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> 2016-03-04 15:59:42 +01:00
sched: backport cpufreq hooks from 4.9-rc4 The scheduler cpufreq hooks are required by the schedutil cpufreq governor. Change-Id: Ied6c46262bb33b7e81bbb3d3d2761124e0c676b7 Signed-off-by: Steve Muckle <smuckle@linaro.org> [trivial cherry-picking fixes] Signed-off-by: Juri Lelli <juri.lelli@arm.com> Signed-off-by: Chris Redpath <chris.redpath@arm.com> 2016-11-11 14:04:43 -08:00			`#ifdef CONFIG_CPU_FREQ`
			`DECLARE_PER_CPU(struct update_util_data *, cpufreq_update_util_data);`

			`/**`
			`* cpufreq_update_util - Take a note about CPU utilization changes.`
			`* @rq: Runqueue to carry out the update for.`
			`* @flags: Update reason flags.`
			`*`
			`* This function is called by the scheduler on the CPU whose utilization is`
			`* being updated.`
			`*`
			`* It can only be called from RCU-sched read-side critical sections.`
			`*`
			`* The way cpufreq is currently arranged requires it to evaluate the CPU`
			`* performance state (frequency/voltage) on a regular basis to prevent it from`
			`* being stuck in a completely inadequate performance level for too long.`
			`* That is not guaranteed to happen if the updates are only triggered from CFS,`
			`* though, because they may not be coming in if RT or deadline tasks are active`
			`* all the time (or there are RT and DL tasks only).`
			`*`
			`* As a workaround for that issue, this function is called by the RT and DL`
			`* sched classes to trigger extra cpufreq updates to prevent it from stalling,`
			`* but that really is a band-aid. Going forward it should be replaced with`
			`* solutions targeted more specifically at RT and DL tasks.`
			`*/`
			`static inline void cpufreq_update_util(struct rq *rq, unsigned int flags)`
			`{`
			`struct update_util_data *data;`

			`data = rcu_dereference_sched(*this_cpu_ptr(&cpufreq_update_util_data));`
			`if (data)`
			`data->func(data, rq_clock(rq), flags);`
			`}`

			`static inline void cpufreq_update_this_cpu(struct rq *rq, unsigned int flags)`
			`{`
			`if (cpu_of(rq) == smp_processor_id())`
			`cpufreq_update_util(rq, flags);`
			`}`
			`#else`
			`static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}`
			`static inline void cpufreq_update_this_cpu(struct rq *rq, unsigned int flags) {}`
			`#endif /* CONFIG_CPU_FREQ */`
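
The hook above is a nullable callback pointer: the scheduler calls through it only when a governor has registered one. A simplified userspace sketch of that pattern is shown below. It assumes a single global pointer instead of a per-CPU, RCU-protected one, and every name in it (`util_hook`, `hook_register`, `notify_util_change`, `counting_governor`) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical analogue of struct update_util_data: just a callback. */
struct util_hook {
	void (*func)(struct util_hook *hook, uint64_t time, unsigned int flags);
};

/* Stand-in for the per-CPU pointer; NULL means nobody is listening. */
static struct util_hook *registered_hook;

void hook_register(struct util_hook *hook)
{
	registered_hook = hook;	/* the kernel publishes this via RCU instead */
}

/* Analogue of cpufreq_update_util(): call the hook only if one is set. */
void notify_util_change(uint64_t now, unsigned int flags)
{
	struct util_hook *hook = registered_hook;

	if (hook) {
		assert(hook->func != NULL);
		hook->func(hook, now, flags);
	}
}

/* Example consumer that embeds the hook as its first member. */
struct counting_governor {
	struct util_hook hook;	/* must stay first for the cast below */
	unsigned int updates;
	uint64_t last_time;
};

void counting_update(struct util_hook *hook, uint64_t time, unsigned int flags)
{
	struct counting_governor *gov = (struct counting_governor *)hook;

	(void)flags;
	gov->updates++;
	gov->last_time = time;
}
```

Passing the hook structure back to the callback lets each consumer recover its own state without the caller knowing the consumer's type.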

sched: WALT: account cumulative window demand Energy cost estimation has been a long-standing challenge for WALT because WALT guides CPU frequency based on the CPU utilization of the previous window. Consequently it's not possible to know a newly waking task's energy cost until the end of WALT's current window. WALT already tracks 'Previous Runnable Sum' (prev_runnable_sum) and 'Cumulative Runnable Average' (cr_avg). They are designed for CPU frequency guidance and task placement, but unfortunately neither is suitable for energy cost estimation. Using prev_runnable_sum for energy cost calculation would account CPU and task energy solely based on activity in the previous window, so, for example, any task that had no activity in the previous window would be accounted as a 'zero energy cost' task. Energy estimation with cr_avg is what energy_diff() relies on at present. However, cr_avg can only represent an instantaneous picture of energy cost. For example, if a CPU was fully occupied for an entire WALT window and became idle just before the window boundary, and there is then a wake-up, energy_diff() treats that CPU as a 'zero energy cost' CPU. Therefore, introduce a new accounting unit, 'Cumulative Window Demand'. The cumulative window demand tracks the demands of all tasks seen in the current window, which is neither an instantaneous value nor actual execution time. Because task demand represents the estimated scaled execution time when the task runs for a full window, the accumulation of all the demands represents the predicted CPU load at the end of the window. Thus we can estimate the CPU's frequency at the end of the current WALT window with the cumulative window demand. The use of prev_runnable_sum for CPU frequency guidance and of cr_avg for task placement has not changed, and they continue to serve those purposes; this patch only adds an additional statistic. 
Change-Id: I9908c77ead9973a26dea2b36c001c2baf944d4f5 Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> 2017-02-03 11:15:31 -08:00			`#ifdef CONFIG_SCHED_WALT`

			`static inline bool`
			`walt_task_in_cum_window_demand(struct rq *rq, struct task_struct *p)`
			`{`
			`return cpu_of(rq) == task_cpu(p) &&`
			`(p->on_rq || p->last_sleep_ts >= rq->window_start);`
			`}`

			`#endif /* CONFIG_SCHED_WALT */`
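
The predicate above counts a task toward a runqueue's cumulative window demand when it belongs to that CPU and is either still queued or went to sleep after the current window started. A standalone sketch of the same test follows, with hypothetical reduced structs standing in for `struct rq` and `struct task_struct`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical reduced views of the kernel structures. */
struct task_sketch {
	int cpu;			/* CPU the task is assigned to */
	bool on_rq;			/* currently queued on a runqueue */
	uint64_t last_sleep_ts;		/* timestamp of the most recent sleep */
};

struct rq_sketch {
	int cpu;			/* CPU owning this runqueue */
	uint64_t window_start;		/* start of the current window */
};

/* True if the task's demand belongs to this runqueue's current window. */
bool in_cum_window_demand(const struct rq_sketch *rq,
			  const struct task_sketch *p)
{
	assert(rq && p);

	return rq->cpu == p->cpu &&
	       (p->on_rq || p->last_sleep_ts >= rq->window_start);
}
```

A task that slept before the window began and has not been queued since contributes nothing, which matches the idea that it has shown no demand in this window.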

			`#ifdef arch_scale_freq_capacity`
			`#ifndef arch_scale_freq_invariant`
			`#define arch_scale_freq_invariant() (true)`
			`#endif`
			`#else /* arch_scale_freq_capacity */`
			`#define arch_scale_freq_invariant() (false)`
			`#endif`