WALT accounts for two major statistics: CPU load and cumulative task demand.
The CPU load, which accumulates each CPU's absolute execution time, is used
for CPU frequency guidance, whereas the cumulative task demand, which is each
CPU's instantaneous load and reflects the CPU's load at a given time, is used
for task placement decisions.
Use the cumulative task demand in cpu_util() for task placement and
introduce cpu_util_freq() for frequency guidance.
Change-Id: Id928f01dbc8cb2a617cdadc584c1f658022565c5
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
If an arch opts in by setting CONFIG_THREAD_INFO_IN_TASK_STRUCT,
then thread_info is defined as a single 'u32 flags' and is the first
entry of task_struct. thread_info::task is removed (it serves no
purpose if thread_info is embedded in task_struct), and
thread_info::cpu gets its own slot in task_struct.
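As a rough illustration of the layout just described (simplified stand-in
types, not the exact kernel definitions):

  #include <stdint.h>

  /*
   * Illustrative sketch only: thread_info reduced to a flags word and
   * embedded as the first member of task_struct, with 'cpu' promoted to
   * its own task_struct slot.
   */
  struct thread_info {
          uint32_t flags;                 /* the single 'u32 flags' field */
  };

  struct task_struct {
          struct thread_info thread_info; /* must remain the first entry */
          unsigned int cpu;               /* replaces thread_info::cpu */
          /* ... remaining task_struct fields ... */
  };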
This is heavily based on a patch written by Linus.
Originally-from: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jann Horn <jann@thejh.net>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/a0898196f0476195ca02713691a5037a14f2aac5.1473801993.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Bug: 38331309
Change-Id: I25e5a830f2ada5e74fa93661e97e5e701b1b70d2
(cherry picked from commit c65eacbe290b8141554c71b2c94489e73ade8c8d)
Signed-off-by: Zubin Mithra <zsm@google.com>
Merge 4.4.80 into android-4.4
Changes in 4.4.80
af_key: Add lock to key dump
pstore: Make spinlock per zone instead of global
net: reduce skb_warn_bad_offload() noise
powerpc/pseries: Fix of_node_put() underflow during reconfig remove
crypto: authencesn - Fix digest_null crash
md/raid5: add thread_group worker async_tx_issue_pending_all
drm/vmwgfx: Fix gcc-7.1.1 warning
drm/nouveau/bar/gf100: fix access to upper half of BAR2
KVM: PPC: Book3S HV: Context-switch EBB registers properly
KVM: PPC: Book3S HV: Restore critical SPRs to host values on guest exit
KVM: PPC: Book3S HV: Reload HTM registers explicitly
KVM: PPC: Book3S HV: Save/restore host values of debug registers
Revert "powerpc/numa: Fix percpu allocations to be NUMA aware"
Staging: comedi: comedi_fops: Avoid orphaned proc entry
drm/rcar: Nuke preclose hook
drm: rcar-du: Perform initialization/cleanup at probe/remove time
drm: rcar-du: Simplify and fix probe error handling
perf intel-pt: Fix ip compression
perf intel-pt: Fix last_ip usage
perf intel-pt: Use FUP always when scanning for an IP
perf intel-pt: Ensure never to set 'last_ip' when packet 'count' is zero
xfs: don't BUG() on mixed direct and mapped I/O
nfc: fdp: fix NULL pointer dereference
net: phy: Do not perform software reset for Generic PHY
isdn: Fix a sleep-in-atomic bug
isdn/i4l: fix buffer overflow
ath10k: fix null deref on wmi-tlv when trying spectral scan
wil6210: fix deadlock when using fw_no_recovery option
mailbox: always wait in mbox_send_message for blocking Tx mode
mailbox: skip complete wait event if timer expired
mailbox: handle empty message in tx_tick
mpt3sas: Don't overreach ioc->reply_post[] during initialization
kaweth: fix firmware download
kaweth: fix oops upon failed memory allocation
sched/cgroup: Move sched_online_group() back into css_online() to fix crash
PM / Domains: defer dev_pm_domain_set() until genpd->attach_dev succeeds if present
RDMA/uverbs: Fix the check for port number
libnvdimm, btt: fix btt_rw_page not returning errors
ipmi/watchdog: fix watchdog timeout set on reboot
dentry name snapshots
v4l: s5c73m3: fix negation operator
Make file credentials available to the seqfile interfaces
/proc/iomem: only expose physical resource addresses to privileged users
vlan: Propagate MAC address to VLANs
pstore: Allow prz to control need for locking
pstore: Correctly initialize spinlock and flags
pstore: Use dynamic spinlock initializer
net: skb_needs_check() accepts CHECKSUM_NONE for tx
sched/cputime: Fix prev steal time accouting during CPU hotplug
xen/blkback: don't free be structure too early
xen/blkback: don't use xen_blkif_get() in xen-blkback kthread
tpm: fix a kernel memory leak in tpm-sysfs.c
tpm: Replace device number bitmap with IDR
x86/mce/AMD: Make the init code more robust
r8169: add support for RTL8168 series add-on card.
ARM: dts: n900: Mark eMMC slot with no-sdio and no-sd flags
ipv6: Should use consistent conditional judgement for ip6 fragment between __ip6_append_data and ip6_finish_output
net/mlx4: Remove BUG_ON from ICM allocation routine
drm/msm: Ensure that the hardware write pointer is valid
drm/msm: Verify that MSM_SUBMIT_BO_FLAGS are set
vfio-pci: use 32-bit comparisons for register address for gcc-4.5
irqchip/keystone: Fix "scheduling while atomic" on rt
ASoC: tlv320aic3x: Mark the RESET register as volatile
spi: dw: Make debugfs name unique between instances
ASoC: nau8825: fix invalid configuration in Pre-Scalar of FLL
irqchip/mxs: Enable SKIP_SET_WAKE and MASK_ON_SUSPEND
openrisc: Add _text symbol to fix ksym build error
dmaengine: ioatdma: Add Skylake PCI Dev ID
dmaengine: ioatdma: workaround SKX ioatdma version
dmaengine: ti-dma-crossbar: Add some 'of_node_put()' in error path.
ARM64: zynqmp: Fix W=1 dtc 1.4 warnings
ARM64: zynqmp: Fix i2c node's compatible string
ARM: s3c2410_defconfig: Fix invalid values for NF_CT_PROTO_*
ACPI / scan: Prefer devices without _HID/_CID for _ADR matching
usb: gadget: Fix copy/pasted error message
Btrfs: adjust outstanding_extents counter properly when dio write is split
tools lib traceevent: Fix prev/next_prio for deadline tasks
xfrm: Don't use sk_family for socket policy lookups
perf tools: Install tools/lib/traceevent plugins with install-bin
perf symbols: Robustify reading of build-id from sysfs
video: fbdev: cobalt_lcdfb: Handle return NULL error from devm_ioremap
vfio-pci: Handle error from pci_iomap
arm64: mm: fix show_pte KERN_CONT fallout
nvmem: imx-ocotp: Fix wrong register size
sh_eth: enable RX descriptor word 0 shift on SH7734
ALSA: usb-audio: test EP_FLAG_RUNNING at urb completion
HID: ignore Petzl USB headlamp
scsi: fnic: Avoid sending reset to firmware when another reset is in progress
scsi: snic: Return error code on memory allocation failure
ASoC: dpcm: Avoid putting stream state to STOP when FE stream is paused
Linux 4.4.80
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit 3d89e5478bf550a50c99e93adf659369798263b0 upstream.
Commit:
e9532e69b8d1 ("sched/cputime: Fix steal time accounting vs. CPU hotplug")
... set rq->prev_* to 0 after a CPU hotplug comes back, in order to
fix the case where (after CPU hotplug) steal time is smaller than
rq->prev_steal_time.
However, this should never happen. Steal time was only smaller because of the
KVM-specific bug fixed by the previous patch. Worse, commit e9532e69b8d1
itself triggers a bug on CPU hot-unplug/plug operations: because
rq->prev_steal_time is cleared, all of the CPU's past steal time will be
accounted again on hot-plug.
Since the root cause has been fixed, we can just revert commit e9532e69b8d1.
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 'commit e9532e69b8d1 ("sched/cputime: Fix steal time accounting vs. CPU hotplug")'
Link: http://lkml.kernel.org/r/1465813966-3116-3-git-send-email-wanpeng.li@hotmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andres Oportus <andresoportus@google.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Add appropriate #ifdef guards to ensure the SMP-only EAS stats structs
are not used when SMP is not enabled. Arnd got a report from the buildbot,
analysed it, and pointed out exactly what the issue was.
Reported-by: "Arnd Bergmann" <arnd@arndb.de>
Suggested-by: "Arnd Bergmann" <arnd@arndb.de>
Fixes: 4b85765a3d ("sched/fair: Add eas (& cas)
specific rq, sd and task stats")
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Change-Id: I60554dea20137f6774db3f59b4afd40a06554cfc
When a task moves from/to a cfs_rq, we set a flag which is then used to
propagate the change at parent level (sched_entity and cfs_rq) during
next update. If the cfs_rq is throttled, the flag will stay pending until
the cfs_rq is unthrottled.
For propagating the utilization, we copy the utilization of group cfs_rq to
the sched_entity.
For propagating the load, we have to take into account the load of the
whole task group in order to evaluate the load of the sched_entity.
Similarly to what was done before the rewrite of PELT, we add a correction
factor in case the task group's load is greater than its share, so that it
will contribute the same load as a task of equal weight.
Change-Id: Id34a9888484716961c9027299c0b4d82881a39d1
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: kernellwp@gmail.com
Cc: pjt@google.com
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1478598827-32372-5-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 09a43ace1f986b003c118fdf6ddf1fd685692d49)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Fix the insertion of cfs_rq in rq->leaf_cfs_rq_list to ensure that a
child will always be called before its parent.
The hierarchical order in shares update list has been introduced by
commit:
67e86250f8 ("sched: Introduce hierarchal order on shares update list")
With the current implementation a child can be still put after its
parent.
Let's take the example of:

      root
       \
        b
       /\
      c  d*
         |
         e*
with root -> b -> c already enqueued but not d -> e so the
leaf_cfs_rq_list looks like: head -> c -> b -> root -> tail
The branch d -> e will be added the first time that they are enqueued,
starting with e then d.
When e is added, its parent is not already on the list, so e is put at
the tail: head -> c -> b -> root -> e -> tail
Then, d is added at the head because its parent is already on the
list: head -> d -> c -> b -> root -> e -> tail
e is not placed at the right position and will be processed last,
whereas it should be processed before its parent.
Because enqueueing follows the bottom-up sequence, we are sure that we
will finish by adding either a cfs_rq without a parent or a cfs_rq whose
parent is already on the list. We can use this event to detect when we
have finished adding a new branch. For the others, whose parents are not
already added, we have to ensure that they will be added after their
children that have just been inserted in the steps before, and after any
potential parents that are already in the list. The easiest way is to put
the cfs_rq just after the last inserted one and to keep track of it until
the branch is fully added.
Change-Id: I4fe0b8502ea628c13d14e8e5c5279bce67fb8845
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: kernellwp@gmail.com
Cc: pjt@google.com
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1478598827-32372-3-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 9c2791f936ef5fd04a118b5c284f2c9a95f4a647)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
A new task's util_avg is set to full utilization of a CPU (100% time
running). This accelerates a new task's utilization ramp-up, useful to
boost its execution in early time. However, it may result in
(insanely) high utilization for a transient time period when a flood
of tasks are spawned. Importantly, it violates the "fundamentally
bounded" CPU utilization, and its side effect is negative if we don't
take any measure to bound it.
This patch proposes an algorithm to address this issue. It has
two methods to approach a sensible initial util_avg:
(1) An expected (or average) util_avg based on its cfs_rq's util_avg:
util_avg = cfs_rq->util_avg / (cfs_rq->load_avg + 1) * se.load.weight
(2) A trajectory of how successive new tasks' util develops, which
gives 1/2 of the left utilization budget to a new task such that
the additional util is noticeably large (when overall util is low) or
unnoticeably small (when overall util is high enough). In the meantime,
the aggregate utilization is well bounded:
util_avg_cap = (1024 - cfs_rq->avg.util_avg) / 2^n
where n denotes the nth task.
If util_avg is larger than util_avg_cap, then the effective util is
clamped to the util_avg_cap.
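A minimal, compilable sketch of the two estimates and the clamping (the
helper name and the cfs_rq stand-in are illustrative, not the kernel
implementation):

  #include <stdio.h>

  #define SCHED_CAPACITY_SCALE 1024UL

  /* Simplified stand-in for the cfs_rq fields used by the estimate. */
  struct cfs_rq_sketch { unsigned long util_avg, load_avg; };

  static unsigned long init_task_util(const struct cfs_rq_sketch *cfs_rq,
                                      unsigned long se_load_weight)
  {
          /* (2) half of the utilization budget left on this cfs_rq */
          unsigned long cap = (SCHED_CAPACITY_SCALE - cfs_rq->util_avg) / 2;
          /* (1) expected util based on the cfs_rq's current averages */
          unsigned long util = cfs_rq->util_avg * se_load_weight /
                               (cfs_rq->load_avg + 1);

          /* clamp the estimate to the remaining budget */
          return util > cap ? cap : util;
  }

  int main(void)
  {
          struct cfs_rq_sketch rq = { .util_avg = 512, .load_avg = 1024 };

          /* half of the remaining budget wins here: prints 256 */
          printf("initial util_avg: %lu\n", init_task_util(&rq, 1024));
          return 0;
  }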
Change-Id: Idafe989b24d9e70911666f09800bf1d5a011e1f4
Reported-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Yuyang Du <yuyang.du@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: morten.rasmussen@arm.com
Cc: pjt@google.com
Cc: steve.muckle@linaro.org
Link: http://lkml.kernel.org/r/1459283456-21682-1-git-send-email-yuyang.du@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 2b8c41daba327c633228169e8bd8ec067ab443f8)
[integrate with schedfreq - schedfreq has a tuneable for init task util
but this commit removes the use of the tuneable since we have a new
algorithm for calculating an initial utilisation. I've left the tuneable
in place, but it is no longer used even when schedfreq is the CPUFreq
governor]
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
This will allow the iteration in the wakeup path to start from a cpu with
max or min original capacity, regardless of which cpu the scheduler is
currently running on (smp_processor_id()) or the previous cpu of the task
(task_cpu(p)). This iteration has to happen on a sched_domain spanning
all cpus, in the order of the sched_groups of this sched_domain as seen by
the starting cpu.
In case of an SMP system, the first cpu with max orig capacity and the
one with min orig capacity are the same. This can temporarily happen
on a big.LITTLE system with hotplug as well.
E.g. the different order of cpu iteration can be used to map schedtune
task parameter 'boosted' into the cpu iteration order in
find_best_target().
Use of READ_ONCE()/WRITE_ONCE() to avoid load/store tearing.
Change-Id: I812fbd9c7e5f506617e456c0eec3edcd2c016e92
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
(cherry picked from commit fd6e9543c1fd8971a5e2e68e39b2f6e591d46114)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
struct sched_group_capacity currently represents the compute capacity
sum of all CPUs in the sched_group.
Unless it is divided by the group_weight to get the average capacity
per CPU, it hides differences in CPU capacity for mixed capacity systems
(e.g. high RT/IRQ utilization or ARM big.LITTLE).
But even the average may not be sufficient if the group covers CPUs of
different capacities.
Instead, by extending struct sched_group_capacity to indicate min per-CPU
capacity in the group a suitable group for a given task utilization can
more easily be found such that CPUs with reduced capacity can be avoided
for tasks with high utilization (not implemented by this patch).
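A sketch of the extension, with the struct trimmed down to the fields
relevant here (not the full kernel declaration):

  /*
   * Illustrative sketch: struct sched_group_capacity extended with the
   * minimum per-CPU capacity of the group, alongside the existing sum.
   */
  struct sched_group_capacity {
          unsigned long capacity;         /* sum of all CPU capacities */
          unsigned long min_capacity;     /* minimum per-CPU capacity */
          /* ... */
  };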
Change-Id: If3cae1be62d01a199e752bca5abb45357d5d0fbd
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: freedom.tan@mediatek.com
Cc: keita.kobayashi.ym@renesas.com
Cc: mgalbraith@suse.de
Cc: sgurrappadi@nvidia.com
Cc: vincent.guittot@linaro.org
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1476452472-24740-4-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit bf475ce0a3dd75b5d1df6c6c14ae25168caa15ac)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
The scheduler cpufreq hooks are required by the schedutil cpufreq
governor.
Change-Id: Ied6c46262bb33b7e81bbb3d3d2761124e0c676b7
Signed-off-by: Steve Muckle <smuckle@linaro.org>
[trivial cherry-picking fixes]
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Merge 4.4.70 into android-4.4
Changes in 4.4.70
usb: misc: legousbtower: Fix buffers on stack
usb: misc: legousbtower: Fix memory leak
USB: ene_usb6250: fix DMA to the stack
watchdog: pcwd_usb: fix NULL-deref at probe
char: lp: fix possible integer overflow in lp_setup()
USB: core: replace %p with %pK
ARM: tegra: paz00: Mark panel regulator as enabled on boot
tpm_crb: check for bad response size
infiniband: call ipv6 route lookup via the stub interface
dm btree: fix for dm_btree_find_lowest_key()
dm raid: select the Kconfig option CONFIG_MD_RAID0
dm bufio: avoid a possible ABBA deadlock
dm bufio: check new buffer allocation watermark every 30 seconds
dm cache metadata: fail operations if fail_io mode has been established
dm bufio: make the parameter "retain_bytes" unsigned long
dm thin metadata: call precommit before saving the roots
dm space map disk: fix some book keeping in the disk space map
md: update slab_cache before releasing new stripes when stripes resizing
rtlwifi: rtl8821ae: setup 8812ae RFE according to device type
mwifiex: pcie: fix cmd_buf use-after-free in remove/reset
ima: accept previously set IMA_NEW_FILE
KVM: x86: Fix load damaged SSEx MXCSR register
KVM: X86: Fix read out-of-bounds vulnerability in kvm pio emulation
regulator: tps65023: Fix inverted core enable logic.
s390/kdump: Add final note
s390/cputime: fix incorrect system time
ath9k_htc: Add support of AirTies 1eda:2315 AR9271 device
ath9k_htc: fix NULL-deref at probe
drm/amdgpu: Avoid overflows/divide-by-zero in latency_watermark calculations.
drm/amdgpu: Make display watermark calculations more accurate
drm/nouveau/therm: remove ineffective workarounds for alarm bugs
drm/nouveau/tmr: ack interrupt before processing alarms
drm/nouveau/tmr: fix corruption of the pending list when rescheduling an alarm
drm/nouveau/tmr: avoid processing completed alarms when adding a new one
drm/nouveau/tmr: handle races with hw when updating the next alarm time
cdc-acm: fix possible invalid access when processing notification
proc: Fix unbalanced hard link numbers
of: fix sparse warning in of_pci_range_parser_one
iio: dac: ad7303: fix channel description
pid_ns: Sleep in TASK_INTERRUPTIBLE in zap_pid_ns_processes
pid_ns: Fix race between setns'ed fork() and zap_pid_ns_processes()
USB: serial: ftdi_sio: fix setting latency for unprivileged users
USB: serial: ftdi_sio: add Olimex ARM-USB-TINY(H) PIDs
ext4 crypto: don't let data integrity writebacks fail with ENOMEM
ext4 crypto: fix some error handling
net: qmi_wwan: Add SIMCom 7230E
fscrypt: fix context consistency check when key(s) unavailable
f2fs: check entire encrypted bigname when finding a dentry
fscrypt: avoid collisions when presenting long encrypted filenames
sched/fair: Do not announce throttled next buddy in dequeue_task_fair()
sched/fair: Initialize throttle_count for new task-groups lazily
usb: host: xhci-plat: propagate return value of platform_get_irq()
xhci: apply PME_STUCK_QUIRK and MISSING_CAS quirk for Denverton
usb: host: xhci-mem: allocate zeroed Scratchpad Buffer
net: irda: irda-usb: fix firmware name on big-endian hosts
usbvision: fix NULL-deref at probe
mceusb: fix NULL-deref at probe
ttusb2: limit messages to buffer size
usb: musb: tusb6010_omap: Do not reset the other direction's packet size
USB: iowarrior: fix info ioctl on big-endian hosts
usb: serial: option: add Telit ME910 support
USB: serial: qcserial: add more Lenovo EM74xx device IDs
USB: serial: mct_u232: fix big-endian baud-rate handling
USB: serial: io_ti: fix div-by-zero in set_termios
USB: hub: fix SS hub-descriptor handling
USB: hub: fix non-SS hub-descriptor handling
ipx: call ipxitf_put() in ioctl error path
iio: proximity: as3935: fix as3935_write
ceph: fix recursion between ceph_set_acl() and __ceph_setattr()
gspca: konica: add missing endpoint sanity check
s5p-mfc: Fix unbalanced call to clock management
dib0700: fix NULL-deref at probe
zr364xx: enforce minimum size when reading header
dvb-frontends/cxd2841er: define symbol_rate_min/max in T/C fe-ops
cx231xx-audio: fix init error path
cx231xx-audio: fix NULL-deref at probe
cx231xx-cards: fix NULL-deref at probe
powerpc/book3s/mce: Move add_taint() later in virtual mode
powerpc/pseries: Fix of_node_put() underflow during DLPAR remove
powerpc/64e: Fix hang when debugging programs with relocated kernel
ARM: dts: at91: sama5d3_xplained: fix ADC vref
ARM: dts: at91: sama5d3_xplained: not all ADC channels are available
arm64: xchg: hazard against entire exchange variable
arm64: uaccess: ensure extension of access_ok() addr
arm64: documentation: document tagged pointer stack constraints
xc2028: Fix use-after-free bug properly
mm/huge_memory.c: respect FOLL_FORCE/FOLL_COW for thp
staging: rtl8192e: fix 2 byte alignment of register BSSIDR.
staging: rtl8192e: rtl92e_get_eeprom_size Fix read size of EPROM_CMD.
iommu/vt-d: Flush the IOTLB to get rid of the initial kdump mappings
metag/uaccess: Fix access_ok()
metag/uaccess: Check access_ok in strncpy_from_user
uwb: fix device quirk on big-endian hosts
genirq: Fix chained interrupt data ordering
osf_wait4(): fix infoleak
tracing/kprobes: Enforce kprobes teardown after testing
PCI: Fix pci_mmap_fits() for HAVE_PCI_RESOURCE_TO_USER platforms
PCI: Freeze PME scan before suspending devices
drm/edid: Add 10 bpc quirk for LGD 764 panel in HP zBook 17 G2
nfsd: encoders mustn't use unitialized values in error cases
drivers: char: mem: Check for address space wraparound with mmap()
Linux 4.4.70
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit 094f469172e00d6ab0a3130b0e01c83b3cf3a98d upstream.
A cgroup created inside a throttled group must inherit the current
throttle_count. A broken throttle_count allows throttled entries to be
nominated as the next buddy; later this leads to a NULL pointer dereference
in pick_next_task_fair().
This patch initializes cfs_rq->throttle_count at the first enqueue: laziness
allows us to skip locking all rqs at group creation. The lazy approach also
allows skipping a full sub-tree scan of the throttling hierarchy (not in this
patch).
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Link: http://lkml.kernel.org/r/146608182119.21870.8439834428248129633.stgit@buzz
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Ben Pineau <benjamin.pineau@mirakl.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
EAS uses "const struct sched_group_energy * const" fairly consistently.
But a couple of places swap the "*" and second "const", making the
pointer mutable.
In the case of struct sched_group, "* const" would have been an error,
since init_sched_energy() writes to sd->groups->sge.
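A small illustration of the difference (example declarations only, not the
kernel's):

  struct sched_group_energy { unsigned long nr_cap_states; /* ... */ };

  struct const_example {
          /* pointer to const data AND a const pointer: neither the energy
           * data nor the pointer itself may be modified */
          const struct sched_group_energy *const a;

          /* '*' and the second 'const' swapped: both qualifiers now apply
           * to the pointee, leaving the pointer itself mutable */
          const struct sched_group_energy const *b;

          /* sched_group-style member: const data but a mutable pointer,
           * so init_sched_energy()-like code can still assign it */
          const struct sched_group_energy *sge;
  };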
Change-Id: Ic6a8fcf99e65c0f25d9cc55c32625ef3ca5c9aca
Signed-off-by: Greg Hackmann <ghackmann@google.com>
Use do_div() instead of "/" operator to fix undefined references to
"__aeabi_uldivmod" build error for ARCH=arm.
Also in TP_fast_assign(), along with the do_div() usage, replace "," with
";", since the "," would have resulted in a syntax error (!):
'#define TP_fast_assign(args...) args' would have stripped off the ","
and left white space between these two assignments after the CPP phase.
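A hedged example of the pattern: the do_div() below is a host-side stand-in
with the same contract as the kernel macro (divide the 64-bit dividend in
place, return the remainder), and the constant is illustrative:

  #include <stdint.h>
  #include <stdio.h>

  /* Stand-in for the kernel's do_div(): updates the dividend in place and
   * returns the 32-bit remainder (GNU statement-expression for brevity). */
  #define do_div(n, base) ({                              \
          uint32_t __rem = (uint32_t)((n) % (base));      \
          (n) /= (base);                                  \
          __rem;                                          \
  })

  int main(void)
  {
          uint64_t walt_ns = 123456789012ULL;

          /* 'walt_ns / 1000000' would pull in __aeabi_uldivmod on 32-bit
           * ARM; dividing in place via do_div() avoids that. */
          do_div(walt_ns, 1000000U);
          printf("demand in ms: %llu\n", (unsigned long long)walt_ns);
          return 0;
  }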
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
[jstultz: Cherry-picked from common/android-3.18]
Signed-off-by: John Stultz <john.stultz@linaro.org>
The CPU utilization reported when WALT is in use already tracks the
contributions due to RT and DL workloads. However, SchedFreq exposes
different capacity update functions, one for each class, and aggregates the
classes' utilization internally at update_cpu_capacity_request() call time.
This patch ensures that when WALT is in use, the:
cpu_sched_capacity_reqs::cfs
value is tracking just the load generated by SCHED_OTHER tasks.
Change-Id: Ibd9c9a10874a1d91f62477034548f7664e57cd6a
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Use a window-based view of time in order to track task
demand and CPU utilization in the scheduler.
Window Assisted Load Tracking (WALT) implementation credits:
Srivatsa Vaddagiri, Steve Muckle, Syed Rameez Mustafa, Joonwoo Park,
Pavan Kumar Kondeti, Olav Haugan
2016-03-06: Integration with EAS/refactoring by Vikram Mulukutla
and Todd Kjos
Change-Id: I21408236836625d4e7d7de1843d20ed5ff36c708
Includes fixes for issues:
eas/walt: Use walt_ktime_clock() instead of ktime_get_ns() to avoid a
race resulting in watchdog resets
BUG: 29353986
Change-Id: Ic1820e22a136f7c7ebd6f42e15f14d470f6bbbdb
Handle WALT accounting anomaly during resume
During resume, there is a corner case where on wakeup, a task's
prev_runnable_sum can go negative. This is a workaround that
fixes the condition and warns (instead of crashing).
BUG: 29464099
Change-Id: I173e7874324b31a3584435530281708145773508
Signed-off-by: Todd Kjos <tkjos@google.com>
Signed-off-by: Srinath Sridharan <srinathsr@google.com>
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
[jstultz: fwdported to 4.4]
Signed-off-by: John Stultz <john.stultz@linaro.org>
Contains:
sched/tune: fix accounting for runnable tasks (1/5)
The accounting for tasks into boost groups of different CPUs is currently
broken mainly because:
a) we do not properly track the change of boost group of a RUNNABLE task
b) there are race conditions between migration code and accounting code
This patch provides fixes to ensure enqueue/dequeue
accounting also for throttled tasks.
Without this patch it can happen that a task is enqueued into a throttled
RQ and thus is not accounted for in the boosting of the corresponding RQ.
We could argue that a throttled task should not boost a CPU, however:
a) properly implementing CPU boosting considering throttled tasks will
increase a lot the complexity of the solution
b) it's not easy to quantify the benefits introduced by such a more
complex solution
Since task throttling requires the usage of the CFS bandwidth controller,
which is not widely used on mobile systems (at least not by Android kernels
so far), for the time being we go for the simple solution and boost also
for throttled RQs.
sched/tune: fix accounting for runnable tasks (2/5)
This patch provides the code required to enforce proper locking.
A per-boost-group spinlock has been added to grant atomic
accounting of tasks as well as to serialise enqueue/dequeue operations,
triggered by task migrations, with cgroups' attach/detach operations.
sched/tune: fix accounting for runnable tasks (3/5)
This patch adds cgroups {allow,can,cancel}_attach callbacks.
Since a task can be migrated between boost groups while it's running,
the CGroups's attach callbacks have been added to properly migrate
boost contributions of RUNNABLE tasks.
The RQ's lock is used to serialise enqueue/dequeue operations, triggered
by task migrations, with cgroups' attach/detach operations, while the
SchedTune CPU lock is used to grant atomicity of the accounting within
the CPU.
NOTE: the current implementation does not allow a concurrent CPU migration
and cgroup change.
sched/tune: fix accounting for runnable tasks (4/5)
This fixes accounting for exiting tasks by adding a dedicated call early
in the do_exit() syscall, which disables SchedTune accounting as soon as a
task is flagged PF_EXITING.
This flag is set before the multiple dequeue/enqueue dance triggered
by cgroup_exit(), which only injects useless task movements and thus
increases the chances of race conditions with the migration code.
The schedtune_exit_task() call does the last dequeue of a task from its
current boost group. This is a solution more aligned with what happens in
mainline kernels (>v4.4), where cgroup_exit() no longer moves a dying
task to the root control group.
sched/tune: fix accounting for runnable tasks (5/5)
To avoid accounting issues at startup, this patch disables the SchedTune
accounting until the required data structures have been properly
initialized.
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
[jstultz: fwdported to 4.4]
Signed-off-by: John Stultz <john.stultz@linaro.org>
Currently the build for a single-core (e.g. user-mode) Linux is broken
and this configuration is required (at least) to run some network tests.
The main issues for the current code support on single-core systems are:
1. {se,rq}::sched_avg is not available nor maintained for !SMP systems
This means that load and utilisation signals are NOT available in single
core systems. All the EAS code depends on these signals.
2. sched_group_energy is also SMP dependent. Again, this means that all the
EAS setup and preparation code (energy model initialization) has to be
properly guarded/disabled for !SMP systems.
3. SchedFreq depends on utilization signal, which is not available on
!SMP systems.
4. SchedTune is useless on unicore systems if SchedFreq is not available.
5. WALT machinery is not required on single-core systems.
This patch addresses all these issues by enforcing some constraints for
single-core systems:
a) WALT, SchedTune and SchedFreq are now dependent on SMP
b) The default governor for !SMP systems is INTERACTIVE
c) The energy model initialisation/build functions are now guarded by
   CONFIG_SMP
d) Other minor code re-arrangements and CONFIG_SMP guarding to enable
single core builds.
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Doing an exponential moving average per nr_running++/-- does not
guarantee a fixed sample rate, which induces errors if there are lots of
threads being enqueued/dequeued from the rq (Linpack mt). Instead of
keeping track of the avg, the scheduler now keeps track of the integral
of nr_running and allows the readers to perform filtering on top.
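A minimal sketch of that integral bookkeeping (names are illustrative, not
the original implementation): accumulate nr_running weighted by elapsed time
on every change and let readers divide by their own window.

  #include <stdint.h>

  /* Illustrative per-cpu accumulator: nr_running integrated over time. */
  struct nr_stats_sketch {
          unsigned int nr_running;        /* current value */
          uint64_t nr_running_integral;   /* sum of nr_running * dt, in ns */
          uint64_t last_update_ns;        /* timestamp of the last change */
  };

  /* Called on every enqueue/dequeue with the current time in ns. */
  static void nr_running_update(struct nr_stats_sketch *s, int delta,
                                uint64_t now_ns)
  {
          s->nr_running_integral +=
                  (uint64_t)s->nr_running * (now_ns - s->last_update_ns);
          s->last_update_ns = now_ns;
          s->nr_running += delta;
  }

  /* Readers filter however they like, e.g. an average over a window. */
  static unsigned int avg_nr_running(uint64_t integral_delta,
                                     uint64_t window_ns)
  {
          return window_ns ? (unsigned int)(integral_delta / window_ns) : 0;
  }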
Original-author: Sai Charan Gurrappadi <sgurrappadi@nvidia.com>
Change-Id: Id946654f32fa8be0eaf9d8fa7c9a8039b5ef9fab
Signed-off-by: Joseph Lo <josephl@nvidia.com>
Signed-off-by: Andrew Bresticker <abrestic@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/174694
Reviewed-on: https://chromium-review.googlesource.com/272853
[jstultz: fwdported to 4.4]
Signed-off-by: John Stultz <john.stultz@linaro.org>
Instead of monitoring the exec time of deadline tasks to evaluate the
CPU capacity consumed by deadline scheduler class, we can directly
calculate it thanks to the sum of utilization of deadline tasks on the
CPU. We can remove deadline tasks from rt_avg metric and directly use
the average bandwidth of deadline scheduler in scale_rt_capacity.
Based in part on a similar patch from Luca Abeni <luca.abeni@unitn.it>.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Steve Muckle <smuckle@linaro.org>
rt_avg is only used to scale the available CPU's capacity for CFS
tasks. As the update of this scaling is done during periodic load
balance, we only have to ensure that sched_avg_update has been called
before any periodic load balancing. This requirement is already
fulfilled by __update_cpu_load so the call in sched_rt_avg_update,
which is part of the hotpath, is useless.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Steve Muckle <smuckle@linaro.org>
Since the true utilization of a long running task is not detectable
while it is running and might be bigger than the current cpu capacity,
create the maximum cpu capacity head room by requesting the maximum
cpu capacity once the cpu usage plus the capacity margin exceeds the
current capacity. This is also done to try to harm the performance of
the task as little as possible.
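A minimal sketch of that rule, assuming a multiplicative ~25% margin and
illustrative helper names (not the patch's actual code):

  #define SCHED_CAPACITY_SCALE    1024UL
  #define CAPACITY_MARGIN         1280UL  /* ~25% head room, assumed value */

  /* Return the capacity to request for a cpu: jump straight to the maximum
   * once usage plus margin no longer fits the current capacity. */
  static unsigned long capacity_request(unsigned long cpu_util,
                                        unsigned long cur_capacity,
                                        unsigned long max_capacity)
  {
          if (cpu_util * CAPACITY_MARGIN / SCHED_CAPACITY_SCALE > cur_capacity)
                  return max_capacity;

          return cpu_util;
  }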
Original fair-class only version authored by Juri Lelli
<juri.lelli@arm.com>.
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Steve Muckle <smuckle@linaro.org>
Patch "sched/fair: add triggers for OPP change requests" introduced OPP
change triggers for enqueue_task_fair(), but the trigger was operating only
for wakeups. Fact is that it makes sense to consider wakeup_new also (i.e.,
fork()), as we don't know anything about a newly created task and thus we
most certainly want to jump to max OPP to not harm performance too much.
However, it is not currently possible (or at least it wasn't evident to me
how to do so :/) to tell new wakeups from other (non wakeup) operations.
This patch introduces an additional flag in sched.h that is only set at
fork() time and it is then consumed in enqueue_task_fair() for our purpose.
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Steve Muckle <smuckle@linaro.org>
Scheduler-driven CPU frequency selection hopes to exploit both
per-task and global information in the scheduler to improve frequency
selection policy, achieving lower power consumption, improved
responsiveness/performance, and less reliance on heuristics and
tunables. For further discussion on the motivation of this integration
see [0].
This patch implements a shim layer between the Linux scheduler and the
cpufreq subsystem. The interface accepts capacity requests from the
CFS, RT and deadline sched classes. The requests from each sched class
are summed on each CPU with a margin applied to the CFS and RT
capacity requests to provide some headroom. Deadline requests are
expected to be precise enough given their nature to not require
headroom. The maximum total capacity request for a CPU in a frequency
domain drives the requested frequency for that domain.
Policy is determined by both the sched classes and this shim layer.
Note that this algorithm is event-driven. There is no polling loop to
check cpu idle time nor any other method which is unsynchronized with
the scheduler, aside from a throttling mechanism to ensure frequency
changes are not attempted faster than the hardware can accommodate them.
Thanks to Juri Lelli <juri.lelli@arm.com> for contributing design ideas,
code and test results, and to Ricky Liang <jcliang@chromium.org>
for initialization and static key inc/dec fixes.
[0] http://article.gmane.org/gmane.linux.kernel/1499836
[smuckle@linaro.org: various additions and fixes, revised commit text]
CC: Ricky Liang <jcliang@chromium.org>
Signed-off-by: Michael Turquette <mturquette@baylibre.com>
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Steve Muckle <smuckle@linaro.org>
To maximize throughput in systems with reduced-capacity cpus (e.g.
high RT/IRQ load and/or ARM big.LITTLE), load-balancing has to consider
task and cpu utilization as well as per-cpu compute capacity in addition
to the current average-load-based load-balancing policy. Tasks that are
scheduled on a reduced-capacity
cpu need to be identified and migrated to a higher capacity cpu if
possible.
To implement this additional policy an additional group_type
(load-balance scenario) is added: group_misfit_task. This represents
scenarios where a sched_group has tasks that are not suitable for its
per-cpu capacity. group_misfit_task is only considered if the system is
not overloaded in any other way (group_imbalanced or group_overloaded).
Identifying misfit tasks requires the rq lock to be held. To avoid
taking remote rq locks to examine source sched_groups for misfit tasks,
each cpu is responsible for tracking misfit tasks itself and updating
the rq->misfit_task flag. This means checking task utilization when
tasks are scheduled and on sched_tick.
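A hedged sketch of that per-cpu flag update (the structures, helper and the
80% fit margin are stand-ins, not the patch's code):

  /* Simplified stand-ins for the structures involved. */
  struct task_sketch { unsigned long util_avg; };
  struct rq_sketch   { unsigned long capacity; int misfit_task; };

  /* A task fits if its utilization leaves some margin of the cpu's
   * capacity free (80% here, an assumed figure). */
  static int task_fits(const struct task_sketch *p, const struct rq_sketch *rq)
  {
          return p->util_avg * 5 < rq->capacity * 4;
  }

  /* Called for the current task with the local rq lock held, e.g. from the
   * scheduler tick, so no remote rq locks are needed to spot misfit tasks. */
  static void update_misfit_flag(struct rq_sketch *rq, struct task_sketch *curr)
  {
          rq->misfit_task = !task_fits(curr, rq);
  }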
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
struct sched_group_capacity currently represents the compute capacity
sum of all cpus in the sched_group. Unless it is divided by the
group_weight to get the average capacity per cpu it hides differences in
cpu capacity for mixed capacity systems (e.g. high RT/IRQ utilization or
ARM big.LITTLE). But even the average may not be sufficient if the group
covers cpus of different capacities. Instead, by extending struct
sched_group_capacity to indicate max per-cpu capacity in the group a
suitable group for a given task utilization can easily be found such
that cpus with reduced capacity can be avoided for tasks with high
utilization (not implemented by this patch).
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Wakeup balancing uses cpu capacity awareness and needs to know the
system-wide maximum cpu capacity.
Patch "sched: Store system-wide maximum cpu capacity in root domain"
finds the system-wide maximum cpu capacity during scheduler domain
hierarchy setup. This is sufficient as long as maximum frequency
invariance is not enabled.
If it is enabled, the system-wide maximum cpu capacity can change
between scheduler domain hierarchy setups due to frequency capping.
The cpu capacity is changed in update_cpu_capacity() which is called in
load balance on the lowest scheduler domain hierarchy level. To be able
to know if a change in cpu capacity for a certain cpu also has an effect
on the system-wide maximum cpu capacity it is normally necessary to
iterate over all cpus. This would be way too costly. That's why this
patch follows a different approach.
The unsigned long max_cpu_capacity value in struct root_domain is
replaced with a struct max_cpu_capacity, containing value (the
max_cpu_capacity) and cpu (the cpu index of the cpu providing the
maximum cpu_capacity).
Changes to the system-wide maximum cpu capacity and the cpu index are
made if:
1 System-wide maximum cpu capacity < cpu capacity
2 System-wide maximum cpu capacity > cpu capacity and cpu index == cpu
There are no changes to the system-wide maximum cpu capacity in all
other cases.
Atomic read and write access to the pair (max_cpu_capacity.val,
max_cpu_capacity.cpu) is enforced by max_cpu_capacity.lock.
The access to max_cpu_capacity.val in task_fits_max() is still performed
without taking the max_cpu_capacity.lock.
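A sketch of the resulting root_domain layout, with simplified stand-in types
(not the kernel definitions):

  /* Stand-in for the kernel spinlock type. */
  typedef struct { int locked; } raw_spinlock_sketch_t;

  struct max_cpu_capacity {
          raw_spinlock_sketch_t lock;     /* serialises updates of the pair */
          unsigned long val;              /* system-wide max cpu capacity */
          int cpu;                        /* cpu providing that capacity */
  };

  struct root_domain_sketch {
          /* replaces the old 'unsigned long max_cpu_capacity' member */
          struct max_cpu_capacity max_cpu_capacity;
          /* ... */
  };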
The code to set max cpu capacity in build_sched_domains() has been
removed because the whole functionality is now provided by
update_cpu_capacity() instead.
This approach can temporarily introduce errors, e.g. in case the cpu
currently providing the max cpu capacity has its cpu capacity lowered
due to frequency capping and calls update_cpu_capacity() before any cpu
which might now provide the max cpu capacity.
There is also an outstanding question:
Should the cpu capacity of a cpu going idle be set to a very small
value?
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
The idle-state of each cpu is currently pointed to by rq->idle_state but
there isn't any information in the struct cpuidle_state that can be used to
look up the idle-state energy model data stored in struct
sched_group_energy. For this purpose it is necessary to store the idle
state index as well. Ideally, the idle-state data should be unified.
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Energy-aware scheduling is only meant to be active while the system is
_not_ over-utilized. That is, there are spare cycles available to shift
tasks around based on their actual utilization to get a more
energy-efficient task distribution without depriving any tasks. When
above the tipping point task placement is done the traditional way based
on load_avg, spreading the tasks across as many cpus as possible based
on priority scaled load to preserve smp_nice. Below the tipping point we
want to use util_avg instead. We need to define a criteria for when we
make the switch.
The util_avg for each cpu converges towards 100% (1024) regardless of
how many additional tasks we may put on it. If we define
over-utilized as:
sum_{cpus}(rq.cfs.avg.util_avg) + margin > sum_{cpus}(rq.capacity)
some individual cpus may be over-utilized running multiple tasks even
when the above condition is false. That should be okay as long as we try
to spread the tasks out to avoid per-cpu over-utilization as much as
possible and if all tasks have the _same_ priority. If the latter isn't
true, we have to consider priority to preserve smp_nice.
For example, we could have n_cpus nice=-10 util_avg=55% tasks and
n_cpus/2 nice=0 util_avg=60% tasks. Balancing based on util_avg, we are
likely to end up with nice=-10 tasks sharing cpus and nice=0 tasks
getting their own, as we have 1.5*n_cpus tasks in total and 55%+55% is less
over-utilized than 55%+60% for those cpus that have to be shared. The
system utilization is only 85% of the system capacity, but we are
breaking smp_nice.
To be sure not to break smp_nice, we have instead defined over-utilization
conservatively as when any cpu in the system is fully utilized at its
highest frequency:
cpu_rq(any).cfs.avg.util_avg + margin > cpu_rq(any).capacity
IOW, as soon as one cpu is (nearly) 100% utilized, we switch to load_avg
to factor in priority to preserve smp_nice.
With this definition, we can skip periodic load-balance as no cpu has an
always-running task when the system is not over-utilized. All tasks will
be periodic and we can balance them at wake-up. This conservative
condition does however mean that some scenarios that could benefit from
energy-aware decisions even if one cpu is fully utilized would not get
those benefits.
For systems where some cpus might have reduced capacity
(RT pressure and/or big.LITTLE), we want periodic load-balance checks as
soon as just a single cpu is fully utilized, as it might be one of those with
reduced capacity, and in that case we want to migrate it.
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
For energy-aware load-balancing decisions it is necessary to know the
energy consumption estimates of groups of cpus. This patch introduces a
basic function, sched_group_energy(), which estimates the energy
consumption of the cpus in the group and any resources shared by the
members of the group.
NOTE: The function has five levels of indentation and breaks the 80
character limit. Refactoring is necessary.
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Add another member to the family of per-cpu sched_domain shortcut
pointers. This one, sd_ea, points to the highest level at which energy
model is provided. At this level and all levels below all sched_groups
have energy model data attached.
Partial energy model information is possible but restricted to providing
energy model data for lower level sched_domains (sd_ea and below) and
leaving load-balancing on levels above to non-energy-aware
load-balancing. For example, it is possible to apply energy-aware
scheduling within each socket on a multi-socket system and let normal
scheduling handle load-balancing between sockets.
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
The struct sched_group_energy represents the per sched_group related
data which is needed for energy aware scheduling. It contains:
(1) number of elements of the idle state array
(2) pointer to the idle state array which comprises 'power consumption'
for each idle state
(3) number of elements of the capacity state array
(4) pointer to the capacity state array which comprises 'compute
capacity and power consumption' tuples for each capacity state
The struct sched_group obtains a pointer to a struct sched_group_energy.
The function pointer sched_domain_energy_f is introduced into struct
sched_domain_topology_level which will allow the arch to pass a particular
struct sched_group_energy from the topology shim layer into the scheduler
core.
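A simplified sketch of the data structures and hook described above
(illustrative declarations, not the exact kernel ones):

  struct idle_state {
          unsigned long power;            /* power consumed in this idle state */
  };

  struct capacity_state {
          unsigned long cap;              /* compute capacity */
          unsigned long power;            /* power consumed at this capacity */
  };

  struct sched_group_energy {
          unsigned int nr_idle_states;                /* (1) */
          const struct idle_state *idle_states;       /* (2) */
          unsigned int nr_cap_states;                 /* (3) */
          const struct capacity_state *cap_states;    /* (4) */
  };

  /* Per-level hook the arch uses to hand its energy data to the scheduler;
   * the 'int cpu' parameter exists for the reasons discussed below. */
  typedef const struct sched_group_energy *(*sched_domain_energy_f)(int cpu);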
The function pointer sched_domain_energy_f has an 'int cpu' parameter
since the folding of two adjacent sd levels via sd degenerate doesn't work
for all sd levels. I.e. it is not possible for example to use this feature
to provide per-cpu energy in sd level DIE on ARM's TC2 platform.
It was discussed that the folding of sd levels approach is preferable
over the cpu parameter approach, simply because the user (the arch
specifying the sd topology table) can introduce less errors. But since
it is not working, the 'int cpu' parameter is the only way out. It's
possible to use the folding of sd levels approach for
sched_domain_flags_f and the cpu parameter approach for the
sched_domain_energy_f at the same time though. With the use of the
'int cpu' parameter, an extra check function has to be provided to make
sure that all cpus spanned by a sched group are provisioned with the same
energy data.
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
To be able to compare the capacity of the target cpu with the highest
cpu capacity of the system in the wakeup path, store the system-wide
maximum cpu capacity in the root domain.
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
commit e9532e69b8d1d1284e8ecf8d2586de34aec61244 upstream.
On CPU hotplug the steal time accounting can keep a stale rq->prev_steal_time
value over CPU down and up. So after the CPU comes up again the delta
calculation in steal_account_process_tick() wreckages itself due to the
unsigned math:
u64 steal = paravirt_steal_clock(smp_processor_id());
steal -= this_rq()->prev_steal_time;
So if steal is smaller than rq->prev_steal_time we end up with an insanely large
value which then gets added to rq->prev_steal_time, resulting in a permanent
wreckage of the accounting. As a consequence the per CPU stats in /proc/stat
become stale.
Nice trick to tell the world how idle the system is (100%) while the CPU is
100% busy running tasks. Though we prefer realistic numbers.
None of the accounting values which use a previous value to account for
fractions is reset at CPU hotplug time. update_rq_clock_task() has a sanity
check for prev_irq_time and prev_steal_time_rq, but that sanity check solely
deals with clock warps and limits the /proc/stat visible wreckage. The
prev_time values are still wrong.
Solution is simple: Reset rq->prev_*_time when the CPU is plugged in again.
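A minimal sketch of that reset, using simplified stand-ins for the rq fields
(the hook name is illustrative):

  #include <stdint.h>

  /* Stand-in for the per-cpu runqueue members involved. */
  struct rq_sketch {
          uint64_t prev_steal_time;       /* CONFIG_PARAVIRT */
          uint64_t prev_steal_time_rq;    /* CONFIG_PARAVIRT_TIME_ACCOUNTING */
          uint64_t prev_irq_time;         /* CONFIG_IRQ_TIME_ACCOUNTING */
  };

  /* Run when the CPU is plugged back in, before it accounts anything, so
   * the next delta is computed against a fresh baseline. */
  static void rq_reset_prev_times(struct rq_sketch *rq)
  {
          rq->prev_steal_time = 0;
          rq->prev_steal_time_rq = 0;
          rq->prev_irq_time = 0;
  }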
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Fixes: commit 095c0aa83e "sched: adjust scheduler cpu power for stolen time"
Fixes: commit aa48380851 "sched: Remove irq time from available CPU power"
Fixes: commit e6e6685acc "KVM guest: Steal time accounting"
Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1603041539490.3686@nanos
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Explain how the control dependency and smp_rmb() end up providing
ACQUIRE semantics and pair with smp_store_release() in
finish_lock_switch().
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The parameter "int next_cpu" in the following function is unused:
migrate_task_rq(struct task_struct *p, int next_cpu)
Remove it.
Signed-off-by: xiaofeng.yan <yanxiaofeng@inspur.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/1442991360-31945-1-git-send-email-yanxiaofeng@inspur.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Mike Meyer reported the following bug:
> During evaluation of some performance data, it was discovered thread
> and run queue run_delay accounting data was inconsistent with the other
> accounting data that was collected. Further investigation found under
> certain circumstances execution time was leaking into the task and
> run queue accounting of run_delay.
>
> Consider the following sequence:
>
> a. thread is running.
> b. thread moves between cgroups, changes scheduling class or priority.
> c. thread sleeps OR
> d. thread involuntarily gives up cpu.
>
> a. implies:
>
> thread->sched_info.last_queued = 0
>
> a. and b. results in the following:
>
> 1. dequeue_task(rq, thread)
>
> sched_info_dequeued(rq, thread)
> delta = 0
>
> sched_info_reset_dequeued(thread)
> thread->sched_info.last_queued = 0
>
> thread->sched_info.run_delay += delta
>
> 2. enqueue_task(rq, thread)
>
> sched_info_queued(rq, thread)
>
> /* thread is still on cpu at this point. */
> thread->sched_info.last_queued = task_rq(thread)->clock;
>
> c. results in:
>
> dequeue_task(rq, thread)
>
> sched_info_dequeued(rq, thread)
>
> /* delta is execution time not run_delay. */
> delta = task_rq(thread)->clock - thread->sched_info.last_queued
>
> sched_info_reset_dequeued(thread)
> thread->sched_info.last_queued = 0
>
> thread->sched_info.run_delay += delta
>
> Since thread was running between enqueue_task(rq, thread) and
> dequeue_task(rq, thread), the delta above is really execution
> time and not run_delay.
>
> d. results in:
>
> __sched_info_switch(thread, next_thread)
>
> sched_info_depart(rq, thread)
>
> sched_info_queued(rq, thread)
>
> /* last_queued not updated due to being non-zero */
> return
>
> Since thread was running between enqueue_task(rq, thread) and
> __sched_info_switch(thread, next_thread), the execution time
> between enqueue_task(rq, thread) and
> __sched_info_switch(thread, next_thread) now will become
> associated with run_delay due to when last_queued was last updated.
>
This alternative patch solves the problem by not calling
sched_info_{de,}queued() in {de,en}queue_task(). Therefore the
sched_info state is preserved and things work as expected.
By inlining the {de,en}queue_task() functions the new condition
becomes (mostly) a compile-time constant and we'll not emit any new
branch instructions.
It even shrinks the code (due to inlining {en,de}queue_task()):
$ size defconfig-build/kernel/sched/core.o defconfig-build/kernel/sched/core.o.orig
   text    data     bss     dec     hex filename
  64019   23378    2344   89741   15e8d defconfig-build/kernel/sched/core.o
  64149   23378    2344   89871   15f0f defconfig-build/kernel/sched/core.o.orig
Reported-by: Mike Meyer <Mike.Meyer@Teradata.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20150930154413.GO3604@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
So the problem this patch is trying to address is as follows:
  CPU0                            CPU1

  context_switch(A, B)
                                  ttwu(A)
                                    LOCK A->pi_lock
                                    A->on_cpu == 0
  finish_task_switch(A)
    prev_state = A->state  <-.
    WMB                      |
    A->on_cpu = 0;           |
    UNLOCK rq0->lock         |
                             |    context_switch(C, A)
                             `--  A->state = TASK_DEAD
                                  prev_state == TASK_DEAD
                                    put_task_struct(A)
  context_switch(A, C)
    finish_task_switch(A)
      A->state == TASK_DEAD
        put_task_struct(A)
The argument being that the WMB will allow the load of A->state on CPU0
to cross over and observe CPU1's store of A->state, which will then
result in a double-drop and use-after-free.
Now the comment states (and this was true once upon a long time ago)
that we need to observe A->state while holding rq->lock because that
will order us against the wakeup; however the wakeup will not in fact
acquire (that) rq->lock; it takes A->pi_lock these days.
We can obviously fix this by upgrading the WMB to an MB, but that is
expensive, so we'd rather avoid that.
The alternative this patch takes is: smp_store_release(&A->on_cpu, 0),
which avoids the MB on some archs, but not important ones like ARM.
Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: <stable@vger.kernel.org> # v3.1+
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Cc: manfred@colorfullife.com
Cc: will.deacon@arm.com
Fixes: e4a52bcb9a ("sched: Remove rq->lock from the first half of ttwu()")
Link: http://lkml.kernel.org/r/20150929124509.GG3816@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Move dl_time_before() static definition in include/linux/sched/deadline.h
so that it can be used by different parties without being re-defined.
Reported-by: Luca Abeni <luca.abeni@unitn.it>
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1441188096-23021-3-git-send-email-juri.lelli@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Most of the policy-tests are done via the <class>_policy() helpers with
the notable exception of idle. A new wrapper for valid_policy() has also
been added to improve readability in set_load_weight().
This commit does not change the logical behavior of the scheduler core.
Signed-off-by: Henrik Austad <henrik@austad.us>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/1441810841-4756-1-git-send-email-henrik@austad.us
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Besides the existing frequency scale-invariance correction factor, apply
CPU scale-invariance correction factor to utilization tracking to
compensate for any differences in compute capacity. This could be due to
micro-architectural differences (i.e. instructions per second) between
cpus in HMP systems (e.g. big.LITTLE), and/or differences in the current
maximum frequency supported by individual cpus in SMP systems. In the
existing implementation utilization isn't comparable between cpus as it
is relative to the capacity of each individual CPU.
Each segment of the sched_avg.util_sum geometric series is now scaled
by the CPU performance factor too so the sched_avg.util_avg of each
sched entity will be invariant from the particular CPU of the HMP/SMP
system on which the sched entity is scheduled.
With this patch, the utilization of a CPU stays relative to the max CPU
performance of the fastest CPU in the system.
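A hedged sketch of the scaling applied to each accumulated segment (the
helper name and argument sources are assumptions; only the arithmetic
follows the description above):

  #define SCHED_CAPACITY_SCALE    1024UL

  /* Scale one segment of running time by both the frequency factor and the
   * cpu (micro-architecture / max-frequency) factor before it is added to
   * sched_avg.util_sum, making util_avg invariant of the cpu it ran on. */
  static unsigned long scale_util_contrib(unsigned long delta_running,
                                          unsigned long scale_freq,
                                          unsigned long scale_cpu)
  {
          delta_running = delta_running * scale_freq / SCHED_CAPACITY_SCALE;
          delta_running = delta_running * scale_cpu / SCHED_CAPACITY_SCALE;
          return delta_running;
  }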
In contrast to utilization (sched_avg.util_sum), load
(sched_avg.load_sum) should not be scaled by compute capacity. The
utilization metric is based on running time which only makes sense when
cpus are _not_ fully utilized (utilization cannot go beyond 100% even if
more tasks are added), where load is runnable time which isn't limited
by the capacity of the CPU and therefore is a better metric for
overloaded scenarios. If we run two nice-0 busy loops on two cpus with
different compute capacity their load should be similar since their
compute demands are the same. We have to assume that the compute demand
of any task running on a fully utilized CPU (no spare cycles = 100%
utilization) is high and the same no matter of the compute capacity of
its current CPU, hence we shouldn't scale load by CPU capacity.
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/55CE7409.1000700@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Bring arch_scale_cpu_capacity() in line with the recent change of its
arch_scale_freq_capacity() sibling in commit dfbca41f34 ("sched:
Optimize freq invariant accounting") from weak function to #define to
allow inlining of the function.
While at it, remove the ARCH_CAPACITY sched_feature as well. With the
change to #define there isn't a straightforward way to allow runtime
switch between an arch implementation and the default implementation of
arch_scale_cpu_capacity() using sched_feature. The default was to use
the arch-specific implementation, but only the arm architecture provides
one and that is essentially equivalent to the default implementation.
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Dietmar Eggemann <Dietmar.Eggemann@arm.com>
Cc: Juri Lelli <Juri.Lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: daniel.lezcano@linaro.org
Cc: mturquette@baylibre.com
Cc: pang.xunlei@zte.com.cn
Cc: rjw@rjwysocki.net
Cc: sgurrappadi@nvidia.com
Cc: vincent.guittot@linaro.org
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1439569394-11974-3-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>