The way the schedutil governor uses the PELT metric causes it to
underestimate the CPU utilization in some cases.
That can be easily demonstrated by running kernel compilation on
a Sandy Bridge Intel processor, running turbostat in parallel with
it and looking at the values written to the MSR_IA32_PERF_CTL
register. Namely, the expected result would be that when all CPUs
were 100% busy, all of them would be requested to run in the maximum
P-state, but observation shows that this clearly isn't the case.
The CPUs run in the maximum P-state for a while and then are
requested to run slower and go back to the maximum P-state after
a while again. That causes the actual frequency of the processor to
visibly oscillate below the sustainable maximum in a jittery fashion
which clearly is not desirable.
That has been attributed to CPU utilization metric updates on task
migration that cause the total utilization value for the CPU to be
reduced by the utilization of the migrated task. If that happens,
the schedutil governor may see a CPU utilization reduction and will
attempt to reduce the CPU frequency accordingly right away. That
may be premature, though, for example if the system is generally
busy and there are other runnable tasks waiting to be run on that
CPU already.
This is unlikely to be an issue on systems where cpufreq policies are
shared between multiple CPUs, because in those cases the policy
utilization is computed as the maximum of the CPU utilization values
over the whole policy and if that turns out to be low, reducing the
frequency for the policy most likely is a good idea anyway. On
systems with one CPU per policy, however, it may affect performance
adversely and even lead to increased energy consumption in some cases.
On those systems it may be addressed by taking another utilization
metric into consideration, like whether or not the CPU whose
frequency is about to be reduced has been idle recently, because if
that's not the case, the CPU is likely to be busy in the near future
and its frequency should not be reduced.
To that end, use the counter of idle calls in the timekeeping code.
Namely, make the schedutil governor look at that counter for the
current CPU every time before its frequency is about to be reduced.
If the counter has not changed since the previous iteration of the
governor computations for that CPU, the CPU has been busy for all
that time and its frequency should not be decreased, so if the new
frequency would be lower than the one set previously, the governor
will skip the frequency update.
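
The check described above can be modelled in user space as follows. This is a minimal sketch rather than the kernel code: `struct gov_cpu`, `cpu_is_busy()` and `next_freq()` are hypothetical names, and the idle-call count is passed in as a plain argument instead of being read from the timekeeping code.

```c
#include <stdbool.h>

/* Hypothetical per-CPU governor state; field names are illustrative. */
struct gov_cpu {
	unsigned long saved_idle_calls;	/* idle-call count seen at the last update */
	unsigned int last_freq;		/* frequency requested at the last update (kHz) */
};

/*
 * Returns true if the idle-call counter has not advanced since the last
 * update, i.e. the CPU never entered idle in between. The current counter
 * value is always recorded for the next comparison.
 */
static bool cpu_is_busy(struct gov_cpu *c, unsigned long idle_calls_now)
{
	bool busy = (idle_calls_now == c->saved_idle_calls);

	c->saved_idle_calls = idle_calls_now;
	return busy;
}

/*
 * Pick the frequency to request. A reduction is skipped while the CPU has
 * been continuously busy; increases are always applied.
 */
static unsigned int next_freq(struct gov_cpu *c, unsigned int wanted,
			      unsigned long idle_calls_now)
{
	bool busy = cpu_is_busy(c, idle_calls_now);

	if (busy && wanted < c->last_freq)
		return c->last_freq;

	c->last_freq = wanted;
	return wanted;
}
```

In this model a lower request is honoured only after the counter has moved, which is the signal that the CPU went idle at least once since the previous update.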
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Joel Fernandes <joelaf@google.com>
(cherry picked from commit b7eaf1aab9f8bd2e49fceed77ebc66c1b5800718)
(simple CPUFREQ_RT_DL vs CPUFREQ_DL usage conflicts)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Change-Id: I531ec02c052944ee07a904dc2a25c59948ee762b
The loop in sugov_next_freq_shared() contains an if block to skip the
loop for the current CPU. This turns out to be an unnecessary
conditional in the scheduler's hot-path for every CPU in the policy.
It would be better to drop the conditional and make the loop treat all
the CPUs in the same way. That would eliminate the need to call
sugov_iowait_boost() at the top of the routine.
To keep the code optimized to return early if the current CPU has RT/DL
flags set, move the flags check to sugov_update_shared() instead in
order to avoid the function call entirely.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit cba1dfb57b94c234728b689d9b00d4267fa1a879)
(modified for SCHED_CPUFREQ_DL vs SCHED_CPUFREQ_RT)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Change-Id: Ie046fdc8eda46821356750edd0fb6f7d077af363
sugov_start() only initializes struct sugov_cpu per-CPU structures
for shared policies, but it should do that for single-CPU policies too.
That in particular makes the IO-wait boost mechanism work in the
cases when cpufreq policies correspond to individual CPUs.
Fixes: 21ca6d2c52f8 (cpufreq: schedutil: Add iowait boosting)
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: 4.9+ <stable@vger.kernel.org> # 4.9+
(cherry picked from commit 4296f23ed49a15d36949458adcc66ff993dee2a8)
(we use SCHED_CPUFREQ_DL instead of SCHED_CPUFREQ_RT in cpu->flags)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Change-Id: I5b837a0ee4432115d85caa1a9808ea61e1e1b07f
get_next_freq() uses sg_cpu only to get sg_policy, which the callers of
get_next_freq() already have. Pass sg_policy instead of sg_cpu to
get_next_freq(), to make it more efficient.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit 655cb1ebff4b7918fc560502c3297af2d3c7d114)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Change-Id: Ia210058da32930a6cdb18258aa679cd1a44a747e
cached_raw_freq applies to the entire cpufreq policy and not individual
CPUs. Apart from wasting per-cpu memory, it is actually wrong to keep it
in struct sugov_cpu as we may end up comparing next_freq with a stale
cached_raw_freq of a random CPU.
Move cached_raw_freq to struct sugov_policy.
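
To illustrate why the cache belongs to the policy, here is a simplified user-space model. All names (`struct policy_state`, `resolve_freq()`, `get_next_freq_model()`) are hypothetical, the frequency table lookup merely stands in for the driver's frequency resolution, and a `next_freq` of 0 is assumed to mean "nothing cached yet".

```c
#include <stddef.h>

/* Hypothetical per-policy state shared by every CPU in the policy. */
struct policy_state {
	const unsigned int *table;	/* supported frequencies, ascending (kHz) */
	size_t nr_freqs;
	unsigned int cached_raw_freq;	/* last raw request seen for this policy */
	unsigned int next_freq;		/* frequency chosen for that request, 0 if none */
	unsigned int resolve_calls;	/* number of table lookups, for illustration */
};

/* Map a raw request to the lowest table entry that satisfies it. */
static unsigned int resolve_freq(struct policy_state *p, unsigned int raw)
{
	size_t i;

	p->resolve_calls++;
	for (i = 0; i < p->nr_freqs; i++)
		if (p->table[i] >= raw)
			return p->table[i];
	return p->table[p->nr_freqs - 1];
}

/* Reuse the previous result when the raw request for the policy is unchanged. */
static unsigned int get_next_freq_model(struct policy_state *p, unsigned int raw)
{
	if (raw == p->cached_raw_freq && p->next_freq != 0)
		return p->next_freq;

	p->cached_raw_freq = raw;
	p->next_freq = resolve_freq(p, raw);
	return p->next_freq;
}
```

Because `cached_raw_freq` and `next_freq` live in the same structure, the comparison can never pair a policy-wide `next_freq` with a raw value last written by a different CPU.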
Fixes: 5cbea46984d6 (cpufreq: schedutil: map raw required frequency to driver frequency)
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry-picked from 6c4f0fa643cb9e775dcc976e3db00d649468ff1d)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Change-Id: Ie91420f710819b383947f9031da9be1f3bb7f636
This patch rectifies a comment present in sugov_irq_work() function to
follow proper grammar.
Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit d06e622d3d9206e6a2cc45a0f9a3256da8773ff4)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Change-Id: Iaf996445d411725639d511432cc424086892a146
Execute the irq-work specific initialization/exit code only when the
fast path isn't available.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit 21ef57297b15a49b0c4dd4e7135c1a08e9a29a1c)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Change-Id: Icfd68f455ef71846d799fcd2d8ec6aa1bf59573e
The fast_switch_enabled flag will be used by both sugov_policy_alloc()
and sugov_policy_free() with a later patch.
Prepare for that by moving the calls to enable and disable it to the
beginning of sugov_init() and end of sugov_exit().
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit 4a71ce4348bb61740d411822357061f8bf870f4c)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Change-Id: Ia174f423ca02d59360657ac2e77a5098ce5cf99c
Switch to the more common practice of writing labels.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit 8e2ddb03643eb9d0bc4926946d7ce0d308eef0a5)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Change-Id: Ida75c99cf3dff5cae24d3866454c83bcdb3385b9
When placement boost is active, we are currently considering
only the highest capacity cluster. If all of the active CPUs
in this cluster are busy with RT tasks, the waking task is
placed on its previous CPU, which may be running an RT task.
This results in suboptimal performance. Fix this by expanding
the search to the other clusters, when there is no eligible CPU
found in the highest capacity cluster.
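
The fallback search can be sketched as below. This is an illustrative model, not the scheduler code: `struct cluster`, `find_cpu_without_rt()` and `select_boosted_cpu()` are hypothetical names, the clusters are assumed to be ordered from highest to lowest capacity, and "eligible" is reduced to "not currently running an RT task".

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of one cluster: the CPU ids it contains. */
struct cluster {
	const int *cpus;
	size_t nr_cpus;
};

/* Return the first CPU in the cluster that is not running an RT task, or -1. */
static int find_cpu_without_rt(const struct cluster *cl, const bool *runs_rt)
{
	size_t i;

	for (i = 0; i < cl->nr_cpus; i++)
		if (!runs_rt[cl->cpus[i]])
			return cl->cpus[i];
	return -1;
}

/*
 * Walk the clusters from highest to lowest capacity and take the first
 * eligible CPU. Fall back to prev_cpu only if no cluster has one.
 */
static int select_boosted_cpu(const struct cluster *clusters, size_t nr_clusters,
			      const bool *runs_rt, int prev_cpu)
{
	size_t i;

	for (i = 0; i < nr_clusters; i++) {
		int cpu = find_cpu_without_rt(&clusters[i], runs_rt);

		if (cpu >= 0)
			return cpu;
	}
	return prev_cpu;
}
```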
Change-Id: Iaab2e397b994c2b219dc086c7a6fa91ca26a5128
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
Due to a rounding error, the hrtimer tick interval becomes 3333333 ns when
HZ=300. Consequently, the tick time stamp nearest to WALT's default window
size of 20 ms is 19999998 ns (3333333 * 6), not 20000000 ns.
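
The arithmetic can be reproduced with truncating integer division. The helper names below are hypothetical and only illustrate the rounding; they are not taken from the WALT code.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Tick period in ns for a given HZ, using truncating integer division. */
static uint64_t tick_period_ns(unsigned int hz)
{
	return NSEC_PER_SEC / hz;
}

/* Latest tick time stamp (a multiple of the tick period) not after target_ns. */
static uint64_t last_tick_at_or_before(uint64_t target_ns, unsigned int hz)
{
	uint64_t period = tick_period_ns(hz);

	return (target_ns / period) * period;
}
```

With HZ=300 the period is 3333333 ns and the sixth tick lands at 19999998 ns, 2 ns short of a 20 ms window. With HZ=250 the period divides 20 ms exactly, so no shortfall appears.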
Change-Id: I08f9bd2dbecccbb683e4490d06d8b0da703d3ab2
Suggested-by: Joel Fernandes <joelaf@google.com>
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
* refs/heads/tmp-64a73ff:
Linux 4.4.76
KVM: nVMX: Fix exception injection
KVM: x86: zero base3 of unusable segments
KVM: x86/vPMU: fix undefined shift in intel_pmu_refresh()
KVM: x86: fix emulation of RSM and IRET instructions
cpufreq: s3c2416: double free on driver init error path
iommu/amd: Fix incorrect error handling in amd_iommu_bind_pasid()
iommu: Handle default domain attach failure
iommu/vt-d: Don't over-free page table directories
ocfs2: o2hb: revert hb threshold to keep compatible
x86/mm: Fix flush_tlb_page() on Xen
x86/mpx: Correctly report do_mpx_bt_fault() failures to user-space
ARM: 8685/1: ensure memblock-limit is pmd-aligned
ARM64/ACPI: Fix BAD_MADT_GICC_ENTRY() macro implementation
sched/loadavg: Avoid loadavg spikes caused by delayed NO_HZ accounting
watchdog: bcm281xx: Fix use of uninitialized spinlock.
xfrm: Oops on error in pfkey_msg2xfrm_state()
xfrm: NULL dereference on allocation failure
xfrm: fix stack access out of bounds with CONFIG_XFRM_SUB_POLICY
jump label: fix passing kbuild_cflags when checking for asm goto support
ravb: Fix use-after-free on `ifconfig eth0 down`
sctp: check af before verify address in sctp_addr_id2transport
net/mlx4_core: Eliminate warning messages for SRQ_LIMIT under SRIOV
perf probe: Fix to show correct locations for events on modules
be2net: fix status check in be_cmd_pmac_add()
s390/ctl_reg: make __ctl_load a full memory barrier
swiotlb: ensure that page-sized mappings are page-aligned
coredump: Ensure proper size of sparse core files
x86/mpx: Use compatible types in comparison to fix sparse error
mac80211: initialize SMPS field in HT capabilities
spi: davinci: use dma_mapping_error()
scsi: lpfc: avoid double free of resource identifiers
HID: i2c-hid: Add sleep between POWER ON and RESET
kernel/panic.c: add missing \n
ibmveth: Add a proper check for the availability of the checksum features
vxlan: do not age static remote mac entries
virtio_net: fix PAGE_SIZE > 64k
vfio/spapr: fail tce_iommu_attach_group() when iommu_data is null
drm/amdgpu: check ring being ready before using
net: dsa: Check return value of phy_connect_direct()
amd-xgbe: Check xgbe_init() return code
platform/x86: ideapad-laptop: handle ACPI event 1
scsi: virtio_scsi: Reject commands when virtqueue is broken
xen-netfront: Fix Rx stall during network stress and OOM
swiotlb-xen: update dev_addr after swapping pages
virtio_console: fix a crash in config_work_handler
Btrfs: fix truncate down when no_holes feature is enabled
gianfar: Do not reuse pages from emergency reserve
powerpc/eeh: Enable IO path on permanent error
net: bgmac: Remove superflous netif_carrier_on()
net: bgmac: Start transmit queue in bgmac_open
net: bgmac: Fix SOF bit checking
bgmac: Fix reversed test of build_skb() return value.
mtd: bcm47xxpart: don't fail because of bit-flips
bgmac: fix a missing check for build_skb
mtd: bcm47xxpart: limit scanned flash area on BCM47XX (MIPS) only
MIPS: ralink: fix MT7628 wled_an pinmux gpio
MIPS: ralink: fix MT7628 pinmux typos
MIPS: ralink: Fix invalid assignment of SoC type
MIPS: ralink: fix USB frequency scaling
MIPS: ralink: MT7688 pinmux fixes
net: korina: Fix NAPI versus resources freeing
MIPS: ath79: fix regression in PCI window initialization
net: mvneta: Fix for_each_present_cpu usage
ARM: dts: BCM5301X: Correct GIC_PPI interrupt flags
qla2xxx: Fix erroneous invalid handle message
scsi: lpfc: Set elsiocb contexts to NULL after freeing it
scsi: sd: Fix wrong DPOFUA disable in sd_read_cache_type
KVM: x86: fix fixing of hypercalls
mm: numa: avoid waiting on freed migrated pages
block: fix module reference leak on put_disk() call for cgroups throttle
sysctl: enable strict writes
usb: gadget: f_fs: Fix possibe deadlock
drm/vmwgfx: Free hash table allocated by cmdbuf managed res mgr
ALSA: hda - set input_path bitmap to zero after moving it to new place
ALSA: hda - Fix endless loop of codec configure
MIPS: Fix IRQ tracing & lockdep when rescheduling
MIPS: pm-cps: Drop manual cache-line alignment of ready_count
MIPS: Avoid accidental raw backtrace
mm, swap_cgroup: reschedule when neeed in swap_cgroup_swapoff()
drm/ast: Handle configuration without P2A bridge
NFSv4: fix a reference leak caused WARNING messages
netfilter: synproxy: fix conntrackd interaction
netfilter: xt_TCPMSS: add more sanity tests on tcph->doff
rtnetlink: add IFLA_GROUP to ifla_policy
ipv6: Do not leak throw route references
sfc: provide dummy definitions of vswitch functions
net: 8021q: Fix one possible panic caused by BUG_ON in free_netdev
decnet: always not take dst->__refcnt when inserting dst into hash table
net/mlx5: Wait for FW readiness before initializing command interface
ipv6: fix calling in6_ifa_hold incorrectly for dad work
igmp: add a missing spin_lock_init()
igmp: acquire pmc lock for ip_mc_clear_src()
net: caif: Fix a sleep-in-atomic bug in cfpkt_create_pfx
Fix an intermittent pr_emerg warning about lo becoming free.
af_unix: Add sockaddr length checks before accessing sa_family in bind and connect handlers
net: Zero ifla_vf_info in rtnl_fill_vfinfo()
decnet: dn_rtmsg: Improve input length sanitization in dnrmg_receive_user_skb
net: don't call strlen on non-terminated string in dev_set_alias()
ipv6: release dst on error in ip6_dst_lookup_tail
UPSTREAM: selinux: enable genfscon labeling for tracefs
Change-Id: I05ae1d6271769a99ea3817e5066f5ab6511f3254
Signed-off-by: Blagovest Kolenichev <bkolenichev@codeaurora.org>
Merge 4.4.76 into android-4.4
Changes in 4.4.76
ipv6: release dst on error in ip6_dst_lookup_tail
net: don't call strlen on non-terminated string in dev_set_alias()
decnet: dn_rtmsg: Improve input length sanitization in dnrmg_receive_user_skb
net: Zero ifla_vf_info in rtnl_fill_vfinfo()
af_unix: Add sockaddr length checks before accessing sa_family in bind and connect handlers
Fix an intermittent pr_emerg warning about lo becoming free.
net: caif: Fix a sleep-in-atomic bug in cfpkt_create_pfx
igmp: acquire pmc lock for ip_mc_clear_src()
igmp: add a missing spin_lock_init()
ipv6: fix calling in6_ifa_hold incorrectly for dad work
net/mlx5: Wait for FW readiness before initializing command interface
decnet: always not take dst->__refcnt when inserting dst into hash table
net: 8021q: Fix one possible panic caused by BUG_ON in free_netdev
sfc: provide dummy definitions of vswitch functions
ipv6: Do not leak throw route references
rtnetlink: add IFLA_GROUP to ifla_policy
netfilter: xt_TCPMSS: add more sanity tests on tcph->doff
netfilter: synproxy: fix conntrackd interaction
NFSv4: fix a reference leak caused WARNING messages
drm/ast: Handle configuration without P2A bridge
mm, swap_cgroup: reschedule when neeed in swap_cgroup_swapoff()
MIPS: Avoid accidental raw backtrace
MIPS: pm-cps: Drop manual cache-line alignment of ready_count
MIPS: Fix IRQ tracing & lockdep when rescheduling
ALSA: hda - Fix endless loop of codec configure
ALSA: hda - set input_path bitmap to zero after moving it to new place
drm/vmwgfx: Free hash table allocated by cmdbuf managed res mgr
usb: gadget: f_fs: Fix possibe deadlock
sysctl: enable strict writes
block: fix module reference leak on put_disk() call for cgroups throttle
mm: numa: avoid waiting on freed migrated pages
KVM: x86: fix fixing of hypercalls
scsi: sd: Fix wrong DPOFUA disable in sd_read_cache_type
scsi: lpfc: Set elsiocb contexts to NULL after freeing it
qla2xxx: Fix erroneous invalid handle message
ARM: dts: BCM5301X: Correct GIC_PPI interrupt flags
net: mvneta: Fix for_each_present_cpu usage
MIPS: ath79: fix regression in PCI window initialization
net: korina: Fix NAPI versus resources freeing
MIPS: ralink: MT7688 pinmux fixes
MIPS: ralink: fix USB frequency scaling
MIPS: ralink: Fix invalid assignment of SoC type
MIPS: ralink: fix MT7628 pinmux typos
MIPS: ralink: fix MT7628 wled_an pinmux gpio
mtd: bcm47xxpart: limit scanned flash area on BCM47XX (MIPS) only
bgmac: fix a missing check for build_skb
mtd: bcm47xxpart: don't fail because of bit-flips
bgmac: Fix reversed test of build_skb() return value.
net: bgmac: Fix SOF bit checking
net: bgmac: Start transmit queue in bgmac_open
net: bgmac: Remove superflous netif_carrier_on()
powerpc/eeh: Enable IO path on permanent error
gianfar: Do not reuse pages from emergency reserve
Btrfs: fix truncate down when no_holes feature is enabled
virtio_console: fix a crash in config_work_handler
swiotlb-xen: update dev_addr after swapping pages
xen-netfront: Fix Rx stall during network stress and OOM
scsi: virtio_scsi: Reject commands when virtqueue is broken
platform/x86: ideapad-laptop: handle ACPI event 1
amd-xgbe: Check xgbe_init() return code
net: dsa: Check return value of phy_connect_direct()
drm/amdgpu: check ring being ready before using
vfio/spapr: fail tce_iommu_attach_group() when iommu_data is null
virtio_net: fix PAGE_SIZE > 64k
vxlan: do not age static remote mac entries
ibmveth: Add a proper check for the availability of the checksum features
kernel/panic.c: add missing \n
HID: i2c-hid: Add sleep between POWER ON and RESET
scsi: lpfc: avoid double free of resource identifiers
spi: davinci: use dma_mapping_error()
mac80211: initialize SMPS field in HT capabilities
x86/mpx: Use compatible types in comparison to fix sparse error
coredump: Ensure proper size of sparse core files
swiotlb: ensure that page-sized mappings are page-aligned
s390/ctl_reg: make __ctl_load a full memory barrier
be2net: fix status check in be_cmd_pmac_add()
perf probe: Fix to show correct locations for events on modules
net/mlx4_core: Eliminate warning messages for SRQ_LIMIT under SRIOV
sctp: check af before verify address in sctp_addr_id2transport
ravb: Fix use-after-free on `ifconfig eth0 down`
jump label: fix passing kbuild_cflags when checking for asm goto support
xfrm: fix stack access out of bounds with CONFIG_XFRM_SUB_POLICY
xfrm: NULL dereference on allocation failure
xfrm: Oops on error in pfkey_msg2xfrm_state()
watchdog: bcm281xx: Fix use of uninitialized spinlock.
sched/loadavg: Avoid loadavg spikes caused by delayed NO_HZ accounting
ARM64/ACPI: Fix BAD_MADT_GICC_ENTRY() macro implementation
ARM: 8685/1: ensure memblock-limit is pmd-aligned
x86/mpx: Correctly report do_mpx_bt_fault() failures to user-space
x86/mm: Fix flush_tlb_page() on Xen
ocfs2: o2hb: revert hb threshold to keep compatible
iommu/vt-d: Don't over-free page table directories
iommu: Handle default domain attach failure
iommu/amd: Fix incorrect error handling in amd_iommu_bind_pasid()
cpufreq: s3c2416: double free on driver init error path
KVM: x86: fix emulation of RSM and IRET instructions
KVM: x86/vPMU: fix undefined shift in intel_pmu_refresh()
KVM: x86: zero base3 of unusable segments
KVM: nVMX: Fix exception injection
Linux 4.4.76
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit 6e5f32f7a43f45ee55c401c0b9585eb01f9629a8 upstream.
If we crossed a sample window while in NO_HZ we will add LOAD_FREQ to
the pending sample window time on exit, setting the next update not
one window into the future, but two.
This situation on exiting NO_HZ is described by:
this_rq->calc_load_update < jiffies < calc_load_update
In this scenario, what we should be doing is:
this_rq->calc_load_update = calc_load_update [ next window ]
But what we actually do is:
this_rq->calc_load_update = calc_load_update + LOAD_FREQ [ next+1 window ]
This has the effect of delaying load average updates for potentially
up to ~9 seconds.
This can result in huge spikes in the load average values due to
per-cpu uninterruptible task counts being out of sync when accumulated
across all CPUs.
It's safe to update the per-cpu active count if we wake between sample
windows because any load that we left in 'calc_load_idle' will have
been zero'd when the idle load was folded in calc_global_load().
This issue is easy to reproduce before,
commit 9d89c257df ("sched/fair: Rewrite runnable load and utilization average tracking")
just by forking short-lived process pipelines built from ps(1) and
grep(1) in a loop. I'm unable to reproduce the spikes after that
commit, but the bug still seems to be present from code review.
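
The difference between the two behaviours can be written down as a small model. It is a sketch under simplifying assumptions: the function names are hypothetical, jiffies wraparound is ignored (the kernel uses wrap-safe comparisons), and only the scenario described above is covered.

```c
#include <stdbool.h>

/*
 * True when a CPU leaving NO_HZ has passed its own sample window while the
 * global window is still pending: rq_update < now < global_update.
 */
static bool crossed_window_in_nohz(unsigned long rq_update, unsigned long now,
				   unsigned long global_update)
{
	return rq_update < now && now < global_update;
}

/* Behaviour described as the bug: the next update lands one window too far. */
static unsigned long next_update_before_fix(unsigned long global_update,
					    unsigned long load_freq)
{
	return global_update + load_freq;
}

/* Behaviour described as intended: sync to the pending global window. */
static unsigned long next_update_after_fix(unsigned long global_update)
{
	return global_update;
}
```

In this model the two results always differ by exactly one LOAD_FREQ period, which is the extra window the CPU waits before its counts are folded into the global load average.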
Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Fixes: commit 5167e8d ("sched/nohz: Rewrite and fix load-avg computation -- again")
Link: http://lkml.kernel.org/r/20170217120731.11868-2-matt@codeblueprint.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* refs/heads/tmp-6fc0573:
Linux 4.4.71
xfs: only return -errno or success from attr ->put_listent
xfs: in _attrlist_by_handle, copy the cursor back to userspace
xfs: fix unaligned access in xfs_btree_visit_blocks
xfs: bad assertion for delalloc an extent that start at i_size
xfs: fix indlen accounting error on partial delalloc conversion
xfs: wait on new inodes during quotaoff dquot release
xfs: update ag iterator to support wait on new inodes
xfs: support ability to wait on new inodes
xfs: fix up quotacheck buffer list error handling
xfs: prevent multi-fsb dir readahead from reading random blocks
xfs: handle array index overrun in xfs_dir2_leaf_readbuf()
xfs: fix over-copying of getbmap parameters from userspace
xfs: fix off-by-one on max nr_pages in xfs_find_get_desired_pgoff()
xfs: Fix missed holes in SEEK_HOLE implementation
mlock: fix mlock count can not decrease in race condition
mm/migrate: fix refcount handling when !hugepage_migration_supported()
drm/gma500/psb: Actually use VBT mode when it is found
slub/memcg: cure the brainless abuse of sysfs attributes
ALSA: hda - apply STAC_9200_DELL_M22 quirk for Dell Latitude D430
pcmcia: remove left-over %Z format
drm/radeon: Unbreak HPD handling for r600+
drm/radeon/ci: disable mclk switching for high refresh rates (v2)
scsi: mpt3sas: Force request partial completion alignment
HID: wacom: Have wacom_tpc_irq guard against possible NULL dereference
mmc: sdhci-iproc: suppress spurious interrupt with Multiblock read
i2c: i2c-tiny-usb: fix buffer not being DMA capable
vlan: Fix tcp checksum offloads in Q-in-Q vlans
net: phy: marvell: Limit errata to 88m1101
netem: fix skb_orphan_partial()
ipv4: add reference counting to metrics
sctp: fix ICMP processing if skb is non-linear
tcp: avoid fastopen API to be used on AF_UNSPEC
virtio-net: enable TSO/checksum offloads for Q-in-Q vlans
be2net: Fix offload features for Q-in-Q packets
ipv6: fix out of bound writes in __ip6_append_data()
bridge: start hello_timer when enabling KERNEL_STP in br_stp_start
qmi_wwan: add another Lenovo EM74xx device ID
bridge: netlink: check vlan_default_pvid range
ipv6: Check ip6_find_1stfragopt() return value properly.
ipv6: Prevent overrun when parsing v6 header options
net: Improve handling of failures on link and route dumps
tcp: eliminate negative reordering in tcp_clean_rtx_queue
sctp: do not inherit ipv6_{mc|ac|fl}_list from parent
sctp: fix src address selection if using secondary addresses for ipv6
tcp: avoid fragmenting peculiar skbs in SACK
s390/qeth: avoid null pointer dereference on OSN
s390/qeth: unbreak OSM and OSN support
s390/qeth: handle sysfs error during initialization
ipv6/dccp: do not inherit ipv6_mc_list from parent
dccp/tcp: do not inherit mc_list from parent
sparc: Fix -Wstringop-overflow warning
android: base-cfg: disable CONFIG_NFS_FS and CONFIG_NFSD
schedstats/eas: guard properly to avoid breaking non-smp schedstats users
BACKPORT: f2fs: sanity check size of nat and sit cache
FROMLIST: f2fs: sanity check checkpoint segno and blkoff
sched/tune: don't use schedtune before it is ready
sched/fair: use SCHED_CAPACITY_SCALE for energy normalization
sched/{fair,tune}: use reciprocal_value to compute boost margin
sched/tune: Initialize raw_spin_lock in boosted_groups
sched/tune: report when SchedTune has not been initialized
sched/tune: fix sched_energy_diff tracepoint
sched/tune: increase group count to 5
cpufreq/schedutil: use boosted_cpu_util for PELT to match WALT
sched/fair: Fix sched_group_energy() to support per-cpu capacity states
sched/fair: discount task contribution to find CPU with lowest utilization
sched/fair: ensure utilization signals are synchronized before use
sched/fair: remove task util from own cpu when placing waking task
trace:sched: Make util_avg in load_avg trace reflect PELT/WALT as used
sched/fair: Add eas (& cas) specific rq, sd and task stats
sched/core: Fix PELT jump to max OPP upon util increase
sched: EAS & 'single cpu per cluster'/cpu hotplug interoperability
UPSTREAM: sched/core: Fix group_entity's share update
UPSTREAM: sched/fair: Fix calc_cfs_shares() fixed point arithmetics width confusion
UPSTREAM: sched/fair: Fix incorrect task group ->load_avg
UPSTREAM: sched/fair: Fix effective_load() to consistently use smoothed load
UPSTREAM: sched/fair: Propagate asynchrous detach
UPSTREAM: sched/fair: Propagate load during synchronous attach/detach
UPSTREAM: sched/fair: Fix hierarchical order in rq->leaf_cfs_rq_list
BACKPORT: sched/fair: Factorize PELT update
UPSTREAM: sched/fair: Factorize attach/detach entity
UPSTREAM: sched/fair: Improve PELT stuff some more
UPSTREAM: sched/fair: Apply more PELT fixes
UPSTREAM: sched/fair: Fix post_init_entity_util_avg() serialization
BACKPORT: sched/fair: Initiate a new task's util avg to a bounded value
sched/fair: Simplify idle_idx handling in select_idle_sibling()
sched/fair: refactor find_best_target() for simplicity
sched/fair: Change cpu iteration order in find_best_target()
sched/core: Add first cpu w/ max/min orig capacity to root domain
sched/core: Remove remnants of commit fd5c98da1a42
sched: Remove sysctl_sched_is_big_little
sched/fair: Code !is_big_little path into select_energy_cpu_brute()
EAS: sched/fair: Re-integrate 'honor sync wakeups' into wakeup path
Fixup!: sched/fair.c: Set SchedTune specific struct energy_env.task
sched/fair: Energy-aware wake-up task placement
sched/fair: Add energy_diff dead-zone margin
sched/fair: Decommission energy_aware_wake_cpu()
sched/fair: Do not force want_affine eq. true if EAS is enabled
arm64: Set SD_ASYM_CPUCAPACITY sched_domain flag on DIE level
UPSTREAM: sched/fair: Fix incorrect comment for capacity_margin
UPSTREAM: sched/fair: Avoid pulling tasks from non-overloaded higher capacity groups
UPSTREAM: sched/fair: Add per-CPU min capacity to sched_group_capacity
UPSTREAM: sched/fair: Consider spare capacity in find_idlest_group()
UPSTREAM: sched/fair: Compute task/cpu utilization at wake-up correctly
UPSTREAM: sched/fair: Let asymmetric CPU configurations balance at wake-up
UPSTREAM: sched/core: Enable SD_BALANCE_WAKE for asymmetric capacity systems
UPSTREAM: sched/core: Pass child domain into sd_init()
UPSTREAM: sched/core: Introduce SD_ASYM_CPUCAPACITY sched_domain topology flag
UPSTREAM: sched/core: Remove unnecessary NULL-pointer check
UPSTREAM: sched/fair: Optimize find_idlest_cpu() when there is no choice
BACKPORT: sched/fair: Make the use of prev_cpu consistent in the wakeup path
UPSTREAM: sched/core: Fix power to capacity renaming in comment
Partial Revert: "WIP: sched: Add cpu capacity awareness to wakeup balancing"
Revert "WIP: sched: Consider spare cpu capacity at task wake-up"
FROM-LIST: cpufreq: schedutil: Redefine the rate_limit_us tunable
cpufreq: schedutil: add up/down frequency transition rate limits
trace/sched: add rq utilization signal for WALT
sched/cpufreq: make schedutil use WALT signal
sched: cpufreq: use rt_avg as estimate of required RT CPU capacity
cpufreq: schedutil: move slow path from workqueue to SCHED_FIFO task
BACKPORT: kthread: allow to cancel kthread work
sched/cpufreq: fix tunables for schedfreq governor
BACKPORT: cpufreq: schedutil: New governor based on scheduler utilization data
sched: backport cpufreq hooks from 4.9-rc4
ANDROID: Kconfig: add depends for UID_SYS_STATS
ANDROID: hid: uhid: implement refcount for open and close
Revert "ext4: require encryption feature for EXT4_IOC_SET_ENCRYPTION_POLICY"
ANDROID: mnt: Fix next_descendent
Conflicts:
include/trace/events/sched.h
kernel/sched/Makefile
kernel/sched/core.c
kernel/sched/fair.c
kernel/sched/sched.h
Change-Id: I55318828f2c858e192ac7015bcf2bf0ec5c5b2c5
Signed-off-by: Blagovest Kolenichev <bkolenichev@codeaurora.org>
The scheduling change (bug 31501544) to avoid putting RT threads on cores that
are handling softints was catching cases where there was no reason
to believe the softint would take a long time, resulting in unnecessary
migration overhead. This patch reduces the migration to cases where
the core has a softint that is actually likely to take a long time,
as opposed to the RCU, SCHED, and TIMER softints that are rather quick.
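
One way to express that distinction is a bitmask test on the pending softints. The sketch below is an assumption-laden model: the enum is a local copy for illustration with hypothetical `EX_` names, and it treats every softint other than TIMER, SCHED and RCU as potentially long, which the text above does not spell out.

```c
#include <stdbool.h>

/* Local softint numbers for illustration only; the kernel defines its own. */
enum {
	EX_HI_SOFTIRQ,
	EX_TIMER_SOFTIRQ,
	EX_NET_TX_SOFTIRQ,
	EX_NET_RX_SOFTIRQ,
	EX_BLOCK_SOFTIRQ,
	EX_TASKLET_SOFTIRQ,
	EX_SCHED_SOFTIRQ,
	EX_RCU_SOFTIRQ,
};

/* Softints the text calls "rather quick". */
#define EX_QUICK_SOFTIRQ_MASK ((1u << EX_TIMER_SOFTIRQ) | \
			       (1u << EX_SCHED_SOFTIRQ) | \
			       (1u << EX_RCU_SOFTIRQ))

/*
 * Returns true if the pending mask contains at least one softint outside
 * the quick set, i.e. one that may keep the core busy long enough to
 * justify placing the RT thread elsewhere.
 */
static bool cpu_has_long_softirq(unsigned int pending)
{
	return (pending & ~EX_QUICK_SOFTIRQ_MASK) != 0;
}
```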
Bug: 31752786
Change-Id: Ib4e179f1e15c736b2fdba31070494e357e9fbbe2
Git-commit: ce05770bd37b8065b61ef650108ecef2b97b148b
Git-repo: https://android.googlesource.com/kernel/msm
[pkondeti@codeaurora.org: resolved minor merge conflicts]
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
select_best_cpu() is biased towards the previous CPU's cluster: it overrides
best_cpu with best_sibling_cpu when the power cost is the same.
When the power table is configured incorrectly or the static_cpu_pwr_cost/
static_cluster_pwr_cost tunables are set to a large value,
power_cost() can return INT_MAX for all candidate CPUs. stats.min_cost
then never changes from its initial value, i.e. INT_MAX.
In that scenario we find stats.best_cpu >= 0 && stats.min_cost ==
stats.best_sibling_cpu_cost == INT_MAX && stats.best_sibling_cpu == -1,
and replace best_cpu with best_sibling_cpu, i.e. -1.
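
The failure mode can be reproduced with a small model. The struct mirrors the field names used above, but `pick_cpu_unguarded()` and `pick_cpu_guarded()` are hypothetical, and the extra validity check is only one possible way to avoid returning -1, not necessarily the change made here.

```c
#include <limits.h>

/* Subset of the selection statistics referred to above. */
struct cpu_stats {
	int best_cpu;
	int best_sibling_cpu;
	int min_cost;
	int best_sibling_cpu_cost;
};

/* Cost tie alone decides: returns -1 when no sibling was ever recorded. */
static int pick_cpu_unguarded(const struct cpu_stats *s)
{
	if (s->best_cpu >= 0 && s->min_cost == s->best_sibling_cpu_cost)
		return s->best_sibling_cpu;
	return s->best_cpu;
}

/* Assumed guard: prefer the sibling only if one was actually found. */
static int pick_cpu_guarded(const struct cpu_stats *s)
{
	if (s->best_cpu >= 0 && s->best_sibling_cpu >= 0 &&
	    s->min_cost == s->best_sibling_cpu_cost)
		return s->best_sibling_cpu;
	return s->best_cpu;
}
```

When both costs are still at their INT_MAX initial values they compare equal even though no sibling was found, which is why the tie check by itself is not sufficient.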
Change-Id: I09829e278e41daaaff959428ff50927aba29104c
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
* refs/heads/tmp-9bc4622:
Linux 4.4.70
drivers: char: mem: Check for address space wraparound with mmap()
nfsd: encoders mustn't use unitialized values in error cases
drm/edid: Add 10 bpc quirk for LGD 764 panel in HP zBook 17 G2
PCI: Freeze PME scan before suspending devices
PCI: Fix pci_mmap_fits() for HAVE_PCI_RESOURCE_TO_USER platforms
tracing/kprobes: Enforce kprobes teardown after testing
osf_wait4(): fix infoleak
genirq: Fix chained interrupt data ordering
uwb: fix device quirk on big-endian hosts
metag/uaccess: Check access_ok in strncpy_from_user
metag/uaccess: Fix access_ok()
iommu/vt-d: Flush the IOTLB to get rid of the initial kdump mappings
staging: rtl8192e: rtl92e_get_eeprom_size Fix read size of EPROM_CMD.
staging: rtl8192e: fix 2 byte alignment of register BSSIDR.
mm/huge_memory.c: respect FOLL_FORCE/FOLL_COW for thp
xc2028: Fix use-after-free bug properly
arm64: documentation: document tagged pointer stack constraints
arm64: uaccess: ensure extension of access_ok() addr
arm64: xchg: hazard against entire exchange variable
ARM: dts: at91: sama5d3_xplained: not all ADC channels are available
ARM: dts: at91: sama5d3_xplained: fix ADC vref
powerpc/64e: Fix hang when debugging programs with relocated kernel
powerpc/pseries: Fix of_node_put() underflow during DLPAR remove
powerpc/book3s/mce: Move add_taint() later in virtual mode
cx231xx-cards: fix NULL-deref at probe
cx231xx-audio: fix NULL-deref at probe
cx231xx-audio: fix init error path
dvb-frontends/cxd2841er: define symbol_rate_min/max in T/C fe-ops
zr364xx: enforce minimum size when reading header
dib0700: fix NULL-deref at probe
s5p-mfc: Fix unbalanced call to clock management
gspca: konica: add missing endpoint sanity check
ceph: fix recursion between ceph_set_acl() and __ceph_setattr()
iio: proximity: as3935: fix as3935_write
ipx: call ipxitf_put() in ioctl error path
USB: hub: fix non-SS hub-descriptor handling
USB: hub: fix SS hub-descriptor handling
USB: serial: io_ti: fix div-by-zero in set_termios
USB: serial: mct_u232: fix big-endian baud-rate handling
USB: serial: qcserial: add more Lenovo EM74xx device IDs
usb: serial: option: add Telit ME910 support
USB: iowarrior: fix info ioctl on big-endian hosts
usb: musb: tusb6010_omap: Do not reset the other direction's packet size
ttusb2: limit messages to buffer size
mceusb: fix NULL-deref at probe
usbvision: fix NULL-deref at probe
net: irda: irda-usb: fix firmware name on big-endian hosts
usb: host: xhci-mem: allocate zeroed Scratchpad Buffer
xhci: apply PME_STUCK_QUIRK and MISSING_CAS quirk for Denverton
usb: host: xhci-plat: propagate return value of platform_get_irq()
sched/fair: Initialize throttle_count for new task-groups lazily
sched/fair: Do not announce throttled next buddy in dequeue_task_fair()
fscrypt: avoid collisions when presenting long encrypted filenames
f2fs: check entire encrypted bigname when finding a dentry
fscrypt: fix context consistency check when key(s) unavailable
net: qmi_wwan: Add SIMCom 7230E
ext4 crypto: fix some error handling
ext4 crypto: don't let data integrity writebacks fail with ENOMEM
USB: serial: ftdi_sio: add Olimex ARM-USB-TINY(H) PIDs
USB: serial: ftdi_sio: fix setting latency for unprivileged users
pid_ns: Fix race between setns'ed fork() and zap_pid_ns_processes()
pid_ns: Sleep in TASK_INTERRUPTIBLE in zap_pid_ns_processes
iio: dac: ad7303: fix channel description
of: fix sparse warning in of_pci_range_parser_one
proc: Fix unbalanced hard link numbers
cdc-acm: fix possible invalid access when processing notification
drm/nouveau/tmr: handle races with hw when updating the next alarm time
drm/nouveau/tmr: avoid processing completed alarms when adding a new one
drm/nouveau/tmr: fix corruption of the pending list when rescheduling an alarm
drm/nouveau/tmr: ack interrupt before processing alarms
drm/nouveau/therm: remove ineffective workarounds for alarm bugs
drm/amdgpu: Make display watermark calculations more accurate
drm/amdgpu: Avoid overflows/divide-by-zero in latency_watermark calculations.
ath9k_htc: fix NULL-deref at probe
ath9k_htc: Add support of AirTies 1eda:2315 AR9271 device
s390/cputime: fix incorrect system time
s390/kdump: Add final note
regulator: tps65023: Fix inverted core enable logic.
KVM: X86: Fix read out-of-bounds vulnerability in kvm pio emulation
KVM: x86: Fix load damaged SSEx MXCSR register
ima: accept previously set IMA_NEW_FILE
mwifiex: pcie: fix cmd_buf use-after-free in remove/reset
rtlwifi: rtl8821ae: setup 8812ae RFE according to device type
md: update slab_cache before releasing new stripes when stripes resizing
dm space map disk: fix some book keeping in the disk space map
dm thin metadata: call precommit before saving the roots
dm bufio: make the parameter "retain_bytes" unsigned long
dm cache metadata: fail operations if fail_io mode has been established
dm bufio: check new buffer allocation watermark every 30 seconds
dm bufio: avoid a possible ABBA deadlock
dm raid: select the Kconfig option CONFIG_MD_RAID0
dm btree: fix for dm_btree_find_lowest_key()
infiniband: call ipv6 route lookup via the stub interface
tpm_crb: check for bad response size
ARM: tegra: paz00: Mark panel regulator as enabled on boot
USB: core: replace %p with %pK
char: lp: fix possible integer overflow in lp_setup()
watchdog: pcwd_usb: fix NULL-deref at probe
USB: ene_usb6250: fix DMA to the stack
usb: misc: legousbtower: Fix memory leak
usb: misc: legousbtower: Fix buffers on stack
ANDROID: uid_sys_stats: defer io stats calulation for dead tasks
ANDROID: AVB: Fix linter errors.
ANDROID: AVB: Fix invalidate_vbmeta_submit().
ANDROID: sdcardfs: Check for NULL in revalidate
Linux 4.4.69
ipmi: Fix kernel panic at ipmi_ssif_thread()
wlcore: Add RX_BA_WIN_SIZE_CHANGE_EVENT event
wlcore: Pass win_size taken from ieee80211_sta to FW
mac80211: RX BA support for sta max_rx_aggregation_subframes
mac80211: pass block ack session timeout to to driver
mac80211: pass RX aggregation window size to driver
Bluetooth: hci_intel: add missing tty-device sanity check
Bluetooth: hci_bcm: add missing tty-device sanity check
Bluetooth: Fix user channel for 32bit userspace on 64bit kernel
tty: pty: Fix ldisc flush after userspace become aware of the data already
serial: omap: suspend device on probe errors
serial: omap: fix runtime-pm handling on unbind
serial: samsung: Use right device for DMA-mapping calls
arm64: KVM: Fix decoding of Rt/Rt2 when trapping AArch32 CP accesses
padata: free correct variable
CIFS: add misssing SFM mapping for doublequote
cifs: fix CIFS_IOC_GET_MNT_INFO oops
CIFS: fix mapping of SFM_SPACE and SFM_PERIOD
SMB3: Work around mount failure when using SMB3 dialect to Macs
Set unicode flag on cifs echo request to avoid Mac error
fs/block_dev: always invalidate cleancache in invalidate_bdev()
ceph: fix memory leak in __ceph_setxattr()
fs/xattr.c: zero out memory copied to userspace in getxattr
ext4: evict inline data when writing to memory map
IB/mlx4: Reduce SRIOV multicast cleanup warning message to debug level
IB/mlx4: Fix ib device initialization error flow
IB/IPoIB: ibX: failed to create mcg debug file
IB/core: Fix sysfs registration error flow
vfio/type1: Remove locked page accounting workqueue
dm era: save spacemap metadata root after the pre-commit
crypto: algif_aead - Require setkey before accept(2)
block: fix blk_integrity_register to use template's interval_exp if not 0
KVM: arm/arm64: fix races in kvm_psci_vcpu_on
KVM: x86: fix user triggerable warning in kvm_apic_accept_events()
um: Fix PTRACE_POKEUSER on x86_64
x86, pmem: Fix cache flushing for iovec write < 8 bytes
selftests/x86/ldt_gdt_32: Work around a glibc sigaction() bug
x86/boot: Fix BSS corruption/overwrite bug in early x86 kernel startup
usb: hub: Do not attempt to autosuspend disconnected devices
usb: hub: Fix error loop seen after hub communication errors
usb: Make sure usb/phy/of gets built-in
usb: misc: add missing continue in switch
staging: comedi: jr3_pci: cope with jiffies wraparound
staging: comedi: jr3_pci: fix possible null pointer dereference
staging: gdm724x: gdm_mux: fix use-after-free on module unload
staging: vt6656: use off stack for out buffer USB transfers.
staging: vt6656: use off stack for in buffer USB transfers.
USB: Proper handling of Race Condition when two USB class drivers try to call init_usb_class simultaneously
USB: serial: ftdi_sio: add device ID for Microsemi/Arrow SF2PLUS Dev Kit
usb: host: xhci: print correct command ring address
iscsi-target: Set session_fall_back_to_erl0 when forcing reinstatement
target: Convert ACL change queue_depth se_session reference usage
target/fileio: Fix zero-length READ and WRITE handling
target: Fix compare_and_write_callback handling for non GOOD status
xen: adjust early dom0 p2m handling to xen hypervisor behavior
ANDROID: AVB: Only invalidate vbmeta when told to do so.
ANDROID: sdcardfs: Move top to its own struct
ANDROID: lowmemorykiller: account for unevictable pages
ANDROID: usb: gadget: fix NULL pointer issue in mtp_read()
ANDROID: usb: f_mtp: return error code if transfer error in receive_file_work function
Signed-off-by: Blagovest Kolenichev <bkolenichev@codeaurora.org>
Conflicts:
drivers/usb/gadget/function/f_mtp.c
fs/ext4/page-io.c
net/mac80211/agg-rx.c
Change-Id: Id65e75bf3bcee4114eb5d00730a9ef2444ad58eb
Signed-off-by: Blagovest Kolenichev <bkolenichev@codeaurora.org>
Add appropriate #ifdef guards to ensure the smp-only easstats structs
are not used when smp is not enabled. Arnd got a report from buildbot,
analysed it, and pointed out exactly what the issue was.
Reported-by: "Arnd Bergmann" <arnd@arndb.de>
Suggested-by: "Arnd Bergmann" <arnd@arndb.de>
Fixes: 4b85765a3d ("sched/fair: Add eas (& cas) specific rq, sd and task stats")
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Change-Id: I60554dea20137f6774db3f59b4afd40a06554cfc
When EAS is enabled during boot, we have to be careful not to use
schedtune from fair.c before it is ready or it will warn us and we'll
get a traceback in the console.
Change-Id: I1a5cf29b18af626545c636c51219f9ed497c19fa
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
The sched_energy_diff tracepoint is in a place where it can never trace
payoff or nrg.delta. If CONFIG_SCHED_TUNE is enabled, put it in
a place where those values exist. If it is not enabled, trace from
the current location.
Change-Id: Id5442f2b34ec76625491d27c0f4285433ca12699
Reported-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
We use 5 groups everywhere else; this should default to the same.
Change-Id: I05a20bdcf8046ea90a2e36979940cef11246e735
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
When using WALT we always used boosted cpu util for OPP selection.
This is the primary purpose for boosted cpu util, but we hadn't
changed the PELT utilization check to do the same thing.
Fix that here.
Change-Id: Id5ffb26eac23b25fe754255221f6d21b8cededfd
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
sched_group_energy() was supposed to support per-cpu capacity states
(DVFS), however, while fixing a hotplug issue this was broken as we bail
out if there is no SD_SHARE_CAP_STATES flag set.
This patch implements the hotplug race check differently and should
therefore reinstate support for per-cpu capacity states.
Change-Id: I5b865666c9ce833dcfa6514c574580d75aa0a195
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
In some cases, the new_util of a task can be the same on several
CPUs. This causes an issue because the target_util is only updated
if the current new_util is strictly smaller than target_util.
To fix that, the cpu_util_wake() return value is used alongside the
new_util value. If two CPUs compute the same new_util value,
we'll now also look at their cpu_util_wake() return value. In this
case, the CPU that last ran the task will be chosen in priority.
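A minimal sketch of the tie-break, assuming that the candidate with the lower cpu_util_wake() value wins when new_util is equal. The tuple layout, the function name and the sample numbers are hypothetical; the real code works on per-CPU scheduler state.

```python
def pick_target(candidates):
    """candidates: iterable of (cpu, new_util, wake_util). Returns a cpu or -1."""
    target_cpu, target_util, target_wake = -1, None, None
    for cpu, new_util, wake_util in candidates:
        better = (
            target_util is None
            or new_util < target_util
            # Tie on new_util: fall back to the cpu_util_wake() value (assumed: lower wins).
            or (new_util == target_util and wake_util < target_wake)
        )
        if better:
            target_cpu, target_util, target_wake = cpu, new_util, wake_util
    return target_cpu

# CPU 1 is the task's previous CPU, so its wake utilization excludes the task.
print(pick_target([(0, 300, 300), (1, 300, 220)]))
```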
Change-Id: Ia1ea2c4b3ec39621372c2f748862317d5b497723
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
wake_cap performs task and cpu utilization synchronization which is
what allows us to subtract current task util from prev_cpu util and
have a sensible number to work with.
It looks as though if wake_wide returns 0, we could potentially not
execute wake_cap, which would result in unsynced signals we then use
for energy calculations.
This is not necessarily an issue we've seen in traces, but it looks
as though it should be changed.
Change-Id: Ic54a3cba2a10d946ea20113a04371dea04115e82
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
When we place a waking task with find_best_target, we calculate the
existing and new utilisation of each candidate cpu. However, we do
not remove any blocked load resulting from the waking task on the
previous cpu which might cause unnecessary migrations.
Switch to using cpu_util_wake which does this for us, which requires
moving cpu_util_wake a few functions earlier.
Also, we have multiple potential cpu utilization signals here, so
update the necessary bits to allow WALT to work properly (including
not subtracting task util for WALT).
When WALT is in use, cpu utilization is the utilization
in the previous completed window, whilst the task utilization
ignores fully idle windows. There seems to be no way to have a
decently accurate estimate of how much (if any) utilization from
this task remains on the prev cpu.
Instead, just return cpu_util when we're using WALT.
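The behaviour described above can be sketched as follows. The plain-integer arguments are an assumption made for illustration; the kernel function takes a CPU number and a task and reads these values from the runqueue.

```python
def cpu_util_wake(cpu_util, task_util, capacity, is_prev_cpu, use_walt):
    """Utilization of a CPU as seen by a waking task (simplified model)."""
    if use_walt:
        # No reliable estimate of the task's leftover contribution under WALT.
        return cpu_util
    if not is_prev_cpu:
        # The task never contributed to this CPU, so there is nothing to remove.
        return cpu_util
    # Remove the task's blocked contribution from its previous CPU.
    util = max(cpu_util - task_util, 0)
    return min(util, capacity)

print(cpu_util_wake(500, 200, 1024, is_prev_cpu=True, use_walt=False))
print(cpu_util_wake(500, 200, 1024, is_prev_cpu=True, use_walt=True))
```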
Change-Id: I448203ab98ffb5c020dfb6b218581eef1f5601f7
Reported-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
With the ability to choose between WALT and PELT for utilisation tracking
we can have the situation where we're using WALT to make all the
decisions and reporting PELT figures in the sched_load_avg_(cpu|task)
trace points. This is not too much of an issue, but when analysing trace
it is nice to see numbers representing what the scheduler is using rather
than needing to add in additional sched_walt_* traces to figure it out.
Add reporting for both types, and make the util_avg member reflect what
will be seen from cpu or task_util functions in the scheduler.
Change-Id: I2abbd2c5fa70822096d0f3372b4c12b1c6af1590
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
For Energy-Aware Scheduling (EAS) to work properly, even in the
case that there is only one cpu per cluster or that cpus are hot-plugged
out, the Energy Model (EM) data on all energy-aware sched domains (sd)
has to be present for all online cpus.
Mainline sd hierarchy setup code will remove sd's which are not useful
for task scheduling e.g. in the following situations:
1. Only 1 cpu is/remains in one cluster of a multi cluster system.
This remaining cpu only has DIE and no MC sd.
2. A complete cluster in a two cluster system is hot-plugged out.
The cpus of the remaining cluster only have MC and no DIE sd.
To make sure that all online cpus keep all their energy-aware sd's,
the sd degenerate functionality has been changed to not free a sd if
its first sched group (sg) contains EM data in case:
1. There is only 1 cpu left in the sd.
2. There have to be at least 2 sg's if certain sd flags are set.
Instead of freeing such a sd it now clears only its SD_LOAD_BALANCE
flag. This will make sure that the EAS functionality will always see
all energy-aware sd's for all online cpus.
It will introduce a tiny performance degradation for operations on
affected cpus since the hot-path macro for_each_domain() has to deal
with sd's not contributing to task scheduling at all now.
In most cases the existing code makes sure that task scheduling is not
invoked on a sd with !SD_LOAD_BALANCE.
However, a small change is necessary in update_sd_lb_stats() to make
sure that sd->parent is only initialized to !NULL in case the parent sd
contains more than 1 sg.
The handling of newidle decay values before the SD_LOAD_BALANCE check in
rebalance_domains() stays unchanged.
Test (w/ CONFIG_SCHED_DEBUG):
JUNO r0 default system:
$ cat /proc/cpuinfo | grep "^CPU part"
CPU part : 0xd03
CPU part : 0xd07
CPU part : 0xd07
CPU part : 0xd03
CPU part : 0xd03
CPU part : 0xd03
SD names and flags:
$ cat /proc/sys/kernel/sched_domain/cpu*/domain*/name
MC
DIE
MC
DIE
MC
DIE
MC
DIE
MC
DIE
MC
DIE
$ printf "%x\n" `cat /proc/sys/kernel/sched_domain/cpu*/domain*/flags`
832f
102f
832f
102f
832f
102f
832f
102f
832f
102f
832f
102f
Test 1: Hotplug-out one A57 (CPU part 0xd07) cpu:
$ echo 0 > /sys/devices/system/cpu/cpu1/online
$ cat /proc/cpuinfo | grep "^CPU part"
CPU part : 0xd03
CPU part : 0xd07
CPU part : 0xd03
CPU part : 0xd03
CPU part : 0xd03
SD names and flags for remaining A57 (cpu2) cpu:
$ cat /proc/sys/kernel/sched_domain/cpu2/domain*/name
MC
DIE
$ printf "%x\n" `cat /proc/sys/kernel/sched_domain/cpu2/domain*/flags`
832e <-- MC SD with !SD_LOAD_BALANCE
102f
Test 2: Hotplug-out the entire A57 cluster:
$ echo 0 > /sys/devices/system/cpu/cpu1/online
$ echo 0 > /sys/devices/system/cpu/cpu2/online
$ cat /proc/cpuinfo | grep "^CPU part"
CPU part : 0xd03
CPU part : 0xd03
CPU part : 0xd03
CPU part : 0xd03
SD names and flags for the remaining A53 (CPU part 0xd03) cluster:
$ cat /proc/sys/kernel/sched_domain/cpu*/domain*/name
MC
DIE
MC
DIE
MC
DIE
MC
DIE
$ printf "%x\n" `cat /proc/sys/kernel/sched_domain/cpu*/domain*/flags`
832f
102e <-- DIE SD with !SD_LOAD_BALANCE
832f
102e
832f
102e
832f
102e
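The flag values printed in the tests can be decoded with a few lines. The only assumption is that SD_LOAD_BALANCE is bit 0, which is what the 832f -> 832e and 102f -> 102e transitions above imply; the other bits are left uninterpreted.

```python
SD_LOAD_BALANCE = 0x0001  # assumed to be bit 0, as implied by the test output

def describe(flags_hex):
    """Tell whether a sched domain flags value still takes part in load balancing."""
    flags = int(flags_hex, 16)
    return "SD_LOAD_BALANCE" if flags & SD_LOAD_BALANCE else "!SD_LOAD_BALANCE"

for value in ("832f", "832e", "102f", "102e"):
    print(value, describe(value))

# Clearing the flag on a kept-but-degenerate sd, as the patch describes.
print(format(0x832f & ~SD_LOAD_BALANCE, "x"))
```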
Change-Id: If24aa2b2628f334abbf0207d39e2a86168d9d673
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
The update of the share of a cfs_rq is done when its load_avg is updated
but before the group_entity's load_avg has been updated for the past time
slot. This generates wrong load_avg accounting which can be significant
when small tasks are involved in the scheduling.
Let's take the example of a task a that is dequeued from its task group A:
           root
          (cfs_rq)
            \
            (se)
             A
            (cfs_rq)
              \
              (se)
               a
Task "a" was the only task in task group A which becomes idle when a is
dequeued.
We have the sequence:
- dequeue_entity a->se
    - update_load_avg(a->se)
    - dequeue_entity_load_avg(A->cfs_rq, a->se)
    - update_cfs_shares(A->cfs_rq)
        A->cfs_rq->load.weight == 0
        A->se->load.weight is updated with the new share (0 in this case)
- dequeue_entity A->se
    - update_load_avg(A->se) but its weight is now null so the last time
      slot (up to a tick) will be accounted with a weight of 0 instead of
      its real weight during the time slot. The last time slot will be
      accounted as an idle one whereas it was a running one.
If the running time of task a is short enough that no tick happens when it
runs, all running time of group entity A->se will be accounted as idle
time.
Instead, we should update the share of a cfs_rq (in fact the weight of its
group entity) only after having updated the load_avg of the group_entity.
update_cfs_shares() now takes the sched_entity as a parameter instead of the
cfs_rq, and the weight of the group_entity is updated only once its load_avg
has been synced with current time.
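The effect of the ordering can be shown with a toy model. It ignores PELT decay entirely and just accumulates weight * elapsed time; the GroupEntity class and the dequeue() helper are hypothetical names for illustration.

```python
class GroupEntity:
    def __init__(self, weight):
        self.weight = weight
        self.last_update = 0
        self.weighted_time = 0  # stand-in for the load sum, without decay

    def update_load_avg(self, now):
        # Account the elapsed slot with whatever weight is set right now.
        self.weighted_time += self.weight * (now - self.last_update)
        self.last_update = now

def dequeue(se, now, new_weight, shares_first):
    if shares_first:
        # Old order: the share is updated before the slot is accounted.
        se.weight = new_weight
        se.update_load_avg(now)
    else:
        # New order: sync load_avg with the current time, then apply the share.
        se.update_load_avg(now)
        se.weight = new_weight

old, new = GroupEntity(1024), GroupEntity(1024)
dequeue(old, now=3, new_weight=0, shares_first=True)
dequeue(new, now=3, new_weight=0, shares_first=False)
print(old.weighted_time, new.weighted_time)
```

In the old order the three time units in which the group was running contribute nothing, which matches the "accounted as idle" description.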
Change-Id: Id6ce3be1767b44b444ce2a77ed1ba063e57c0664
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: pjt@google.com
Link: http://lkml.kernel.org/r/1482335426-7664-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 89ee048f3cc796db6f26906c6bef4edf0bee70fd)
[minor cherry pick stuff]
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Commit:
fde7d22e01 ("sched/fair: Fix overly small weight for interactive group entities")
did something non-obvious, and did it in a way that was buggy yet latent.
The problem was exposed for real by a later commit in the v4.7 merge window:
2159197d6677 ("sched/core: Enable increased load resolution on 64-bit kernels")
... after which tg->load_avg and cfs_rq->load.weight had different
units (10 bit fixed point and 20 bit fixed point resp.).
Add a comment to explain the use of cfs_rq->load.weight over the
'natural' cfs_rq->avg.load_avg and add scale_load_down() to correct
for the difference in unit.
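A short illustration of the unit mismatch, assuming the 10-bit shift used by scale_load()/scale_load_down() on 64-bit kernels. The sample values are made up.

```python
SCHED_FIXEDPOINT_SHIFT = 10

def scale_load(w):
    """10-bit user-visible weight -> 20-bit internal load.weight."""
    return w << SCHED_FIXEDPOINT_SHIFT

def scale_load_down(w):
    """20-bit internal load.weight -> 10-bit unit used by load_avg."""
    return w >> SCHED_FIXEDPOINT_SHIFT

load_weight = scale_load(1024)   # cfs_rq->load.weight for one nice-0 task
tg_load_avg = 2048               # tg->load_avg, already in the 10-bit unit

print(tg_load_avg + load_weight)                   # mixed units: meaningless sum
print(tg_load_avg + scale_load_down(load_weight))  # both terms in the same unit
```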
Since this is (now, as per a previous commit) the only user of
calc_tg_weight(), collapse it.
The effects of this bug should be randomly inconsistent SMP-balancing
of cgroups workloads.
Change-Id: If1e565662ea163485edd94a12aef644d0e0dfe7a
Reported-by: Jirka Hladky <jhladky@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 2159197d6677 ("sched/core: Enable increased load resolution on 64-bit kernels")
Fixes: fde7d22e01 ("sched/fair: Fix overly small weight for interactive group entities")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit ea1dc6fc6242f991656e35e2ed3d90ec1cd13418)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
A scheduler performance regression has been reported by Joseph Salisbury,
which he bisected back to:
3d30544f0212 ("sched/fair: Apply more PELT fixes")
The regression triggers when several levels of task groups are involved
(read: SystemD) and cpu_possible_mask != cpu_present_mask.
The root cause is that group entity's load (tg_child->se[i]->avg.load_avg)
is initialized to scale_load_down(se->load.weight). During the creation of
a child task group, its group entities on possible CPUs are attached to
parent's cfs_rq (tg_parent) and their loads are added to the parent's load
(tg_parent->load_avg) with update_tg_load_avg().
But only the load on online CPUs will then be updated to reflect real load,
whereas load on other CPUs will stay at the initial value.
The result is a tg_parent->load_avg that is higher than the real load. The
weight of group entities (tg_parent->se[i]->load.weight) on online CPUs is
therefore smaller than it should be, and the task group gets less running
time than it should expect.
( This situation can be detected with /proc/sched_debug. The ".tg_load_avg"
of the task group will be much higher than sum of ".tg_load_avg_contrib"
of online cfs_rqs of the task group. )
The load of group entities doesn't have to be initialized to anything other
than 0 because their load will increase when an entity is attached.
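A rough numeric sketch of the inflation. It assumes the simplified relation weight = shares * cfs_rq_load / tg_load_avg for a group entity; the CPU counts and load values are invented for illustration.

```python
def group_se_weight(tg_shares, cfs_rq_load, tg_load_avg):
    """Simplified share of the group given to one CPU's group entity."""
    return tg_shares * cfs_rq_load // tg_load_avg

tg_shares = 1024
online_contrib = [256, 256, 256, 256]   # real load on the four online CPUs
stale_contrib = [1024] * 4              # possible-but-absent CPUs stuck at the initial value

real_total = sum(online_contrib)
inflated_total = real_total + sum(stale_contrib)

print(group_se_weight(tg_shares, 256, real_total))      # expected weight
print(group_se_weight(tg_shares, 256, inflated_total))  # weight with stale load included
```

The gap between inflated_total and real_total is the same discrepancy that /proc/sched_debug exposes as ".tg_load_avg" versus the sum of ".tg_load_avg_contrib".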
Change-Id: Ie55021ff98ba49016adfddb2444e9c9709939226
Reported-by: Joseph Salisbury <joseph.salisbury@canonical.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: <stable@vger.kernel.org> # 4.8.x
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: joonwoop@codeaurora.org
Fixes: 3d30544f0212 ("sched/fair: Apply more PELT fixes")
Link: http://lkml.kernel.org/r/1476881123-10159-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit b5a9b340789b2b24c6896bcf7a065c31a4db671c)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Starting with the following commit:
fde7d22e01 ("sched/fair: Fix overly small weight for interactive group entities")
calc_tg_weight() doesn't compute the right value as expected by effective_load().
The difference is in the 'correction' term. In order to ensure \Sum
rw_j >= rw_i we cannot use tg->load_avg directly, since that might be
lagging a correction on the current cfs_rq->avg.load_avg value.
Therefore we use tg->load_avg - cfs_rq->tg_load_avg_contrib +
cfs_rq->avg.load_avg.
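The correction term can be checked with plain numbers. This is a sketch with hypothetical values; the function name merely labels the expression quoted above.

```python
def corrected_tg_load(tg_load_avg, tg_load_avg_contrib, cfs_rq_load_avg):
    """Replace this cfs_rq's stale contribution with its current load."""
    return tg_load_avg - tg_load_avg_contrib + cfs_rq_load_avg

tg_load_avg = 600          # lagging group total
tg_load_avg_contrib = 500  # what this cfs_rq last reported into the total
cfs_rq_load_avg = 900      # what this cfs_rq's load is right now

print(tg_load_avg >= cfs_rq_load_avg)   # raw total is smaller than one member
print(corrected_tg_load(tg_load_avg, tg_load_avg_contrib, cfs_rq_load_avg))
```

With the correction, the total is never smaller than the local term, which is the \Sum rw_j >= rw_i property.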
Now, per the referenced commit, calc_tg_weight() doesn't use
cfs_rq->avg.load_avg, as is later used in @w, but uses
cfs_rq->load.weight instead.
So stop using calc_tg_weight() and do it explicitly.
The effects of this bug are wake_affine() making randomly
poor choices in cgroup-intense workloads.
Change-Id: I1c0058ff674650cf295c8dc3b88a5a3de4bddab0
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@vger.kernel.org> # v4.3+
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: fde7d22e01 ("sched/fair: Fix overly small weight for interactive group entities")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 7dd4912594daf769a46744848b05bd5bc6d62469)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
A task can be asynchronously detached from cfs_rq when migrating
between CPUs. The load of the migrated task is then removed from
source cfs_rq during its next update. We use this event to set the
propagation flag.
During the load balance, we take advantage of the update of blocked
load to propagate any pending changes.
The propagation relies on patch:
"sched: Fix hierarchical order in rq->leaf_cfs_rq_list"
... which orders children and parents, to ensure that it's done in one pass.
Change-Id: I33782e35fc4711f5901e8c23d6aa7ec5f2ff7ee5
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: kernellwp@gmail.com
Cc: pjt@google.com
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1478598827-32372-6-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 4e5160766fcc9f41bbd38bac11f92dce993644aa)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
When a task moves from/to a cfs_rq, we set a flag which is then used to
propagate the change at parent level (sched_entity and cfs_rq) during
next update. If the cfs_rq is throttled, the flag will stay pending until
the cfs_rq is unthrottled.
For propagating the utilization, we copy the utilization of group cfs_rq to
the sched_entity.
For propagating the load, we have to take into account the load of the
whole task group in order to evaluate the load of the sched_entity.
Similarly to what was done before the rewrite of PELT, we add a correction
factor in case the task group's load is greater than its share so it will
contribute the same load of a task of equal weight.
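A minimal sketch of the load correction, assuming all inputs are already in the same (scaled-down) unit. The function and parameter names are hypothetical and the arithmetic is a simplification of what the paragraph describes, not a copy of the kernel code.

```python
def propagated_load(gcfs_rq_load, tg_load_avg, tg_load_avg_contrib, tg_shares):
    """Load that a group cfs_rq contributes to its sched_entity at parent level."""
    load = gcfs_rq_load
    if load:
        # Group total with this cfs_rq's stale contribution replaced by its current load.
        tg_load = max(tg_load_avg - tg_load_avg_contrib, 0) + load
        if tg_load > tg_shares:
            # The group is heavier than a task of equal weight: scale into its share.
            load = load * tg_shares // tg_load
    return load

print(propagated_load(2048, 4096, 2048, 1024))  # heavy group, scaled down
print(propagated_load(300, 300, 300, 1024))     # light group, passed through
```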
Change-Id: Id34a9888484716961c9027299c0b4d82881a39d1
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: kernellwp@gmail.com
Cc: pjt@google.com
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1478598827-32372-5-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 09a43ace1f986b003c118fdf6ddf1fd685692d49)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Fix the insertion of cfs_rq in rq->leaf_cfs_rq_list to ensure that a
child will always be called before its parent.
The hierarchical order in shares update list has been introduced by
commit:
67e86250f8 ("sched: Introduce hierarchal order on shares update list")
With the current implementation a child can be still put after its
parent.
Let's take the example of:
      root
       \
        b
        /\
        c d*
          |
          e*
with root -> b -> c already enqueued but not d -> e so the
leaf_cfs_rq_list looks like: head -> c -> b -> root -> tail
The branch d -> e will be added the first time that they are enqueued,
starting with e then d.
When e is added, its parent is not yet on the list, so e is put at
the tail: head -> c -> b -> root -> e -> tail
Then, d is added at the head because its parent is already on the
list: head -> d -> c -> b -> root -> e -> tail
e is not placed at the right position and will be called last,
whereas it should be called at the beginning.
Because the enqueue sequence is bottom-up, we are sure to finish by adding
either a cfs_rq without a parent or a cfs_rq whose parent is already on the
list. We can use this event to detect when a new branch has been fully
added. The other cfs_rqs, whose parents have not been added yet, must be
placed after their children that were inserted in the previous steps, and
after any potential parents that are already on the list. The easiest way
is to put each cfs_rq just after the last inserted one and to keep track of
it until the branch is fully added.
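The insertion rule can be modelled with an ordinary list. This is an illustrative sketch only: LeafList and last_alone are hypothetical names, and a Python list stands in for the kernel's circular linked list. It replays the example above (c, b, root already enqueued, then e and d).

```python
class LeafList:
    """Toy model of rq->leaf_cfs_rq_list; index 0 is the head."""

    def __init__(self, parent_of):
        self.parent_of = parent_of  # cfs_rq name -> parent name (None for root)
        self.order = []
        self.last_alone = None      # last node of a branch not yet connected

    def add(self, node):
        if node in self.order:
            return
        parent = self.parent_of[node]
        if parent is None:
            # No parent: goes to the tail, and the branch is complete.
            self.order.append(node)
            self.last_alone = None
        elif parent in self.order:
            # Parent already listed: insert just before it; branch is connected.
            self.order.insert(self.order.index(parent), node)
            self.last_alone = None
        else:
            # Parent not listed yet: go right after the last inserted node.
            pos = 0 if self.last_alone is None else self.order.index(self.last_alone) + 1
            self.order.insert(pos, node)
            self.last_alone = node


def children_first(order, parent_of):
    """True if every listed cfs_rq appears before its listed parent."""
    return all(
        parent_of[n] not in order or order.index(n) < order.index(parent_of[n])
        for n in order
    )


parent_of = {"root": None, "b": "root", "c": "b", "d": "b", "e": "d"}
leaf = LeafList(parent_of)
for name in ("c", "b", "root", "e", "d"):
    leaf.add(name)
print(" -> ".join(leaf.order))
```

In this model e lands at the head instead of the tail, so it is visited before d.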
Change-Id: I4fe0b8502ea628c13d14e8e5c5279bce67fb8845
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: kernellwp@gmail.com
Cc: pjt@google.com
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1478598827-32372-3-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 9c2791f936ef5fd04a118b5c284f2c9a95f4a647)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>