2241 commits

Author SHA1 Message Date
Chris Redpath
537d19226a UPSTREAM: cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely
The way the schedutil governor uses the PELT metric causes it to
underestimate the CPU utilization in some cases.

That can be easily demonstrated by running kernel compilation on
a Sandy Bridge Intel processor, running turbostat in parallel with
it and looking at the values written to the MSR_IA32_PERF_CTL
register.  Namely, the expected result would be that when all CPUs
were 100% busy, all of them would be requested to run in the maximum
P-state, but observation shows that this clearly isn't the case.
The CPUs run in the maximum P-state for a while and then are
requested to run slower and go back to the maximum P-state after
a while again.  That causes the actual frequency of the processor to
visibly oscillate below the sustainable maximum in a jittery fashion
which clearly is not desirable.

That has been attributed to CPU utilization metric updates on task
migration that cause the total utilization value for the CPU to be
reduced by the utilization of the migrated task.  If that happens,
the schedutil governor may see a CPU utilization reduction and will
attempt to reduce the CPU frequency accordingly right away.  That
may be premature, though, for example if the system is generally
busy and there are other runnable tasks waiting to be run on that
CPU already.

This is unlikely to be an issue on systems where cpufreq policies are
shared between multiple CPUs, because in those cases the policy
utilization is computed as the maximum of the CPU utilization values
over the whole policy and if that turns out to be low, reducing the
frequency for the policy most likely is a good idea anyway.  On
systems with one CPU per policy, however, it may affect performance
adversely and even lead to increased energy consumption in some cases.

On those systems it may be addressed by taking another utilization
metric into consideration, like whether or not the CPU whose
frequency is about to be reduced has been idle recently, because if
that's not the case, the CPU is likely to be busy in the near future
and its frequency should not be reduced.

To that end, use the counter of idle calls in the timekeeping code.
Namely, make the schedutil governor look at that counter for the
current CPU every time before its frequency is about to be reduced.
If the counter has not changed since the previous iteration of the
governor computations for that CPU, the CPU has been busy for all
that time and its frequency should not be decreased, so if the new
frequency would be lower than the one set previously, the governor
will skip the frequency update.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Joel Fernandes <joelaf@google.com>
(cherry picked from commit b7eaf1aab9f8bd2e49fceed77ebc66c1b5800718)
(simple CPUFREQ_RT_DL vs CPUFREQ_DL usage conflicts)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Change-Id: I531ec02c052944ee07a904dc2a25c59948ee762b
2017-07-18 18:18:46 +00:00
Chris Redpath
a8a200d83b UPSTREAM: cpufreq: schedutil: Refactor sugov_next_freq_shared()
The loop in sugov_next_freq_shared() contains an if block to skip the
loop for the current CPU. This turns out to be an unnecessary
conditional in the scheduler's hot-path for every CPU in the policy.

It would be better to drop the conditional and make the loop treat all
the CPUs in the same way. That would eliminate the need to call
sugov_iowait_boost() at the top of the routine.

To keep the code optimized to return early if the current CPU has RT/DL
flags set, move the flags check to sugov_update_shared() instead in
order to avoid the function call entirely.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit cba1dfb57b94c234728b689d9b00d4267fa1a879)
(modified for SCHED_CPUFREQ_DL vs SCHED_CPUFREQ_RT)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Change-Id: Ie046fdc8eda46821356750edd0fb6f7d077af363
2017-07-18 18:18:40 +00:00
Chris Redpath
7378c38a80 UPSTREAM: cpufreq: schedutil: Fix per-CPU structure initialization in sugov_start()
sugov_start() only initializes struct sugov_cpu per-CPU structures
for shared policies, but it should do that for single-CPU policies too.

That in particular makes the IO-wait boost mechanism work in the
cases when cpufreq policies correspond to individual CPUs.

Fixes: 21ca6d2c52f8 (cpufreq: schedutil: Add iowait boosting)
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: 4.9+ <stable@vger.kernel.org> # 4.9+
(cherry picked from commit 4296f23ed49a15d36949458adcc66ff993dee2a8)
(we use SCHED_CPUFREQ_DL instead of SCHED_CPUFREQ_RT in cpu->flags)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Change-Id: I5b837a0ee4432115d85caa1a9808ea61e1e1b07f
2017-07-18 18:18:33 +00:00
Viresh Kumar
cbaccedead UPSTREAM: cpufreq: schedutil: Pass sg_policy to get_next_freq()
get_next_freq() uses sg_cpu only to get sg_policy, which the callers of
get_next_freq() already have. Pass sg_policy instead of sg_cpu to
get_next_freq(), to make it more efficient.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit 655cb1ebff4b7918fc560502c3297af2d3c7d114)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Change-Id: Ia210058da32930a6cdb18258aa679cd1a44a747e
2017-07-18 18:18:27 +00:00
Chris Redpath
0646dd3592 UPSTREAM: cpufreq: schedutil: move cached_raw_freq to struct sugov_policy
cached_raw_freq applies to the entire cpufreq policy and not individual
CPUs. Apart from wasting per-cpu memory, it is actually wrong to keep it
in struct sugov_cpu as we may end up comparing next_freq with a stale
cached_raw_freq of a random CPU.

Move cached_raw_freq to struct sugov_policy.

Fixes: 5cbea46984d6 (cpufreq: schedutil: map raw required frequency to driver frequency)
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry-picked from 6c4f0fa643cb9e775dcc976e3db00d649468ff1d)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Change-Id: Ie91420f710819b383947f9031da9be1f3bb7f636
2017-07-18 18:18:20 +00:00
Viresh Kumar
69fc75780d UPSTREAM: cpufreq: schedutil: Rectify comment in sugov_irq_work() function
This patch rectifies a comment present in sugov_irq_work() function to
follow proper grammar.

Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit d06e622d3d9206e6a2cc45a0f9a3256da8773ff4)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Change-Id: Iaf996445d411725639d511432cc424086892a146
2017-07-18 18:18:13 +00:00
Chris Redpath
d9e7d036e7 UPSTREAM: cpufreq: schedutil: irq-work and mutex are only used in slow path
Execute the irq-work specific initialization/exit code only when the
fast path isn't available.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit 21ef57297b15a49b0c4dd4e7135c1a08e9a29a1c)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Change-Id: Icfd68f455ef71846d799fcd2d8ec6aa1bf59573e
2017-07-18 18:18:03 +00:00
Chris Redpath
ceed1eb2b4 UPSTREAM: cpufreq: schedutil: enable fast switch earlier
The fast_switch_enabled flag will be used by both sugov_policy_alloc()
and sugov_policy_free() with a later patch.

Prepare for that by moving the calls to enable and disable it to the
beginning of sugov_init() and end of sugov_exit().

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit 4a71ce4348bb61740d411822357061f8bf870f4c)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Change-Id: Ia174f423ca02d59360657ac2e77a5098ce5cf99c
2017-07-18 18:13:58 +00:00
Chris Redpath
bab9c2fbe4 UPSTREAM: cpufreq: schedutil: Avoid indented labels
Switch to the more common practice of writing labels.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit 8e2ddb03643eb9d0bc4926946d7ce0d308eef0a5)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Change-Id: Ida75c99cf3dff5cae24d3866454c83bcdb3385b9
2017-07-18 18:09:20 +00:00
Pavankumar Kondeti
f261bf42cc sched: avoid RT tasks contention during sched boost
When placement boost is active, we currently consider only the
highest-capacity cluster. If all of the active CPUs in this cluster
are busy with RT tasks, the waking task is placed on its previous
CPU, which may be running an RT task. This results in suboptimal
performance. Fix this by expanding the search to the other clusters
when no eligible CPU is found in the highest-capacity cluster.
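
The fallback search can be illustrated with a simplified model. This is a
sketch only: pick_cpu and its arguments are hypothetical, and the real
placement code weighs many more factors than whether a CPU is running an
RT task.

```python
def pick_cpu(clusters, rt_busy, prev_cpu):
    """Pick a CPU for a waking task under placement boost.

    clusters: lists of CPU ids, ordered from highest to lowest capacity.
    rt_busy:  set of CPU ids currently running RT tasks.
    prev_cpu: the task's previous CPU, used only as a last resort.
    """
    for cluster in clusters:
        for cpu in cluster:
            if cpu not in rt_busy:
                return cpu  # first eligible CPU, highest capacity first
    return prev_cpu         # every CPU is busy with RT work
```

With clusters [[4, 5, 6, 7], [0, 1, 2, 3]] and CPUs 4 to 7 all running RT
tasks, the search moves on to the second cluster instead of falling back
to the previous CPU.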

Change-Id: Iaab2e397b994c2b219dc086c7a6fa91ca26a5128
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
2017-07-14 20:57:48 +05:30
Joonwoo Park
d368c6faa1 sched: walt: fix window misalignment when HZ=300
Due to a rounding error, the hrtimer tick interval becomes 3333333 ns when
HZ=300. Consequently, the tick timestamp nearest to WALT's default window
size of 20 ms is 19999998 ns (3333333 * 6).
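
The arithmetic can be checked with a short sketch. The helper names are
hypothetical, the tick length is computed with plain integer division, and
rounding the window down to a whole number of ticks is only one plausible
way to keep windows aligned, not necessarily what this patch does.

```python
NSEC_PER_SEC = 1_000_000_000


def tick_nsec(hz):
    """Tick interval in nanoseconds, truncated to an integer."""
    return NSEC_PER_SEC // hz


def aligned_window(target_ns, hz):
    """Largest multiple of the tick interval that does not exceed target_ns."""
    tick = tick_nsec(hz)
    return (target_ns // tick) * tick
```

For HZ=300 the tick is 3333333 ns, so a 20000000 ns target spans six whole
ticks, or 19999998 ns. For HZ=100 or HZ=250 the tick divides 20 ms evenly
and the window is unchanged.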

Change-Id: I08f9bd2dbecccbb683e4490d06d8b0da703d3ab2
Suggested-by: Joel Fernandes <joelaf@google.com>
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2017-07-12 21:01:07 +00:00
Blagovest Kolenichev
4c8daae4af Merge android-4.4@64a73ff (v4.4.76) into msm-4.4
* refs/heads/tmp-64a73ff:
  Linux 4.4.76
  KVM: nVMX: Fix exception injection
  KVM: x86: zero base3 of unusable segments
  KVM: x86/vPMU: fix undefined shift in intel_pmu_refresh()
  KVM: x86: fix emulation of RSM and IRET instructions
  cpufreq: s3c2416: double free on driver init error path
  iommu/amd: Fix incorrect error handling in amd_iommu_bind_pasid()
  iommu: Handle default domain attach failure
  iommu/vt-d: Don't over-free page table directories
  ocfs2: o2hb: revert hb threshold to keep compatible
  x86/mm: Fix flush_tlb_page() on Xen
  x86/mpx: Correctly report do_mpx_bt_fault() failures to user-space
  ARM: 8685/1: ensure memblock-limit is pmd-aligned
  ARM64/ACPI: Fix BAD_MADT_GICC_ENTRY() macro implementation
  sched/loadavg: Avoid loadavg spikes caused by delayed NO_HZ accounting
  watchdog: bcm281xx: Fix use of uninitialized spinlock.
  xfrm: Oops on error in pfkey_msg2xfrm_state()
  xfrm: NULL dereference on allocation failure
  xfrm: fix stack access out of bounds with CONFIG_XFRM_SUB_POLICY
  jump label: fix passing kbuild_cflags when checking for asm goto support
  ravb: Fix use-after-free on `ifconfig eth0 down`
  sctp: check af before verify address in sctp_addr_id2transport
  net/mlx4_core: Eliminate warning messages for SRQ_LIMIT under SRIOV
  perf probe: Fix to show correct locations for events on modules
  be2net: fix status check in be_cmd_pmac_add()
  s390/ctl_reg: make __ctl_load a full memory barrier
  swiotlb: ensure that page-sized mappings are page-aligned
  coredump: Ensure proper size of sparse core files
  x86/mpx: Use compatible types in comparison to fix sparse error
  mac80211: initialize SMPS field in HT capabilities
  spi: davinci: use dma_mapping_error()
  scsi: lpfc: avoid double free of resource identifiers
  HID: i2c-hid: Add sleep between POWER ON and RESET
  kernel/panic.c: add missing \n
  ibmveth: Add a proper check for the availability of the checksum features
  vxlan: do not age static remote mac entries
  virtio_net: fix PAGE_SIZE > 64k
  vfio/spapr: fail tce_iommu_attach_group() when iommu_data is null
  drm/amdgpu: check ring being ready before using
  net: dsa: Check return value of phy_connect_direct()
  amd-xgbe: Check xgbe_init() return code
  platform/x86: ideapad-laptop: handle ACPI event 1
  scsi: virtio_scsi: Reject commands when virtqueue is broken
  xen-netfront: Fix Rx stall during network stress and OOM
  swiotlb-xen: update dev_addr after swapping pages
  virtio_console: fix a crash in config_work_handler
  Btrfs: fix truncate down when no_holes feature is enabled
  gianfar: Do not reuse pages from emergency reserve
  powerpc/eeh: Enable IO path on permanent error
  net: bgmac: Remove superflous netif_carrier_on()
  net: bgmac: Start transmit queue in bgmac_open
  net: bgmac: Fix SOF bit checking
  bgmac: Fix reversed test of build_skb() return value.
  mtd: bcm47xxpart: don't fail because of bit-flips
  bgmac: fix a missing check for build_skb
  mtd: bcm47xxpart: limit scanned flash area on BCM47XX (MIPS) only
  MIPS: ralink: fix MT7628 wled_an pinmux gpio
  MIPS: ralink: fix MT7628 pinmux typos
  MIPS: ralink: Fix invalid assignment of SoC type
  MIPS: ralink: fix USB frequency scaling
  MIPS: ralink: MT7688 pinmux fixes
  net: korina: Fix NAPI versus resources freeing
  MIPS: ath79: fix regression in PCI window initialization
  net: mvneta: Fix for_each_present_cpu usage
  ARM: dts: BCM5301X: Correct GIC_PPI interrupt flags
  qla2xxx: Fix erroneous invalid handle message
  scsi: lpfc: Set elsiocb contexts to NULL after freeing it
  scsi: sd: Fix wrong DPOFUA disable in sd_read_cache_type
  KVM: x86: fix fixing of hypercalls
  mm: numa: avoid waiting on freed migrated pages
  block: fix module reference leak on put_disk() call for cgroups throttle
  sysctl: enable strict writes
  usb: gadget: f_fs: Fix possibe deadlock
  drm/vmwgfx: Free hash table allocated by cmdbuf managed res mgr
  ALSA: hda - set input_path bitmap to zero after moving it to new place
  ALSA: hda - Fix endless loop of codec configure
  MIPS: Fix IRQ tracing & lockdep when rescheduling
  MIPS: pm-cps: Drop manual cache-line alignment of ready_count
  MIPS: Avoid accidental raw backtrace
  mm, swap_cgroup: reschedule when neeed in swap_cgroup_swapoff()
  drm/ast: Handle configuration without P2A bridge
  NFSv4: fix a reference leak caused WARNING messages
  netfilter: synproxy: fix conntrackd interaction
  netfilter: xt_TCPMSS: add more sanity tests on tcph->doff
  rtnetlink: add IFLA_GROUP to ifla_policy
  ipv6: Do not leak throw route references
  sfc: provide dummy definitions of vswitch functions
  net: 8021q: Fix one possible panic caused by BUG_ON in free_netdev
  decnet: always not take dst->__refcnt when inserting dst into hash table
  net/mlx5: Wait for FW readiness before initializing command interface
  ipv6: fix calling in6_ifa_hold incorrectly for dad work
  igmp: add a missing spin_lock_init()
  igmp: acquire pmc lock for ip_mc_clear_src()
  net: caif: Fix a sleep-in-atomic bug in cfpkt_create_pfx
  Fix an intermittent pr_emerg warning about lo becoming free.
  af_unix: Add sockaddr length checks before accessing sa_family in bind and connect handlers
  net: Zero ifla_vf_info in rtnl_fill_vfinfo()
  decnet: dn_rtmsg: Improve input length sanitization in dnrmg_receive_user_skb
  net: don't call strlen on non-terminated string in dev_set_alias()
  ipv6: release dst on error in ip6_dst_lookup_tail
  UPSTREAM: selinux: enable genfscon labeling for tracefs

Change-Id: I05ae1d6271769a99ea3817e5066f5ab6511f3254
Signed-off-by: Blagovest Kolenichev <bkolenichev@codeaurora.org>
2017-07-10 03:00:34 -07:00
Greg Kroah-Hartman
64a73ff728 This is the 4.4.76 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAllc3f0ACgkQONu9yGCS
 aT4fmA/+OHeYbhpaMRKqrUpsxB3NpROr2Z47ow6vaVjYZzd0irrODLlfIfDQ6EEo
 N3v28povu16VeYXk+4h8bsAP2K2j6/BlRaSi2hB6dmnY8GDMaXEfRojPYAlzVz50
 qnK/6152siDDarUx1h5Zc8GcmX/tEl6h3bOOxDcwLR+RvyIcWxenuR+uqRM/AV6o
 BPEiOuMu7P6LjID7KYgBTFNajVBMLrDXt4SCWdzOZmlNt0QXgKB9yw68vTcc+edC
 ZcXqa0M6nEWSDvwobbwBZhFL8H2dJjzweyjeFBgxnxgmOrRh6kvZG2wsz2c8O3/P
 g8TuMxU7siu+I3lFwKy+dgZ/1REz+6Q3oFBqXsuddrcPYu23rV6mz/GxqWy4cerb
 M4eTWz6L9vA2GoYpvBaWi0tKC9tkNM49g48Y24a6CW1O4dJWlz3RrpTiZmequbNF
 mo8EKomSXn4kYAm1xT03DGljQkK/i2JtyI5sk2hLEqqxKvZ/3q9xxLLKOVx8dPvs
 PIbfpapfYMXXMWgR6e+UKueNLgevfWE12X/OU4SgvSY4n/07/mH40XEd3zd82IsZ
 1Mw0qj3JnqCAFDBBMsDYa+OvABaGD1dHARuiv+aeqW8tqoBglFHxWqF+SQVNXLIE
 qTLiKz78vjQpH0zGpkA3HEOh/h4L7a0y3qRMECsk5SUxXsgu1gg=
 =bwNU
 -----END PGP SIGNATURE-----

Merge 4.4.76 into android-4.4

Changes in 4.4.76
	ipv6: release dst on error in ip6_dst_lookup_tail
	net: don't call strlen on non-terminated string in dev_set_alias()
	decnet: dn_rtmsg: Improve input length sanitization in dnrmg_receive_user_skb
	net: Zero ifla_vf_info in rtnl_fill_vfinfo()
	af_unix: Add sockaddr length checks before accessing sa_family in bind and connect handlers
	Fix an intermittent pr_emerg warning about lo becoming free.
	net: caif: Fix a sleep-in-atomic bug in cfpkt_create_pfx
	igmp: acquire pmc lock for ip_mc_clear_src()
	igmp: add a missing spin_lock_init()
	ipv6: fix calling in6_ifa_hold incorrectly for dad work
	net/mlx5: Wait for FW readiness before initializing command interface
	decnet: always not take dst->__refcnt when inserting dst into hash table
	net: 8021q: Fix one possible panic caused by BUG_ON in free_netdev
	sfc: provide dummy definitions of vswitch functions
	ipv6: Do not leak throw route references
	rtnetlink: add IFLA_GROUP to ifla_policy
	netfilter: xt_TCPMSS: add more sanity tests on tcph->doff
	netfilter: synproxy: fix conntrackd interaction
	NFSv4: fix a reference leak caused WARNING messages
	drm/ast: Handle configuration without P2A bridge
	mm, swap_cgroup: reschedule when neeed in swap_cgroup_swapoff()
	MIPS: Avoid accidental raw backtrace
	MIPS: pm-cps: Drop manual cache-line alignment of ready_count
	MIPS: Fix IRQ tracing & lockdep when rescheduling
	ALSA: hda - Fix endless loop of codec configure
	ALSA: hda - set input_path bitmap to zero after moving it to new place
	drm/vmwgfx: Free hash table allocated by cmdbuf managed res mgr
	usb: gadget: f_fs: Fix possibe deadlock
	sysctl: enable strict writes
	block: fix module reference leak on put_disk() call for cgroups throttle
	mm: numa: avoid waiting on freed migrated pages
	KVM: x86: fix fixing of hypercalls
	scsi: sd: Fix wrong DPOFUA disable in sd_read_cache_type
	scsi: lpfc: Set elsiocb contexts to NULL after freeing it
	qla2xxx: Fix erroneous invalid handle message
	ARM: dts: BCM5301X: Correct GIC_PPI interrupt flags
	net: mvneta: Fix for_each_present_cpu usage
	MIPS: ath79: fix regression in PCI window initialization
	net: korina: Fix NAPI versus resources freeing
	MIPS: ralink: MT7688 pinmux fixes
	MIPS: ralink: fix USB frequency scaling
	MIPS: ralink: Fix invalid assignment of SoC type
	MIPS: ralink: fix MT7628 pinmux typos
	MIPS: ralink: fix MT7628 wled_an pinmux gpio
	mtd: bcm47xxpart: limit scanned flash area on BCM47XX (MIPS) only
	bgmac: fix a missing check for build_skb
	mtd: bcm47xxpart: don't fail because of bit-flips
	bgmac: Fix reversed test of build_skb() return value.
	net: bgmac: Fix SOF bit checking
	net: bgmac: Start transmit queue in bgmac_open
	net: bgmac: Remove superflous netif_carrier_on()
	powerpc/eeh: Enable IO path on permanent error
	gianfar: Do not reuse pages from emergency reserve
	Btrfs: fix truncate down when no_holes feature is enabled
	virtio_console: fix a crash in config_work_handler
	swiotlb-xen: update dev_addr after swapping pages
	xen-netfront: Fix Rx stall during network stress and OOM
	scsi: virtio_scsi: Reject commands when virtqueue is broken
	platform/x86: ideapad-laptop: handle ACPI event 1
	amd-xgbe: Check xgbe_init() return code
	net: dsa: Check return value of phy_connect_direct()
	drm/amdgpu: check ring being ready before using
	vfio/spapr: fail tce_iommu_attach_group() when iommu_data is null
	virtio_net: fix PAGE_SIZE > 64k
	vxlan: do not age static remote mac entries
	ibmveth: Add a proper check for the availability of the checksum features
	kernel/panic.c: add missing \n
	HID: i2c-hid: Add sleep between POWER ON and RESET
	scsi: lpfc: avoid double free of resource identifiers
	spi: davinci: use dma_mapping_error()
	mac80211: initialize SMPS field in HT capabilities
	x86/mpx: Use compatible types in comparison to fix sparse error
	coredump: Ensure proper size of sparse core files
	swiotlb: ensure that page-sized mappings are page-aligned
	s390/ctl_reg: make __ctl_load a full memory barrier
	be2net: fix status check in be_cmd_pmac_add()
	perf probe: Fix to show correct locations for events on modules
	net/mlx4_core: Eliminate warning messages for SRQ_LIMIT under SRIOV
	sctp: check af before verify address in sctp_addr_id2transport
	ravb: Fix use-after-free on `ifconfig eth0 down`
	jump label: fix passing kbuild_cflags when checking for asm goto support
	xfrm: fix stack access out of bounds with CONFIG_XFRM_SUB_POLICY
	xfrm: NULL dereference on allocation failure
	xfrm: Oops on error in pfkey_msg2xfrm_state()
	watchdog: bcm281xx: Fix use of uninitialized spinlock.
	sched/loadavg: Avoid loadavg spikes caused by delayed NO_HZ accounting
	ARM64/ACPI: Fix BAD_MADT_GICC_ENTRY() macro implementation
	ARM: 8685/1: ensure memblock-limit is pmd-aligned
	x86/mpx: Correctly report do_mpx_bt_fault() failures to user-space
	x86/mm: Fix flush_tlb_page() on Xen
	ocfs2: o2hb: revert hb threshold to keep compatible
	iommu/vt-d: Don't over-free page table directories
	iommu: Handle default domain attach failure
	iommu/amd: Fix incorrect error handling in amd_iommu_bind_pasid()
	cpufreq: s3c2416: double free on driver init error path
	KVM: x86: fix emulation of RSM and IRET instructions
	KVM: x86/vPMU: fix undefined shift in intel_pmu_refresh()
	KVM: x86: zero base3 of unusable segments
	KVM: nVMX: Fix exception injection
	Linux 4.4.76

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2017-07-05 16:16:58 +02:00
Matt Fleming
6ca11db55f sched/loadavg: Avoid loadavg spikes caused by delayed NO_HZ accounting
commit 6e5f32f7a43f45ee55c401c0b9585eb01f9629a8 upstream.

If we crossed a sample window while in NO_HZ we will add LOAD_FREQ to
the pending sample window time on exit, setting the next update not
one window into the future, but two.

This situation on exiting NO_HZ is described by:

  this_rq->calc_load_update < jiffies < calc_load_update

In this scenario, what we should be doing is:

  this_rq->calc_load_update = calc_load_update		     [ next window ]

But what we actually do is:

  this_rq->calc_load_update = calc_load_update + LOAD_FREQ   [ next+1 window ]

This has the effect of delaying load average updates for potentially
up to ~9 seconds.
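
The difference between the two assignments can be modelled in a few lines.
This is a toy model of the scenario only: the function names and the HZ
value are assumptions, jiffies wraparound is ignored, and it is not the
kernel's actual NO_HZ exit code.

```python
HZ = 250                # assumed tick rate, for illustration only
LOAD_FREQ = 5 * HZ + 1  # load-average sample interval in jiffies


def next_update_before_fix(rq_update, jiffies, global_update):
    """Behaviour described as wrong: the next update lands one window late."""
    if rq_update < jiffies < global_update:
        return global_update + LOAD_FREQ
    return rq_update


def next_update_after_fix(rq_update, jiffies, global_update):
    """Intended behaviour: sync with the next global sample window."""
    if rq_update < jiffies < global_update:
        return global_update
    return rq_update
```

In the scenario above the two results differ by exactly LOAD_FREQ jiffies,
which is the extra sample window the message refers to.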

This can result in huge spikes in the load average values due to
per-cpu uninterruptible task counts being out of sync when accumulated
across all CPUs.

It's safe to update the per-cpu active count if we wake between sample
windows because any load that we left in 'calc_load_idle' will have
been zero'd when the idle load was folded in calc_global_load().

This issue is easy to reproduce before,

  commit 9d89c257df ("sched/fair: Rewrite runnable load and utilization average tracking")

just by forking short-lived process pipelines built from ps(1) and
grep(1) in a loop. I'm unable to reproduce the spikes after that
commit, but the bug still seems to be present from code review.

Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Fixes: commit 5167e8d ("sched/nohz: Rewrite and fix load-avg computation -- again")
Link: http://lkml.kernel.org/r/20170217120731.11868-2-matt@codeblueprint.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-05 14:37:21 +02:00
Linux Build Service Account
af39cfe11e Merge "sched: avoid migrating when softint on tgt cpu should be short"
2017-06-22 14:00:20 -07:00
Linux Build Service Account
4dcf7a50c5 Merge "Merge branch 'android-4.4@6fc0573' into branch 'msm-4.4'"
2017-06-22 07:40:22 -07:00
Blagovest Kolenichev
c5f247dd6d Merge branch 'android-4.4@6fc0573' into branch 'msm-4.4'
* refs/heads/tmp-6fc0573:
  Linux 4.4.71
  xfs: only return -errno or success from attr ->put_listent
  xfs: in _attrlist_by_handle, copy the cursor back to userspace
  xfs: fix unaligned access in xfs_btree_visit_blocks
  xfs: bad assertion for delalloc an extent that start at i_size
  xfs: fix indlen accounting error on partial delalloc conversion
  xfs: wait on new inodes during quotaoff dquot release
  xfs: update ag iterator to support wait on new inodes
  xfs: support ability to wait on new inodes
  xfs: fix up quotacheck buffer list error handling
  xfs: prevent multi-fsb dir readahead from reading random blocks
  xfs: handle array index overrun in xfs_dir2_leaf_readbuf()
  xfs: fix over-copying of getbmap parameters from userspace
  xfs: fix off-by-one on max nr_pages in xfs_find_get_desired_pgoff()
  xfs: Fix missed holes in SEEK_HOLE implementation
  mlock: fix mlock count can not decrease in race condition
  mm/migrate: fix refcount handling when !hugepage_migration_supported()
  drm/gma500/psb: Actually use VBT mode when it is found
  slub/memcg: cure the brainless abuse of sysfs attributes
  ALSA: hda - apply STAC_9200_DELL_M22 quirk for Dell Latitude D430
  pcmcia: remove left-over %Z format
  drm/radeon: Unbreak HPD handling for r600+
  drm/radeon/ci: disable mclk switching for high refresh rates (v2)
  scsi: mpt3sas: Force request partial completion alignment
  HID: wacom: Have wacom_tpc_irq guard against possible NULL dereference
  mmc: sdhci-iproc: suppress spurious interrupt with Multiblock read
  i2c: i2c-tiny-usb: fix buffer not being DMA capable
  vlan: Fix tcp checksum offloads in Q-in-Q vlans
  net: phy: marvell: Limit errata to 88m1101
  netem: fix skb_orphan_partial()
  ipv4: add reference counting to metrics
  sctp: fix ICMP processing if skb is non-linear
  tcp: avoid fastopen API to be used on AF_UNSPEC
  virtio-net: enable TSO/checksum offloads for Q-in-Q vlans
  be2net: Fix offload features for Q-in-Q packets
  ipv6: fix out of bound writes in __ip6_append_data()
  bridge: start hello_timer when enabling KERNEL_STP in br_stp_start
  qmi_wwan: add another Lenovo EM74xx device ID
  bridge: netlink: check vlan_default_pvid range
  ipv6: Check ip6_find_1stfragopt() return value properly.
  ipv6: Prevent overrun when parsing v6 header options
  net: Improve handling of failures on link and route dumps
  tcp: eliminate negative reordering in tcp_clean_rtx_queue
  sctp: do not inherit ipv6_{mc|ac|fl}_list from parent
  sctp: fix src address selection if using secondary addresses for ipv6
  tcp: avoid fragmenting peculiar skbs in SACK
  s390/qeth: avoid null pointer dereference on OSN
  s390/qeth: unbreak OSM and OSN support
  s390/qeth: handle sysfs error during initialization
  ipv6/dccp: do not inherit ipv6_mc_list from parent
  dccp/tcp: do not inherit mc_list from parent
  sparc: Fix -Wstringop-overflow warning
  android: base-cfg: disable CONFIG_NFS_FS and CONFIG_NFSD
  schedstats/eas: guard properly to avoid breaking non-smp schedstats users
  BACKPORT: f2fs: sanity check size of nat and sit cache
  FROMLIST: f2fs: sanity check checkpoint segno and blkoff
  sched/tune: don't use schedtune before it is ready
  sched/fair: use SCHED_CAPACITY_SCALE for energy normalization
  sched/{fair,tune}: use reciprocal_value to compute boost margin
  sched/tune: Initialize raw_spin_lock in boosted_groups
  sched/tune: report when SchedTune has not been initialized
  sched/tune: fix sched_energy_diff tracepoint
  sched/tune: increase group count to 5
  cpufreq/schedutil: use boosted_cpu_util for PELT to match WALT
  sched/fair: Fix sched_group_energy() to support per-cpu capacity states
  sched/fair: discount task contribution to find CPU with lowest utilization
  sched/fair: ensure utilization signals are synchronized before use
  sched/fair: remove task util from own cpu when placing waking task
  trace:sched: Make util_avg in load_avg trace reflect PELT/WALT as used
  sched/fair: Add eas (& cas) specific rq, sd and task stats
  sched/core: Fix PELT jump to max OPP upon util increase
  sched: EAS & 'single cpu per cluster'/cpu hotplug interoperability
  UPSTREAM: sched/core: Fix group_entity's share update
  UPSTREAM: sched/fair: Fix calc_cfs_shares() fixed point arithmetics width confusion
  UPSTREAM: sched/fair: Fix incorrect task group ->load_avg
  UPSTREAM: sched/fair: Fix effective_load() to consistently use smoothed load
  UPSTREAM: sched/fair: Propagate asynchrous detach
  UPSTREAM: sched/fair: Propagate load during synchronous attach/detach
  UPSTREAM: sched/fair: Fix hierarchical order in rq->leaf_cfs_rq_list
  BACKPORT: sched/fair: Factorize PELT update
  UPSTREAM: sched/fair: Factorize attach/detach entity
  UPSTREAM: sched/fair: Improve PELT stuff some more
  UPSTREAM: sched/fair: Apply more PELT fixes
  UPSTREAM: sched/fair: Fix post_init_entity_util_avg() serialization
  BACKPORT: sched/fair: Initiate a new task's util avg to a bounded value
  sched/fair: Simplify idle_idx handling in select_idle_sibling()
  sched/fair: refactor find_best_target() for simplicity
  sched/fair: Change cpu iteration order in find_best_target()
  sched/core: Add first cpu w/ max/min orig capacity to root domain
  sched/core: Remove remnants of commit fd5c98da1a42
  sched: Remove sysctl_sched_is_big_little
  sched/fair: Code !is_big_little path into select_energy_cpu_brute()
  EAS: sched/fair: Re-integrate 'honor sync wakeups' into wakeup path
  Fixup!: sched/fair.c: Set SchedTune specific struct energy_env.task
  sched/fair: Energy-aware wake-up task placement
  sched/fair: Add energy_diff dead-zone margin
  sched/fair: Decommission energy_aware_wake_cpu()
  sched/fair: Do not force want_affine eq. true if EAS is enabled
  arm64: Set SD_ASYM_CPUCAPACITY sched_domain flag on DIE level
  UPSTREAM: sched/fair: Fix incorrect comment for capacity_margin
  UPSTREAM: sched/fair: Avoid pulling tasks from non-overloaded higher capacity groups
  UPSTREAM: sched/fair: Add per-CPU min capacity to sched_group_capacity
  UPSTREAM: sched/fair: Consider spare capacity in find_idlest_group()
  UPSTREAM: sched/fair: Compute task/cpu utilization at wake-up correctly
  UPSTREAM: sched/fair: Let asymmetric CPU configurations balance at wake-up
  UPSTREAM: sched/core: Enable SD_BALANCE_WAKE for asymmetric capacity systems
  UPSTREAM: sched/core: Pass child domain into sd_init()
  UPSTREAM: sched/core: Introduce SD_ASYM_CPUCAPACITY sched_domain topology flag
  UPSTREAM: sched/core: Remove unnecessary NULL-pointer check
  UPSTREAM: sched/fair: Optimize find_idlest_cpu() when there is no choice
  BACKPORT: sched/fair: Make the use of prev_cpu consistent in the wakeup path
  UPSTREAM: sched/core: Fix power to capacity renaming in comment
  Partial Revert: "WIP: sched: Add cpu capacity awareness to wakeup balancing"
  Revert "WIP: sched: Consider spare cpu capacity at task wake-up"
  FROM-LIST: cpufreq: schedutil: Redefine the rate_limit_us tunable
  cpufreq: schedutil: add up/down frequency transition rate limits
  trace/sched: add rq utilization signal for WALT
  sched/cpufreq: make schedutil use WALT signal
  sched: cpufreq: use rt_avg as estimate of required RT CPU capacity
  cpufreq: schedutil: move slow path from workqueue to SCHED_FIFO task
  BACKPORT: kthread: allow to cancel kthread work
  sched/cpufreq: fix tunables for schedfreq governor
  BACKPORT: cpufreq: schedutil: New governor based on scheduler utilization data
  sched: backport cpufreq hooks from 4.9-rc4
  ANDROID: Kconfig: add depends for UID_SYS_STATS
  ANDROID: hid: uhid: implement refcount for open and close
  Revert "ext4: require encryption feature for EXT4_IOC_SET_ENCRYPTION_POLICY"
  ANDROID: mnt: Fix next_descendent

Conflicts:
	include/trace/events/sched.h
	kernel/sched/Makefile
	kernel/sched/core.c
	kernel/sched/fair.c
	kernel/sched/sched.h

Change-Id: I55318828f2c858e192ac7015bcf2bf0ec5c5b2c5
Signed-off-by: Blagovest Kolenichev <bkolenichev@codeaurora.org>
2017-06-19 16:59:55 -07:00
Linux Build Service Account
3d12c58f77 Merge "sched: Fix the bug in select_best_cpu() that returns -1 as target_cpu" 2017-06-09 02:45:19 -07:00
John Dias
25e8ecf9da sched: avoid migrating when softint on tgt cpu should be short
The scheduling change (bug 31501544) to avoid putting RT threads on cores that
are handling softint's was catching cases where there was no reason
to believe the softint would take a long time, resulting in unnecessary
migration overhead. This patch reduces the migration to cases where
the core has a softint that is actually likely to take a long time,
as opposed to the RCU, SCHED, and TIMER softints that are rather quick.
Bug: 31752786
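
As a rough illustration of the filtering described above (not the patch itself), the sketch below classifies a pending-softirq bitmask. The bit numbering follows the softirq enum order of Linux 4.4; `QUICK_MASK`, `LONG_MASK` and `should_avoid_cpu` are hypothetical names, and treating every type other than RCU, SCHED and TIMER as potentially long is an assumption of this sketch.

```python
# Softirq numbering in the order used by Linux 4.4 (illustrative).
SOFTIRQS = ["HI", "TIMER", "NET_TX", "NET_RX", "BLOCK",
            "BLOCK_IOPOLL", "TASKLET", "SCHED", "HRTIMER", "RCU"]
BIT = {name: 1 << i for i, name in enumerate(SOFTIRQS)}

# The commit message names these three as quick.
QUICK_MASK = BIT["RCU"] | BIT["SCHED"] | BIT["TIMER"]
# Assumption for this sketch: every other type may run long.
LONG_MASK = sum(BIT.values()) & ~QUICK_MASK


def should_avoid_cpu(pending_softirqs: int) -> bool:
    """Return True if an RT task should be steered away from this CPU."""
    return bool(pending_softirqs & LONG_MASK)
```

With this split, a CPU that only has a TIMER softirq pending is still a valid target, while one with NET_RX pending is avoided.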

Change-Id: Ib4e179f1e15c736b2fdba31070494e357e9fbbe2
Git-commit: ce05770bd37b8065b61ef650108ecef2b97b148b
Git-repo: https://android.googlesource.com/kernel/msm
[pkondeti@codeaurora.org: resolved minor merge conflicts]
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
2017-06-09 15:14:07 +05:30
John Dias
c3544e35ef sched: avoid scheduling RT threads on cores currently handling softirqs
Bug: 31501544
Change-Id: I99dd7aaa12c11270b28dbabea484bcc8fb8ba0c1
Git-commit: 080ea011fd9f47315e1fc53185872ef813b59d00
Git-repo: https://android.googlesource.com/kernel/msm
[pkondeti@codeaurora.org: resolved minor merge conflicts and fixed
checkpatch warnings]
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
2017-06-09 15:14:07 +05:30
Linux Build Service Account
bc22546551 Merge "Merge branch 'android-4.4@9bc4622' into branch 'msm-4.4'" 2017-06-08 19:03:18 -07:00
Pavankumar Kondeti
a761ae8501 sched: Fix the bug in select_best_cpu() that returns -1 as target_cpu
select_best_cpu() has a bias toward the previous CPU's cluster, which
overrides best_cpu with best_sibling_cpu when the power cost is the same.
When the power table is configured incorrectly or the static_cpu_pwr_cost/
static_cluster_pwr_cost tunables are set to a large value, the
power_cost() for all candidate CPUs can return INT_MAX. So
stats.min_cost is never changed from its initial value, i.e. INT_MAX.

In the above scenario, we find stats.best_cpu >= 0 && stats.min_cost ==
stats.best_sibling_cpu_cost == INT_MAX && stats.best_sibling_cpu == -1
and replace best_cpu with best_sibling_cpu, i.e. -1.
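
The failure mode can be reproduced with a small model. This is a sketch under assumptions, not the kernel code: `pick_target` and its `guard` flag are hypothetical, and the guard (checking that the sibling CPU is valid before applying the bias) is only one plausible shape for the fix.

```python
INT_MAX = 2**31 - 1


def pick_target(best_cpu, min_cost, best_sibling_cpu,
                best_sibling_cpu_cost, guard):
    """Apply the cluster bias; 'guard' toggles the hypothetical validity check."""
    if best_cpu >= 0 and min_cost == best_sibling_cpu_cost:
        # Cluster bias: prefer the sibling CPU when the power cost ties.
        if not guard or best_sibling_cpu >= 0:
            return best_sibling_cpu
    return best_cpu


# Every power_cost() returned INT_MAX, so no sibling was ever recorded.
buggy = pick_target(2, INT_MAX, -1, INT_MAX, guard=False)
fixed = pick_target(2, INT_MAX, -1, INT_MAX, guard=True)
```

Without the guard the function hands back the uninitialized sibling (-1); with it, the valid best_cpu is kept.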

Change-Id: I09829e278e41daaaff959428ff50927aba29104c
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
2017-06-09 06:40:02 +05:30
Blagovest Kolenichev
2025064255 Merge branch 'android-4.4@9bc4622' into branch 'msm-4.4'
* refs/heads/tmp-9bc4622:
  Linux 4.4.70
  drivers: char: mem: Check for address space wraparound with mmap()
  nfsd: encoders mustn't use unitialized values in error cases
  drm/edid: Add 10 bpc quirk for LGD 764 panel in HP zBook 17 G2
  PCI: Freeze PME scan before suspending devices
  PCI: Fix pci_mmap_fits() for HAVE_PCI_RESOURCE_TO_USER platforms
  tracing/kprobes: Enforce kprobes teardown after testing
  osf_wait4(): fix infoleak
  genirq: Fix chained interrupt data ordering
  uwb: fix device quirk on big-endian hosts
  metag/uaccess: Check access_ok in strncpy_from_user
  metag/uaccess: Fix access_ok()
  iommu/vt-d: Flush the IOTLB to get rid of the initial kdump mappings
  staging: rtl8192e: rtl92e_get_eeprom_size Fix read size of EPROM_CMD.
  staging: rtl8192e: fix 2 byte alignment of register BSSIDR.
  mm/huge_memory.c: respect FOLL_FORCE/FOLL_COW for thp
  xc2028: Fix use-after-free bug properly
  arm64: documentation: document tagged pointer stack constraints
  arm64: uaccess: ensure extension of access_ok() addr
  arm64: xchg: hazard against entire exchange variable
  ARM: dts: at91: sama5d3_xplained: not all ADC channels are available
  ARM: dts: at91: sama5d3_xplained: fix ADC vref
  powerpc/64e: Fix hang when debugging programs with relocated kernel
  powerpc/pseries: Fix of_node_put() underflow during DLPAR remove
  powerpc/book3s/mce: Move add_taint() later in virtual mode
  cx231xx-cards: fix NULL-deref at probe
  cx231xx-audio: fix NULL-deref at probe
  cx231xx-audio: fix init error path
  dvb-frontends/cxd2841er: define symbol_rate_min/max in T/C fe-ops
  zr364xx: enforce minimum size when reading header
  dib0700: fix NULL-deref at probe
  s5p-mfc: Fix unbalanced call to clock management
  gspca: konica: add missing endpoint sanity check
  ceph: fix recursion between ceph_set_acl() and __ceph_setattr()
  iio: proximity: as3935: fix as3935_write
  ipx: call ipxitf_put() in ioctl error path
  USB: hub: fix non-SS hub-descriptor handling
  USB: hub: fix SS hub-descriptor handling
  USB: serial: io_ti: fix div-by-zero in set_termios
  USB: serial: mct_u232: fix big-endian baud-rate handling
  USB: serial: qcserial: add more Lenovo EM74xx device IDs
  usb: serial: option: add Telit ME910 support
  USB: iowarrior: fix info ioctl on big-endian hosts
  usb: musb: tusb6010_omap: Do not reset the other direction's packet size
  ttusb2: limit messages to buffer size
  mceusb: fix NULL-deref at probe
  usbvision: fix NULL-deref at probe
  net: irda: irda-usb: fix firmware name on big-endian hosts
  usb: host: xhci-mem: allocate zeroed Scratchpad Buffer
  xhci: apply PME_STUCK_QUIRK and MISSING_CAS quirk for Denverton
  usb: host: xhci-plat: propagate return value of platform_get_irq()
  sched/fair: Initialize throttle_count for new task-groups lazily
  sched/fair: Do not announce throttled next buddy in dequeue_task_fair()
  fscrypt: avoid collisions when presenting long encrypted filenames
  f2fs: check entire encrypted bigname when finding a dentry
  fscrypt: fix context consistency check when key(s) unavailable
  net: qmi_wwan: Add SIMCom 7230E
  ext4 crypto: fix some error handling
  ext4 crypto: don't let data integrity writebacks fail with ENOMEM
  USB: serial: ftdi_sio: add Olimex ARM-USB-TINY(H) PIDs
  USB: serial: ftdi_sio: fix setting latency for unprivileged users
  pid_ns: Fix race between setns'ed fork() and zap_pid_ns_processes()
  pid_ns: Sleep in TASK_INTERRUPTIBLE in zap_pid_ns_processes
  iio: dac: ad7303: fix channel description
  of: fix sparse warning in of_pci_range_parser_one
  proc: Fix unbalanced hard link numbers
  cdc-acm: fix possible invalid access when processing notification
  drm/nouveau/tmr: handle races with hw when updating the next alarm time
  drm/nouveau/tmr: avoid processing completed alarms when adding a new one
  drm/nouveau/tmr: fix corruption of the pending list when rescheduling an alarm
  drm/nouveau/tmr: ack interrupt before processing alarms
  drm/nouveau/therm: remove ineffective workarounds for alarm bugs
  drm/amdgpu: Make display watermark calculations more accurate
  drm/amdgpu: Avoid overflows/divide-by-zero in latency_watermark calculations.
  ath9k_htc: fix NULL-deref at probe
  ath9k_htc: Add support of AirTies 1eda:2315 AR9271 device
  s390/cputime: fix incorrect system time
  s390/kdump: Add final note
  regulator: tps65023: Fix inverted core enable logic.
  KVM: X86: Fix read out-of-bounds vulnerability in kvm pio emulation
  KVM: x86: Fix load damaged SSEx MXCSR register
  ima: accept previously set IMA_NEW_FILE
  mwifiex: pcie: fix cmd_buf use-after-free in remove/reset
  rtlwifi: rtl8821ae: setup 8812ae RFE according to device type
  md: update slab_cache before releasing new stripes when stripes resizing
  dm space map disk: fix some book keeping in the disk space map
  dm thin metadata: call precommit before saving the roots
  dm bufio: make the parameter "retain_bytes" unsigned long
  dm cache metadata: fail operations if fail_io mode has been established
  dm bufio: check new buffer allocation watermark every 30 seconds
  dm bufio: avoid a possible ABBA deadlock
  dm raid: select the Kconfig option CONFIG_MD_RAID0
  dm btree: fix for dm_btree_find_lowest_key()
  infiniband: call ipv6 route lookup via the stub interface
  tpm_crb: check for bad response size
  ARM: tegra: paz00: Mark panel regulator as enabled on boot
  USB: core: replace %p with %pK
  char: lp: fix possible integer overflow in lp_setup()
  watchdog: pcwd_usb: fix NULL-deref at probe
  USB: ene_usb6250: fix DMA to the stack
  usb: misc: legousbtower: Fix memory leak
  usb: misc: legousbtower: Fix buffers on stack
  ANDROID: uid_sys_stats: defer io stats calulation for dead tasks
  ANDROID: AVB: Fix linter errors.
  ANDROID: AVB: Fix invalidate_vbmeta_submit().
  ANDROID: sdcardfs: Check for NULL in revalidate
  Linux 4.4.69
  ipmi: Fix kernel panic at ipmi_ssif_thread()
  wlcore: Add RX_BA_WIN_SIZE_CHANGE_EVENT event
  wlcore: Pass win_size taken from ieee80211_sta to FW
  mac80211: RX BA support for sta max_rx_aggregation_subframes
  mac80211: pass block ack session timeout to to driver
  mac80211: pass RX aggregation window size to driver
  Bluetooth: hci_intel: add missing tty-device sanity check
  Bluetooth: hci_bcm: add missing tty-device sanity check
  Bluetooth: Fix user channel for 32bit userspace on 64bit kernel
  tty: pty: Fix ldisc flush after userspace become aware of the data already
  serial: omap: suspend device on probe errors
  serial: omap: fix runtime-pm handling on unbind
  serial: samsung: Use right device for DMA-mapping calls
  arm64: KVM: Fix decoding of Rt/Rt2 when trapping AArch32 CP accesses
  padata: free correct variable
  CIFS: add misssing SFM mapping for doublequote
  cifs: fix CIFS_IOC_GET_MNT_INFO oops
  CIFS: fix mapping of SFM_SPACE and SFM_PERIOD
  SMB3: Work around mount failure when using SMB3 dialect to Macs
  Set unicode flag on cifs echo request to avoid Mac error
  fs/block_dev: always invalidate cleancache in invalidate_bdev()
  ceph: fix memory leak in __ceph_setxattr()
  fs/xattr.c: zero out memory copied to userspace in getxattr
  ext4: evict inline data when writing to memory map
  IB/mlx4: Reduce SRIOV multicast cleanup warning message to debug level
  IB/mlx4: Fix ib device initialization error flow
  IB/IPoIB: ibX: failed to create mcg debug file
  IB/core: Fix sysfs registration error flow
  vfio/type1: Remove locked page accounting workqueue
  dm era: save spacemap metadata root after the pre-commit
  crypto: algif_aead - Require setkey before accept(2)
  block: fix blk_integrity_register to use template's interval_exp if not 0
  KVM: arm/arm64: fix races in kvm_psci_vcpu_on
  KVM: x86: fix user triggerable warning in kvm_apic_accept_events()
  um: Fix PTRACE_POKEUSER on x86_64
  x86, pmem: Fix cache flushing for iovec write < 8 bytes
  selftests/x86/ldt_gdt_32: Work around a glibc sigaction() bug
  x86/boot: Fix BSS corruption/overwrite bug in early x86 kernel startup
  usb: hub: Do not attempt to autosuspend disconnected devices
  usb: hub: Fix error loop seen after hub communication errors
  usb: Make sure usb/phy/of gets built-in
  usb: misc: add missing continue in switch
  staging: comedi: jr3_pci: cope with jiffies wraparound
  staging: comedi: jr3_pci: fix possible null pointer dereference
  staging: gdm724x: gdm_mux: fix use-after-free on module unload
  staging: vt6656: use off stack for out buffer USB transfers.
  staging: vt6656: use off stack for in buffer USB transfers.
  USB: Proper handling of Race Condition when two USB class drivers try to call init_usb_class simultaneously
  USB: serial: ftdi_sio: add device ID for Microsemi/Arrow SF2PLUS Dev Kit
  usb: host: xhci: print correct command ring address
  iscsi-target: Set session_fall_back_to_erl0 when forcing reinstatement
  target: Convert ACL change queue_depth se_session reference usage
  target/fileio: Fix zero-length READ and WRITE handling
  target: Fix compare_and_write_callback handling for non GOOD status
  xen: adjust early dom0 p2m handling to xen hypervisor behavior
  ANDROID: AVB: Only invalidate vbmeta when told to do so.
  ANDROID: sdcardfs: Move top to its own struct
  ANDROID: lowmemorykiller: account for unevictable pages
  ANDROID: usb: gadget: fix NULL pointer issue in mtp_read()
  ANDROID: usb: f_mtp: return error code if transfer error in receive_file_work function

Signed-off-by: Blagovest Kolenichev <bkolenichev@codeaurora.org>

Conflicts:
	drivers/usb/gadget/function/f_mtp.c
	fs/ext4/page-io.c
	net/mac80211/agg-rx.c

Change-Id: Id65e75bf3bcee4114eb5d00730a9ef2444ad58eb
Signed-off-by: Blagovest Kolenichev <bkolenichev@codeaurora.org>
2017-06-07 09:31:32 -07:00
Linux Build Service Account
1d5844ba9d Merge "sched: hmp: Optimize cycle counter reads" 2017-06-06 13:21:50 -07:00
Linux Build Service Account
6ed51e0bab Merge "sched: Don't active migrate tasks to CPUs in the same cluster" 2017-06-06 13:21:48 -07:00
Linux Build Service Account
0d1b465cb8 Merge "sched: Fix load tracking bug to avoid adding phantom task demand" 2017-06-06 13:21:39 -07:00
Chris Redpath
fce0ecf04a schedstats/eas: guard properly to avoid breaking non-smp schedstats users
Add appropriate #ifdef guards to ensure the smp-only easstats structs
are not used when smp is not enabled. Arnd got a report from buildbot,
analysed it, and pointed out exactly what the issue was.

Reported-by: "Arnd Bergmann" <arnd@arndb.de>
Suggested-by: "Arnd Bergmann" <arnd@arndb.de>
Fixes: 4b85765a3d ("sched/fair: Add eas (& cas) specific rq, sd and task stats")
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Change-Id: I60554dea20137f6774db3f59b4afd40a06554cfc
2017-06-03 15:03:03 +01:00
Chris Redpath
c47d00b57b sched/tune: don't use schedtune before it is ready
When EAS is enabled during boot, we have to be careful not to use
schedtune from fair.c before it is ready or it will warn us and we'll
get a traceback in the console.

Change-Id: I1a5cf29b18af626545c636c51219f9ed497c19fa
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2017-06-02 08:01:55 -07:00
Patrick Bellasi
9e3c04bef7 sched/fair: use SCHED_CAPACITY_SCALE for energy normalization
Change-Id: I686d26975f4a7dd830ff8441ff986e35461a7d55
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Srinath Sridharan <srinathsr@google.com>
2017-06-02 08:01:55 -07:00
Patrick Bellasi
7b8577d94c sched/{fair,tune}: use reciprocal_value to compute boost margin
Change-Id: I493b07360c46eee0b72c2a046dab9ec6cb3427ef
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Srinath Sridharan <srinathsr@google.com>
2017-06-02 08:01:55 -07:00
Srinath Sridharan
41d9288e3e sched/tune: Initialize raw_spin_lock in boosted_groups
bug: 32668852
Change-Id: Ice96230d88939d5973b1b6310085d1b3df9c47d9
Signed-off-by: Srinath Sridharan <srinathsr@google.com>
2017-06-02 08:01:55 -07:00
Patrick Bellasi
3757f95741 sched/tune: report when SchedTune has not been initialized
Change-Id: Iba4e5e3d220451f04272d555e6b8e0af83a7f09d
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Srinath Sridharan <srinathsr@google.com>
2017-06-02 08:01:55 -07:00
Chris Redpath
f9b83b3e6e sched/tune: fix sched_energy_diff tracepoint
sched_energy_diff tracepoint is in a place where it can never trace
payoff or nrg.delta. If CONFIG_SCHED_TUNE is enabled, put it in
a place where those values exist. If it is not enabled, trace from
the current location.

Change-Id: Id5442f2b34ec76625491d27c0f4285433ca12699
Reported-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2017-06-02 08:01:55 -07:00
Chris Redpath
2e829cf17f sched/tune: increase group count to 5
We use 5 groups everywhere else, this should default to the same.

Change-Id: I05a20bdcf8046ea90a2e36979940cef11246e735
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
2017-06-02 08:01:55 -07:00
Chris Redpath
4c031f0e6f cpufreq/schedutil: use boosted_cpu_util for PELT to match WALT
When using WALT we always used boosted cpu util for OPP selection.
This is the primary purpose for boosted cpu util, but we hadn't
changed the PELT utilization check to do the same thing.

Fix that here.

Change-Id: Id5ffb26eac23b25fe754255221f6d21b8cededfd
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2017-06-02 08:01:55 -07:00
Morten Rasmussen
fc969e3bfa sched/fair: Fix sched_group_energy() to support per-cpu capacity states
sched_group_energy() was supposed to support per-cpu capacity states
(DVFS), however, while fixing a hotplug issue this was broken as we bail
out if there is no SD_SHARE_CAP_STATES flag set.

This patch implements the hotplug race check differently and should
therefore reinstate support for per-cpu capacity states.

Change-Id: I5b865666c9ce833dcfa6514c574580d75aa0a195
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
2017-06-02 08:01:55 -07:00
Valentin Schneider
fef0112a63 sched/fair: discount task contribution to find CPU with lowest utilization
In some cases, the new_util of a task can be the same on several
CPUs. This causes an issue because the target_util is only updated
if the current new_util is strictly smaller than target_util.

To fix that, the cpu_util_wake() return value is used alongside the
new_util value. If two CPUs compute the same new_util value,
we'll now also look at their cpu_util_wake() return value. In this
case, the CPU that last ran the task will be chosen in priority.
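
A minimal sketch of the tie-break, assuming each candidate is described by a hypothetical `(cpu, new_util, wake_util)` tuple where `wake_util` stands in for the cpu_util_wake() return value. The real selection loop has more conditions; only the comparison is modelled here.

```python
def pick_lowest_util_cpu(candidates):
    """Pick the CPU with the lowest new_util, breaking ties on wake_util."""
    best_key, best_cpu = None, None
    for cpu, new_util, wake_util in candidates:
        key = (new_util, wake_util)  # lexicographic: new_util first
        if best_key is None or key < best_key:
            best_key, best_cpu = key, cpu
    return best_cpu


# CPU 1 last ran the task, so its wake_util excludes the task's own share.
chosen = pick_lowest_util_cpu([(0, 300, 300), (1, 300, 120), (2, 450, 100)])
```

With a strict comparison on new_util alone, CPU 0 would win simply by being visited first; the secondary key lets CPU 1 take priority.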

Change-Id: Ia1ea2c4b3ec39621372c2f748862317d5b497723
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
2017-06-02 08:01:54 -07:00
Chris Redpath
83f462daa3 sched/fair: ensure utilization signals are synchronized before use
wake_cap performs task and cpu utilization synchronization which is
what allows us to subtract current task util from prev_cpu util and
have a sensible number to work with.

It looks as though if wake_wide returns 0, we could potentially not
execute wake_cap, which would result in unsynced signals we then use
for energy calculations.

This is not necessarily an issue we've seen in traces, but it looks
as though it should be changed.

Change-Id: Ic54a3cba2a10d946ea20113a04371dea04115e82
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2017-06-02 08:01:54 -07:00
Chris Redpath
8865f07600 sched/fair: remove task util from own cpu when placing waking task
When we place a waking task with find_best_target, we calculate the
existing and new utilisation of each candidate cpu. However, we do
not remove any blocked load resulting from the waking task on the
previous cpu which might cause unnecessary migrations.

Switch to using cpu_util_wake which does this for us, which requires
moving cpu_util_wake a few functions earlier.

Also, we have multiple potential cpu utilization signals here, so
update the necessary bits to allow WALT to work properly (including
not subtracting task util for WALT).

When WALT is in use, cpu utilization is the utilization
in the previous completed window, whilst the task utilization
ignores fully idle windows. There seems to be no way to have a
decently accurate estimate of how much (if any) utilization from
this task remains on the prev cpu.

Instead, just return cpu_util when we're using WALT.
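
The behaviour described above can be sketched as follows. The function signature is hypothetical (the kernel helper takes a cpu and a task); the flags `task_was_on_cpu` and `use_walt` are assumptions made to keep the example self-contained.

```python
def cpu_util_wake(cpu_util, task_util, capacity, task_was_on_cpu, use_walt):
    """Estimate a CPU's utilization without the waking task's contribution."""
    if use_walt or not task_was_on_cpu:
        # WALT: no reliable estimate of the task's remaining share, so
        # fall back to the plain CPU utilization.
        return cpu_util
    # PELT: discount the task's blocked utilization, clamped to [0, capacity].
    return min(max(cpu_util - task_util, 0), capacity)
```

For example, a previous CPU at utilization 500 with a waking task of utilization 200 is treated as 300 under PELT, and as 500 under WALT.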

Change-Id: I448203ab98ffb5c020dfb6b218581eef1f5601f7
Reported-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2017-06-02 08:01:54 -07:00
Chris Redpath
8ac52cbaf4 trace:sched: Make util_avg in load_avg trace reflect PELT/WALT as used
With the ability to choose between WALT and PELT for utilisation tracking
we can have the situation where we're using WALT to make all the
decisions and reporting PELT figures in the sched_load_avg_(cpu|task)
trace points. This is not too much of an issue, but when analysing trace
it is nice to see numbers representing what the scheduler is using rather
than needing to add in additional sched_walt_* traces to figure it out.

Add reporting for both types, and make the util_avg member reflect what
will be seen from cpu or task_util functions in the scheduler.

Change-Id: I2abbd2c5fa70822096d0f3372b4c12b1c6af1590
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2017-06-02 08:01:54 -07:00
Dietmar Eggemann
4b85765a3d sched/fair: Add eas (& cas) specific rq, sd and task stats
The statistics counters are placed in the eas (& cas) wakeup path. Each
of them has one representation for the runqueue (rq), the sched_domain
(sd) and the task.
A task counter is always incremented. A rq counter is always
incremented for the rq the scheduler is currently running on. A sd
counter is only incremented if a relation to a sd exists.

The counters are exposed:

(1) In /proc/schedstat for rq's and sd's:

$ cat /proc/schedstat
...
cpu0 71422 0 2321254 ...
eas  44144 0 0 19446 0 24698 568435 51621 156932 133 222011 17459 120279 516814 83 0 156962 359235 176439 139981  <- runqueue for cpu0
...
domain0 3 42430 42331 ...
eas 0 0 0 14200 0 0 0 0 0 0 0 0 0 0 0 0 0 0 66355 0  <- MC sched domain for cpu0
...

The per-cpu eas vector has the following elements:

sis_attempts  sis_idle   sis_cache_affine sis_suff_cap    sis_idle_cpu    sis_count               ||
secb_attempts secb_sync  secb_idle_bt     secb_insuff_cap secb_no_nrg_sav secb_nrg_sav secb_count ||
fbt_attempts  fbt_no_cpu fbt_no_sd        fbt_pref_idle   fbt_count                               ||
cas_attempts  cas_count

The following relations exist between these counters (from cpu0 eas
vector above):

sis_attempts = sis_idle + sis_cache_affine + sis_suff_cap + sis_idle_cpu + sis_count

44144        = 0        + 0                + 19446        + 0            + 24698

secb_attempts = secb_sync + secb_idle_bt + secb_insuff_cap + secb_no_nrg_sav + secb_nrg_sav + secb_count

568435        = 51621     + 156932       + 133             + 222011          + 17459        + 120279

fbt_attempts = fbt_no_cpu + fbt_no_sd + fbt_pref_idle + fbt_count + (return -1)

516814       = 83         + 0         + 156962        + 359235    + (534)

cas_attempts = cas_count + (return -1 or smp_processor_id())

176439       = 139981    + (36458)

(2) In /proc/$PROCESS_PID/task/$TASK_PID/sched for a task.

example: main thread of system_server

$ cat /proc/1083/task/1083/sched

...
se.statistics.nr_wakeups_sis_attempts        :                  945
se.statistics.nr_wakeups_sis_idle            :                    0
se.statistics.nr_wakeups_sis_cache_affine    :                    0
se.statistics.nr_wakeups_sis_suff_cap        :                  219
se.statistics.nr_wakeups_sis_idle_cpu        :                    0
se.statistics.nr_wakeups_sis_count           :                  726
se.statistics.nr_wakeups_secb_attempts       :                10376
se.statistics.nr_wakeups_secb_sync           :                 1462
se.statistics.nr_wakeups_secb_idle_bt        :                 6984
se.statistics.nr_wakeups_secb_insuff_cap     :                    3
se.statistics.nr_wakeups_secb_no_nrg_sav     :                  927
se.statistics.nr_wakeups_secb_nrg_sav        :                  206
se.statistics.nr_wakeups_secb_count          :                  794
se.statistics.nr_wakeups_fbt_attempts        :                 8914
se.statistics.nr_wakeups_fbt_no_cpu          :                    0
se.statistics.nr_wakeups_fbt_no_sd           :                    0
se.statistics.nr_wakeups_fbt_pref_idle       :                 6987
se.statistics.nr_wakeups_fbt_count           :                 1554
se.statistics.nr_wakeups_cas_attempts        :                 3107
se.statistics.nr_wakeups_cas_count           :                 1195
...

The same relations between the counters as in the per-cpu case apply.
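
The relations for the cpu0 eas vector shown earlier can be checked mechanically. The sketch below is only a helper for reading /proc/schedstat output; `residual` is a hypothetical name, and the values are copied from the cpu0 example in this message.

```python
NAMES = """sis_attempts sis_idle sis_cache_affine sis_suff_cap sis_idle_cpu
sis_count secb_attempts secb_sync secb_idle_bt secb_insuff_cap
secb_no_nrg_sav secb_nrg_sav secb_count fbt_attempts fbt_no_cpu fbt_no_sd
fbt_pref_idle fbt_count cas_attempts cas_count""".split()

EAS_CPU0 = [44144, 0, 0, 19446, 0, 24698, 568435, 51621, 156932, 133,
            222011, 17459, 120279, 516814, 83, 0, 156962, 359235,
            176439, 139981]

counters = dict(zip(NAMES, EAS_CPU0))


def residual(prefix):
    """Attempts minus the sum of all other counters with the same prefix."""
    parts = sum(v for k, v in counters.items()
                if k.startswith(prefix) and not k.endswith("_attempts"))
    return counters[prefix + "attempts"] - parts
```

A residual of zero means every attempt is accounted for by a counter; for fbt and cas the residual is the number of uncounted early returns.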

Change-Id: Ie7d01267c78a3f41f60a3ef52917d5a5d463f195
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2017-06-02 08:01:54 -07:00
Andres Oportus
aa8882923a sched/core: Fix PELT jump to max OPP upon util increase
Change-Id: Ic80b588ec466ef707f658dcea039fd0d6b384b63
Signed-off-by: Andres Oportus <andresoportus@google.com>
2017-06-02 08:01:54 -07:00
Dietmar Eggemann
55af384815 sched: EAS & 'single cpu per cluster'/cpu hotplug interoperability
For Energy-Aware Scheduling (EAS) to work properly, even in the
case that there is only one cpu per cluster or that cpus are hot-plugged
out, the Energy Model (EM) data on all energy-aware sched domains (sd)
has to be present for all online cpus.

Mainline sd hierarchy setup code will remove sd's which are not useful
for task scheduling e.g. in the following situations:

1. Only 1 cpu is/remains in one cluster of a multi cluster system.

   This remaining cpu only has DIE and no MC sd.

2. A complete cluster in a two cluster system is hot-plugged out.

   The cpus of the remaining cluster only have MC and no DIE sd.

To make sure that all online cpus keep all their energy-aware sd's,
the sd degenerate functionality has been changed to not free a sd if
its first sched group (sg) contains EM data in case:

1. There is only 1 cpu left in the sd.

2. There have to be at least 2 sg's if certain sd flags are set.

Instead of freeing such a sd it now clears only its SD_LOAD_BALANCE
flag. This will make sure that the EAS functionality will always see
all energy-aware sd's for all online cpus.

It will introduce a tiny performance degradation for operations on
affected cpus since the hot-path macro for_each_domain() has to deal
with sd's not contributing to task scheduling at all now.

In most cases the existing code makes sure that task scheduling is not
invoked on a sd with !SD_LOAD_BALANCE.

However, a small change is necessary in update_sd_lb_stats() to make
sure that sd->parent is only initialized to !NULL in case the parent sd
contains more than 1 sg.

The handling of newidle decay values before the SD_LOAD_BALANCE check in
rebalance_domains() stays unchanged.

Test (w/ CONFIG_SCHED_DEBUG):

JUNO r0 default system:

$ cat /proc/cpuinfo | grep "^CPU part"
CPU part        : 0xd03
CPU part        : 0xd07
CPU part        : 0xd07
CPU part        : 0xd03
CPU part        : 0xd03
CPU part        : 0xd03

SD names and flags:

$ cat /proc/sys/kernel/sched_domain/cpu*/domain*/name
MC
DIE
MC
DIE
MC
DIE
MC
DIE
MC
DIE
MC
DIE

$ printf "%x\n" `cat /proc/sys/kernel/sched_domain/cpu*/domain*/flags`
832f
102f
832f
102f
832f
102f
832f
102f
832f
102f
832f
102f

Test 1: Hotplug-out one A57 (CPU part 0xd07) cpu:

$ echo 0 > /sys/devices/system/cpu/cpu1/online

$ cat /proc/cpuinfo | grep "^CPU part"
CPU part        : 0xd03
CPU part        : 0xd07
CPU part        : 0xd03
CPU part        : 0xd03
CPU part        : 0xd03

SD names and flags for remaining A57 (cpu2) cpu:

$ cat /proc/sys/kernel/sched_domain/cpu2/domain*/name
MC
DIE

$ printf "%x\n" `cat /proc/sys/kernel/sched_domain/cpu2/domain*/flags`
832e <-- MC SD with !SD_LOAD_BALANCE
102f

Test 2: Hotplug-out the entire A57 cluster:

$ echo 0 > /sys/devices/system/cpu/cpu1/online
$ echo 0 > /sys/devices/system/cpu/cpu2/online

$ cat /proc/cpuinfo | grep "^CPU part"
CPU part        : 0xd03
CPU part        : 0xd03
CPU part        : 0xd03
CPU part        : 0xd03

SD names and flags for the remaining A53 (CPU part 0xd03) cluster:

$ cat /proc/sys/kernel/sched_domain/cpu*/domain*/name
MC
DIE
MC
DIE
MC
DIE
MC
DIE

$ printf "%x\n" `cat /proc/sys/kernel/sched_domain/cpu*/domain*/flags`
832f
102e <-- DIE SD with !SD_LOAD_BALANCE
832f
102e
832f
102e
832f
102e
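
To read the flag values above, it helps to test the lowest bit. This sketch assumes SD_LOAD_BALANCE is 0x0001, as it is in the 4.4 sched_domain flag definitions; `load_balance_enabled` is a hypothetical helper.

```python
SD_LOAD_BALANCE = 0x0001  # assumed value, bit 0 of the sd flags


def load_balance_enabled(flags_hex: str) -> bool:
    """Return True if the SD_LOAD_BALANCE bit is set in a hex flags string."""
    return bool(int(flags_hex, 16) & SD_LOAD_BALANCE)


# Values taken from the test output above.
states = {f: load_balance_enabled(f) for f in ("832f", "832e", "102f", "102e")}
```

The only difference between 832f and 832e (or 102f and 102e) is that cleared bit, which matches the "!SD_LOAD_BALANCE" annotations.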

Change-Id: If24aa2b2628f334abbf0207d39e2a86168d9d673
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
2017-06-02 08:01:54 -07:00
Vincent Guittot
e62a1ca36b UPSTREAM: sched/core: Fix group_entity's share update
The update of the share of a cfs_rq is done when its load_avg is updated
but before the group_entity's load_avg has been updated for the past time
slot. This generates wrong load_avg accounting which can be significant
when small tasks are involved in the scheduling.

Let's take the example of a task a that is dequeued from its task group A:
   root
  (cfs_rq)
    \
    (se)
     A
    (cfs_rq)
      \
      (se)
       a

Task "a" was the only task in task group A which becomes idle when a is
dequeued.

We have the sequence:

- dequeue_entity a->se
    - update_load_avg(a->se)
    - dequeue_entity_load_avg(A->cfs_rq, a->se)
    - update_cfs_shares(A->cfs_rq)
	A->cfs_rq->load.weight == 0
        A->se->load.weight is updated with the new share (0 in this case)
- dequeue_entity A->se
    - update_load_avg(A->se) but its weight is now null so the last time
      slot (up to a tick) will be accounted with a weight of 0 instead of
      its real weight during the time slot. The last time slot will be
      accounted as an idle one whereas it was a running one.

If the running time of task a is short enough that no tick happens when it
runs, all running time of group entity A->se will be accounted as idle
time.

Instead, we should update the share of a cfs_rq (in fact the weight of its
group entity) only after having updated the load_avg of the group_entity.

update_cfs_shares() now takes the sched_entity as a parameter instead of the
cfs_rq, and the weight of the group_entity is updated only once its load_avg
has been synced with current time.
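
The effect of the ordering can be seen in a toy model. This is a deliberately simplified sketch: `GroupEntity`, `pending_runtime` and `dequeue` are hypothetical, and PELT decay is ignored so that only the order of the weight update and the load sync matters.

```python
class GroupEntity:
    def __init__(self, weight, pending_runtime):
        self.weight = weight
        self.load_sum = 0
        # Run time in the last slot that is not yet folded into load_sum.
        self.pending_runtime = pending_runtime

    def update_load_avg(self):
        self.load_sum += self.weight * self.pending_runtime
        self.pending_runtime = 0


def dequeue(se, new_weight, shares_first):
    if shares_first:
        # Old order: the weight changes before the last slot is accounted.
        se.weight = new_weight
        se.update_load_avg()
    else:
        # New order: sync the load first, then change the weight.
        se.update_load_avg()
        se.weight = new_weight
    return se.load_sum


old = dequeue(GroupEntity(1024, 5), new_weight=0, shares_first=True)
new = dequeue(GroupEntity(1024, 5), new_weight=0, shares_first=False)
```

In the old order the last slot contributes nothing, as if the entity had been idle; in the new order it is accounted with the weight it actually ran at.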

Change-Id: Id6ce3be1767b44b444ce2a77ed1ba063e57c0664
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: pjt@google.com
Link: http://lkml.kernel.org/r/1482335426-7664-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 89ee048f3cc796db6f26906c6bef4edf0bee70fd)
[minor cherry pick stuff]
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2017-06-02 08:01:54 -07:00
Peter Zijlstra
baaa21b59b UPSTREAM: sched/fair: Fix calc_cfs_shares() fixed point arithmetics width confusion
Commit:

  fde7d22e01 ("sched/fair: Fix overly small weight for interactive group entities")

did something non-obvious, and the way it did so was buggy, though the bug stayed latent.

The problem was exposed for real by a later commit in the v4.7 merge window:

  2159197d6677 ("sched/core: Enable increased load resolution on 64-bit kernels")

... after which tg->load_avg and cfs_rq->load.weight had different
units (10 bit fixed point and 20 bit fixed point resp.).

Add a comment to explain the use of cfs_rq->load.weight over the
'natural' cfs_rq->avg.load_avg and add scale_load_down() to correct
for the difference in unit.

Since this is (now, as per a previous commit) the only user of
calc_tg_weight(), collapse it.

The effects of this bug should be randomly inconsistent SMP-balancing
of cgroups workloads.
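
The unit mismatch can be illustrated with a simplified share calculation. This is a sketch, not calc_cfs_shares() itself: clamping is omitted, `calc_shares` and `fix_units` are hypothetical, tg_shares is kept in 10-bit units for readability, and the shift of 10 is the assumed difference between the two fixed-point formats.

```python
SCHED_FIXEDPOINT_SHIFT = 10  # assumed gap between 20-bit and 10-bit units


def scale_load_down(w):
    return w >> SCHED_FIXEDPOINT_SHIFT


def calc_shares(tg_shares, tg_load_avg, contrib, rq_weight, fix_units):
    """Share of the group for one cfs_rq; rq_weight is in 20-bit units."""
    load = scale_load_down(rq_weight) if fix_units else rq_weight
    tg_weight = tg_load_avg - contrib + load
    shares = tg_shares * load
    return shares // tg_weight if tg_weight else shares


# Two CPUs, each running one nice-0 task of the group (hypothetical numbers).
correct = calc_shares(1024, 2048, 1024, 1024 << 10, fix_units=True)
mixed = calc_shares(1024, 2048, 1024, 1024 << 10, fix_units=False)
```

With consistent units each CPU gets half the group's shares (512). With mixed units the local 20-bit weight swamps the 10-bit group load and the result is 1023, almost the whole share on every CPU.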

Change-Id: If1e565662ea163485edd94a12aef644d0e0dfe7a
Reported-by: Jirka Hladky <jhladky@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 2159197d6677 ("sched/core: Enable increased load resolution on 64-bit kernels")
Fixes: fde7d22e01 ("sched/fair: Fix overly small weight for interactive group entities")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit ea1dc6fc6242f991656e35e2ed3d90ec1cd13418)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2017-06-02 08:01:54 -07:00
Vincent Guittot
20bbd92679 UPSTREAM: sched/fair: Fix incorrect task group ->load_avg
A scheduler performance regression has been reported by Joseph Salisbury,
which he bisected back to:

  3d30544f0212 ("sched/fair: Apply more PELT fixes")

The regression triggers when several levels of task groups are involved
(read: SystemD) and cpu_possible_mask != cpu_present_mask.

The root cause is that a group entity's load (tg_child->se[i]->avg.load_avg)
is initialized to scale_load_down(se->load.weight). During the creation of
a child task group, its group entities on possible CPUs are attached to the
parent's cfs_rq (tg_parent) and their loads are added to the parent's load
(tg_parent->load_avg) with update_tg_load_avg().

But only the load on online CPUs will then be updated to reflect real load,
whereas load on other CPUs will stay at the initial value.

As a result, tg_parent->load_avg is higher than the real load, the
weight of group entities (tg_parent->se[i]->load.weight) on online CPUs is
smaller than it should be, and the task group gets less running time than
it could expect.

( This situation can be detected with /proc/sched_debug. The ".tg_load_avg"
  of the task group will be much higher than sum of ".tg_load_avg_contrib"
  of online cfs_rqs of the task group. )
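
The effect can be reproduced with a toy Python model. It is an
illustration only: the function names, the CPU counts and the load
values are invented for the example, and the share computation is
reduced to a plain ratio.

```python
# Illustrative model only; all numbers are made up for the example.
INIT_LOAD = 1024   # stands in for scale_load_down(se->load.weight)


def parent_load_avg(possible_cpus, online_cpus, real_load, init_load):
    """Sum the per-CPU contributions as the parent group sees them.

    Only online CPUs are ever updated to their real load; the other
    possible CPUs keep the value they were initialized with.
    """
    return sum(real_load if cpu in online_cpus else init_load
               for cpu in possible_cpus)


def entity_weight(shares, local_load, tg_load):
    """Weight of the group entity on one CPU: its slice of the shares."""
    return shares * local_load // tg_load if tg_load else shares


possible = range(4)   # cpu_possible_mask has 4 CPUs ...
online = {0, 1}       # ... but only 2 of them are present and online
real = 300            # real load of the child group on each online CPU

buggy_total = parent_load_avg(possible, online, real, INIT_LOAD)
fixed_total = parent_load_avg(possible, online, real, 0)
online_sum = real * len(online)

print("buggy:", buggy_total, entity_weight(1024, real, buggy_total))
print("fixed:", fixed_total, entity_weight(1024, real, fixed_total))
```

In the buggy case the total is far above the sum of the online
contributions, which is the signature described above, and the weight on
each online CPU shrinks accordingly.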

The load of group entities does not have to be initialized to anything
other than 0, because their load will increase when an entity is attached.

Change-Id: Ie55021ff98ba49016adfddb2444e9c9709939226
Reported-by: Joseph Salisbury <joseph.salisbury@canonical.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: <stable@vger.kernel.org> # 4.8.x
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: joonwoop@codeaurora.org
Fixes: 3d30544f0212 ("sched/fair: Apply more PELT fixes")
Link: http://lkml.kernel.org/r/1476881123-10159-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit b5a9b340789b2b24c6896bcf7a065c31a4db671c)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2017-06-02 08:01:54 -07:00
Peter Zijlstra
640c909c34 UPSTREAM: sched/fair: Fix effective_load() to consistently use smoothed load
Starting with the following commit:

  fde7d22e01 ("sched/fair: Fix overly small weight for interactive group entities")

calc_tg_weight() doesn't compute the right value as expected by effective_load().

The difference is in the 'correction' term. In order to ensure \Sum
rw_j >= rw_i we cannot use tg->load_avg directly, since that might be
lagging a correction on the current cfs_rq->avg.load_avg value.
Therefore we use tg->load_avg - cfs_rq->tg_load_avg_contrib +
cfs_rq->avg.load_avg.
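
A short Python sketch of why the correction matters. The function names
and numbers are hypothetical; only the expression
tg->load_avg - cfs_rq->tg_load_avg_contrib + cfs_rq->avg.load_avg comes
from the text above.

```python
# Sketch only: compares the lagging group total with the corrected one.

def tg_weight_naive(tg_load_avg):
    """Use the published group total directly (may lag behind)."""
    return tg_load_avg


def tg_weight_corrected(tg_load_avg, tg_load_avg_contrib, cur_load_avg):
    """Swap this cfs_rq's stale contribution for its current load."""
    return tg_load_avg - tg_load_avg_contrib + cur_load_avg


# The group total was last published as 1000, of which this cfs_rq
# contributed 200. Since then its load has grown to 1500, but the
# total has not been refreshed yet.
tg_load_avg, contrib, cur = 1000, 200, 1500

naive = tg_weight_naive(tg_load_avg)
corrected = tg_weight_corrected(tg_load_avg, contrib, cur)
print(naive >= cur, corrected >= cur)
```

As long as the other runqueues' contributions are non-negative, the
corrected total can never be smaller than the local load, which is the
\Sum rw_j >= rw_i property the text asks for.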

Now, per the referenced commit, calc_tg_weight() doesn't use
cfs_rq->avg.load_avg, as is later used in @w, but uses
cfs_rq->load.weight instead.

So stop using calc_tg_weight() and do it explicitly.

The effects of this bug are wake_affine() making randomly
poor choices in cgroup-intense workloads.

Change-Id: I1c0058ff674650cf295c8dc3b88a5a3de4bddab0
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@vger.kernel.org> # v4.3+
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: fde7d22e01 ("sched/fair: Fix overly small weight for interactive group entities")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 7dd4912594daf769a46744848b05bd5bc6d62469)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2017-06-02 08:01:54 -07:00
Vincent Guittot
89e4d18a67 UPSTREAM: sched/fair: Propagate asynchronous detach
A task can be asynchronously detached from a cfs_rq when migrating
between CPUs. The load of the migrated task is then removed from the
source cfs_rq during its next update. We use this event to set the
propagation flag.

During the load balance, we take advantage of the update of blocked
load to propagate any pending changes.

The propagation relies on patch:

  "sched: Fix hierarchical order in rq->leaf_cfs_rq_list"

... which orders children and parents, to ensure that it's done in one pass.

Change-Id: I33782e35fc4711f5901e8c23d6aa7ec5f2ff7ee5
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: kernellwp@gmail.com
Cc: pjt@google.com
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1478598827-32372-6-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 4e5160766fcc9f41bbd38bac11f92dce993644aa)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2017-06-02 08:01:54 -07:00
Vincent Guittot
e875665411 UPSTREAM: sched/fair: Propagate load during synchronous attach/detach
When a task moves from/to a cfs_rq, we set a flag which is then used to
propagate the change at parent level (sched_entity and cfs_rq) during
next update. If the cfs_rq is throttled, the flag will stay pending until
the cfs_rq is unthrottled.

For propagating the utilization, we copy the utilization of group cfs_rq to
the sched_entity.

For propagating the load, we have to take into account the load of the
whole task group in order to evaluate the load of the sched_entity.
Similarly to what was done before the rewrite of PELT, we add a correction
factor in case the task group's load is greater than its share, so that it
contributes the same load as a task of equal weight.
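
A minimal Python sketch of the two propagation rules described above.
The function names are hypothetical, and the exact form of the
correction (scaling by shares / group load) is an assumption made for
illustration; the text only states that a correction applies when the
group's load exceeds its share.

```python
# Sketch only: hypothetical helpers, not the kernel implementation.

def propagated_util(gcfs_rq_util):
    """Utilization is copied from the group cfs_rq to its sched_entity."""
    return gcfs_rq_util


def propagated_load(gcfs_rq_load, tg_load_avg, tg_load_avg_contrib,
                    tg_shares):
    """Load of the group's sched_entity on one CPU.

    Assumed correction: when the whole group's load exceeds its shares,
    scale the local load by shares / group load, so that the group
    weighs no more than a single task with the same weight.
    """
    load = gcfs_rq_load
    if load:
        # Group total with this runqueue's current load swapped in.
        tg_load = tg_load_avg - tg_load_avg_contrib + load
        if tg_load > tg_shares:
            load = load * tg_shares // tg_load
    return load


# Three nice-0 tasks (3 * 1024) in a group with shares = 1024, all on
# one CPU: the entity's load is capped to that of one nice-0 task.
heavy = propagated_load(3072, 3072, 3072, 1024)
# A lightly loaded group stays below its shares and is left unchanged.
light = propagated_load(300, 300, 300, 1024)
print(heavy, light)
```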

Change-Id: Id34a9888484716961c9027299c0b4d82881a39d1
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: kernellwp@gmail.com
Cc: pjt@google.com
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1478598827-32372-5-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 09a43ace1f986b003c118fdf6ddf1fd685692d49)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2017-06-02 08:01:54 -07:00
Vincent Guittot
8370e07d82 UPSTREAM: sched/fair: Fix hierarchical order in rq->leaf_cfs_rq_list
Fix the insertion of cfs_rq in rq->leaf_cfs_rq_list to ensure that a
child will always be called before its parent.

The hierarchical order in shares update list has been introduced by
commit:

  67e86250f8 ("sched: Introduce hierarchal order on shares update list")

With the current implementation a child can still be placed after its
parent.

Let's take the example of:

       root
        \
         b
         /\
         c d*
           |
           e*

with root -> b -> c already enqueued but not d -> e so the
leaf_cfs_rq_list looks like: head -> c -> b -> root -> tail

The branch d -> e will be added the first time that they are enqueued,
starting with e then d.

When e is added, its parent is not yet on the list, so e is put at
the tail: head -> c -> b -> root -> e -> tail

Then, d is added at the head because its parent is already on the
list: head -> d -> c -> b -> root -> e -> tail

e is not placed at the right position and will be called last,
whereas it should be called first.

Because enqueueing proceeds bottom-up, we are sure to finish by adding
either a cfs_rq without a parent or a cfs_rq whose parent is already on
the list. We can use this event to detect when a new branch has been
fully added. The others, whose parents have not been added yet, must be
placed after their children that were inserted in the preceding steps,
and before any ancestors that are already on the list. The easiest way
is to put the cfs_rq just after the last inserted one and to keep track
of it until the branch is fully added.
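
The example can be replayed with a small Python simulation. It is a
sketch under stated assumptions: the list is a plain Python list rather
than an RCU list, the class and function names are invented, and
placing a cfs_rq whose parent is already listed immediately before that
parent is one possible choice that satisfies the ordering, not
necessarily what the patch does.

```python
# Sketch only: replays the root/b/c/d/e example from the text.
PARENT = {"root": None, "b": "root", "c": "b", "d": "b", "e": "d"}


def add_old(lst, node):
    """Old rule: head if the parent is already listed, otherwise tail."""
    if node in lst:
        return
    parent = PARENT[node]
    if parent is not None and parent in lst:
        lst.insert(0, node)
    else:
        lst.append(node)


class LeafList:
    """New rule: remember the last node of a branch still being added."""

    def __init__(self):
        self.items = []
        self.last = None   # last inserted node of the pending branch

    def add(self, node):
        if node in self.items:
            return
        parent = PARENT[node]
        if parent is not None and parent in self.items:
            # Branch complete (assumed placement: just before the parent).
            self.items.insert(self.items.index(parent), node)
            self.last = None
        elif parent is None:
            # A cfs_rq without a parent also completes the branch.
            self.items.append(node)
            self.last = None
        else:
            # Parent not listed yet: go right after the last inserted one.
            pos = 0 if self.last is None else self.items.index(self.last) + 1
            self.items.insert(pos, node)
            self.last = node


def children_first(lst):
    """True if every listed child appears before its listed parent."""
    return all(lst.index(n) < lst.index(p)
               for n, p in PARENT.items() if n in lst and p in lst)


old = ["c", "b", "root"]
for n in ("e", "d"):
    add_old(old, n)

new = LeafList()
for n in ("c", "b", "root", "e", "d"):   # bottom-up enqueue order
    new.add(n)

print(old, children_first(old))
print(new.items, children_first(new.items))
```

The old rule reproduces the misplaced e from the example above, while
the tracked insertion keeps every child ahead of its parent.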

Change-Id: I4fe0b8502ea628c13d14e8e5c5279bce67fb8845
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: kernellwp@gmail.com
Cc: pjt@google.com
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1478598827-32372-3-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 9c2791f936ef5fd04a118b5c284f2c9a95f4a647)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2017-06-02 08:01:54 -07:00