Commit graph

1654 commits

Author SHA1 Message Date
Joonwoo Park
ee4cebd75e sched: EAS/WALT: use cr_avg instead of prev_runnable_sum
WALT accounts two major statistics: CPU load and cumulative task
demand.

The CPU load, an accumulation of each CPU's absolute execution time,
is used for CPU frequency guidance.  The cumulative task demand, each
CPU's instantaneous load reflecting the CPU's load at a given time, is
used for task placement decisions.

Use the cumulative task demand for cpu_util() for task placement, and
introduce cpu_util_freq() for frequency guidance.
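
A minimal sketch of the resulting split, assuming the two statistics
live in WALT's rq fields cumulative_runnable_avg and prev_runnable_sum
(illustrative only; scaling and capping in the real code are omitted):

  /* Task placement reads instantaneous demand, frequency guidance
   * reads accumulated execution time. */
  static inline unsigned long cpu_util(int cpu)
  {
          return cpu_rq(cpu)->cumulative_runnable_avg;
  }

  static inline unsigned long cpu_util_freq(int cpu)
  {
          return cpu_rq(cpu)->prev_runnable_sum;
  }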

Change-Id: Id928f01dbc8cb2a617cdadc584c1f658022565c5
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2017-09-01 17:20:59 -07:00
Joonwoo Park
48f67ea85d sched: WALT: fix broken cumulative runnable average accounting
When a running task's ravg.demand changes, update_history() adjusts
rq->cumulative_runnable_avg to reflect the change in CPU load.
Currently this fixup is broken: it accumulates the task's new demand
without subtracting the task's old demand.

Fix the fixup logic to subtract the task's old demand.
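
A sketch of the corrected fixup, assuming update_history() has both the
old and the new demand at hand (variable names illustrative):

  /* Before: only 'rq->cumulative_runnable_avg += new_demand;' ran,
   * double-counting the task. */
  rq->cumulative_runnable_avg -= old_demand;  /* subtract old demand */
  rq->cumulative_runnable_avg += new_demand;  /* then add the new one */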

Change-Id: I61beb32a4850879ccb39b733f5564251e465bfeb
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2017-09-01 17:20:51 -07:00
Joonwoo Park
26b37261ea sched: deadline: WALT: account cumulative runnable avg
Account cumulative runnable average for WALT CPU utilization accounting.

Change-Id: I56934894e626dec183740eeaf89a57d2ef638143
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2017-09-01 17:20:39 -07:00
Joel Fernandes
7842de4545 UPSTREAM: cpufreq: schedutil: Use unsigned int for iowait boost
Make iowait_boost and iowait_boost_max unsigned int, since their unit
is kHz and this is consistent with struct cpufreq_policy. Also change
the local variables in sugov_iowait_boost() to match.

Change-Id: I6c67ed94c57c4bdb24bada4b97045593fcb95d2e
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Joel Fernandes <joelaf@google.com>
2017-08-31 12:43:50 +00:00
Joel Fernandes
1ed33cf954 UPSTREAM: cpufreq: schedutil: Make iowait boost more energy efficient
Currently the iowait_boost feature in schedutil makes the frequency go
to max on iowait wakeups.  This feature was added to handle a case that
Peter described, where the throughput of operations involving
continuous I/O requests [1] is reduced when running at a lower
frequency; the lower throughput itself keeps utilization low, which in
turn keeps the frequency low, so the system is "stuck".

Instead of going to max, it's also possible to achieve the same effect
by ramping up to max if there are repeated in_iowait wakeups. This
patch is an attempt to do that. We start from a lower frequency
(policy->min) and double the boost for every consecutive iowait update
until we reach the maximum iowait boost frequency (iowait_boost_max).
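
A sketch of the ramp described above (field names borrowed from the
schedutil sugov_cpu structure; the actual patch may clamp differently):

  if (flags & SCHED_CPUFREQ_IOWAIT) {
          if (sg_cpu->iowait_boost) {
                  /* consecutive iowait wakeup: double the boost */
                  sg_cpu->iowait_boost <<= 1;
                  if (sg_cpu->iowait_boost > sg_cpu->iowait_boost_max)
                          sg_cpu->iowait_boost = sg_cpu->iowait_boost_max;
          } else {
                  /* first iowait wakeup: start from policy->min */
                  sg_cpu->iowait_boost = sg_cpu->sg_policy->policy->min;
          }
  }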

I ran a synthetic test (continuous O_DIRECT writes in a loop) on an x86
machine with intel_pstate in passive mode using schedutil. In this test
the iowait_boost value ramped from 800MHz to 4GHz in 60ms. The patch
achieves the same improved throughput as the existing behavior.

[1] https://patchwork.kernel.org/patch/9735885/

Change-Id: I4a018434a50f4ca29ec15b03465f6dc212e54423
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Suggested-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Joel Fernandes <joelaf@google.com>
2017-08-31 12:43:42 +00:00
Andy Lutomirski
8bc69d462a UPSTREAM: sched/core: Allow putting thread_info into task_struct
If an arch opts in by setting CONFIG_THREAD_INFO_IN_TASK_STRUCT,
then thread_info is defined as a single 'u32 flags' and is the first
entry of task_struct.  thread_info::task is removed (it serves no
purpose if thread_info is embedded in task_struct), and
thread_info::cpu gets its own slot in task_struct.
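
The resulting layout, roughly (a sketch based on the description above;
the real structures carry many more fields):

  #ifdef CONFIG_THREAD_INFO_IN_TASK_STRUCT
  struct thread_info {
          u32 flags;                       /* low-level flags */
  };
  #endif

  struct task_struct {
  #ifdef CONFIG_THREAD_INFO_IN_TASK_STRUCT
          struct thread_info thread_info;  /* must stay first */
          unsigned int cpu;                /* was thread_info::cpu */
  #endif
          /* ... */
  };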

This is heavily based on a patch written by Linus.

Originally-from: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jann Horn <jann@thejh.net>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/a0898196f0476195ca02713691a5037a14f2aac5.1473801993.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>

Bug: 38331309
Change-Id: I25e5a830f2ada5e74fa93661e97e5e701b1b70d2
(cherry picked from commit c65eacbe290b8141554c71b2c94489e73ade8c8d)
Signed-off-by: Zubin Mithra <zsm@google.com>
2017-08-09 15:23:22 +01:00
Greg Kroah-Hartman
9f764bbe06 This is the 4.4.80 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlmHzogACgkQONu9yGCS
 aT72Kg/9Ea02hrf7SCaEmReH0CNBsZiWBp0u/4b6QtXt3TrPDXK0oteIB4SUIVi/
 zOzjU5SkssMLL9RoRQob81DLFJlL0b9ME5nLXxAACe2P74DaRSxA3DDmrYILgerH
 Gnv4k9xjbVMXMjdk6qAZ/SahCFfYPfnPCRO/zPeb3+6EZk8UQpaaB/GNxVCsGFTZ
 AfThsAHYzfFOg2fYdK0T09eDtAFqAokwGY6O8uaigkJt3u5mbMXcgxSp4o322OcG
 V3jxCUPzSk/78QtoSqQErXDCj/30451oLVByMBuRpBJAilsDf6VaURuz1dVfKFW8
 PdkLiy397sir696HwPU0HwHz++kRnZK2u2z//TRDE5wmgsC9VSq9fkggZdmNBol5
 N4ekCWjhYyyJzxf9hTxK/fA4t4KRFtOcdRiEkJj9RDIhT9jxsxPMr3TGJ25LJaUH
 8Qae+nNlYVe7lmaojckGa+AjIMm5HRB7LZnf4VQr1E8kvWpWpwA/0YtnduzPsXhH
 6xqT0rL/1/Z1Jz63/zPAtZ9OSL/ne0hJs+xOuUhKHGwH3oWBKrgmxAH8CAxYq0x9
 Y6ALkDweS3e+vVt+4BcHpUz8JTNTlspMcebt4VvjqvmERpKwmVsl7tEY242Uw4LQ
 wMF50vA9Cc0bVkVS7w2Ns/dn6XEWYpqS4a/MninjaBOMbtMia78=
 =l+tE
 -----END PGP SIGNATURE-----

Merge 4.4.80 into android-4.4

Changes in 4.4.80
	af_key: Add lock to key dump
	pstore: Make spinlock per zone instead of global
	net: reduce skb_warn_bad_offload() noise
	powerpc/pseries: Fix of_node_put() underflow during reconfig remove
	crypto: authencesn - Fix digest_null crash
	md/raid5: add thread_group worker async_tx_issue_pending_all
	drm/vmwgfx: Fix gcc-7.1.1 warning
	drm/nouveau/bar/gf100: fix access to upper half of BAR2
	KVM: PPC: Book3S HV: Context-switch EBB registers properly
	KVM: PPC: Book3S HV: Restore critical SPRs to host values on guest exit
	KVM: PPC: Book3S HV: Reload HTM registers explicitly
	KVM: PPC: Book3S HV: Save/restore host values of debug registers
	Revert "powerpc/numa: Fix percpu allocations to be NUMA aware"
	Staging: comedi: comedi_fops: Avoid orphaned proc entry
	drm/rcar: Nuke preclose hook
	drm: rcar-du: Perform initialization/cleanup at probe/remove time
	drm: rcar-du: Simplify and fix probe error handling
	perf intel-pt: Fix ip compression
	perf intel-pt: Fix last_ip usage
	perf intel-pt: Use FUP always when scanning for an IP
	perf intel-pt: Ensure never to set 'last_ip' when packet 'count' is zero
	xfs: don't BUG() on mixed direct and mapped I/O
	nfc: fdp: fix NULL pointer dereference
	net: phy: Do not perform software reset for Generic PHY
	isdn: Fix a sleep-in-atomic bug
	isdn/i4l: fix buffer overflow
	ath10k: fix null deref on wmi-tlv when trying spectral scan
	wil6210: fix deadlock when using fw_no_recovery option
	mailbox: always wait in mbox_send_message for blocking Tx mode
	mailbox: skip complete wait event if timer expired
	mailbox: handle empty message in tx_tick
	mpt3sas: Don't overreach ioc->reply_post[] during initialization
	kaweth: fix firmware download
	kaweth: fix oops upon failed memory allocation
	sched/cgroup: Move sched_online_group() back into css_online() to fix crash
	PM / Domains: defer dev_pm_domain_set() until genpd->attach_dev succeeds if present
	RDMA/uverbs: Fix the check for port number
	libnvdimm, btt: fix btt_rw_page not returning errors
	ipmi/watchdog: fix watchdog timeout set on reboot
	dentry name snapshots
	v4l: s5c73m3: fix negation operator
	Make file credentials available to the seqfile interfaces
	/proc/iomem: only expose physical resource addresses to privileged users
	vlan: Propagate MAC address to VLANs
	pstore: Allow prz to control need for locking
	pstore: Correctly initialize spinlock and flags
	pstore: Use dynamic spinlock initializer
	net: skb_needs_check() accepts CHECKSUM_NONE for tx
	sched/cputime: Fix prev steal time accouting during CPU hotplug
	xen/blkback: don't free be structure too early
	xen/blkback: don't use xen_blkif_get() in xen-blkback kthread
	tpm: fix a kernel memory leak in tpm-sysfs.c
	tpm: Replace device number bitmap with IDR
	x86/mce/AMD: Make the init code more robust
	r8169: add support for RTL8168 series add-on card.
	ARM: dts: n900: Mark eMMC slot with no-sdio and no-sd flags
	ipv6: Should use consistent conditional judgement for ip6 fragment between __ip6_append_data and ip6_finish_output
	net/mlx4: Remove BUG_ON from ICM allocation routine
	drm/msm: Ensure that the hardware write pointer is valid
	drm/msm: Verify that MSM_SUBMIT_BO_FLAGS are set
	vfio-pci: use 32-bit comparisons for register address for gcc-4.5
	irqchip/keystone: Fix "scheduling while atomic" on rt
	ASoC: tlv320aic3x: Mark the RESET register as volatile
	spi: dw: Make debugfs name unique between instances
	ASoC: nau8825: fix invalid configuration in Pre-Scalar of FLL
	irqchip/mxs: Enable SKIP_SET_WAKE and MASK_ON_SUSPEND
	openrisc: Add _text symbol to fix ksym build error
	dmaengine: ioatdma: Add Skylake PCI Dev ID
	dmaengine: ioatdma: workaround SKX ioatdma version
	dmaengine: ti-dma-crossbar: Add some 'of_node_put()' in error path.
	ARM64: zynqmp: Fix W=1 dtc 1.4 warnings
	ARM64: zynqmp: Fix i2c node's compatible string
	ARM: s3c2410_defconfig: Fix invalid values for NF_CT_PROTO_*
	ACPI / scan: Prefer devices without _HID/_CID for _ADR matching
	usb: gadget: Fix copy/pasted error message
	Btrfs: adjust outstanding_extents counter properly when dio write is split
	tools lib traceevent: Fix prev/next_prio for deadline tasks
	xfrm: Don't use sk_family for socket policy lookups
	perf tools: Install tools/lib/traceevent plugins with install-bin
	perf symbols: Robustify reading of build-id from sysfs
	video: fbdev: cobalt_lcdfb: Handle return NULL error from devm_ioremap
	vfio-pci: Handle error from pci_iomap
	arm64: mm: fix show_pte KERN_CONT fallout
	nvmem: imx-ocotp: Fix wrong register size
	sh_eth: enable RX descriptor word 0 shift on SH7734
	ALSA: usb-audio: test EP_FLAG_RUNNING at urb completion
	HID: ignore Petzl USB headlamp
	scsi: fnic: Avoid sending reset to firmware when another reset is in progress
	scsi: snic: Return error code on memory allocation failure
	ASoC: dpcm: Avoid putting stream state to STOP when FE stream is paused
	Linux 4.4.80

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2017-08-07 14:29:16 -07:00
Wanpeng Li
62208707b4 sched/cputime: Fix prev steal time accouting during CPU hotplug
commit 3d89e5478bf550a50c99e93adf659369798263b0 upstream.

Commit:

  e9532e69b8d1 ("sched/cputime: Fix steal time accounting vs. CPU hotplug")

... set rq->prev_* to 0 after a CPU hotplug comes back, in order to
fix the case where (after CPU hotplug) steal time is smaller than
rq->prev_steal_time.

However, this should never happen. Steal time was only smaller because of the
KVM-specific bug fixed by the previous patch.  Worse, that fix (commit
e9532e69b8d1) triggers a bug on CPU hot-unplug/plug operations: because
rq->prev_steal_time is cleared, all of the CPU's past steal time will be
accounted again on hot-plug.

Since the root cause has been fixed, we can just revert commit e9532e69b8d1.
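
For reference, steal accounting is incremental against the saved
snapshot, which is why zeroing the snapshot replays history (a
simplified sketch of the tick-time accounting, not the exact code):

  /* Steal since the last snapshot is accounted, then the snapshot
   * advances. Zeroing rq->prev_steal_time on hotplug therefore
   * re-accounts the CPU's entire past steal time. */
  steal = paravirt_steal_clock(cpu_of(rq)) - rq->prev_steal_time;
  rq->prev_steal_time += steal;
  account_steal_time(steal);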

Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 'commit e9532e69b8d1 ("sched/cputime: Fix steal time accounting vs. CPU hotplug")'
Link: http://lkml.kernel.org/r/1465813966-3116-3-git-send-email-wanpeng.li@hotmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andres Oportus <andresoportus@google.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-08-06 19:19:43 -07:00
Konstantin Khlebnikov
0e0967e262 sched/cgroup: Move sched_online_group() back into css_online() to fix crash
commit 96b777452d8881480fd5be50112f791c17db4b6b upstream.

Commit:

  2f5177f0fd7e ("sched/cgroup: Fix/cleanup cgroup teardown/init")

.. moved sched_online_group() from css_online() to css_alloc().
It exposes half-baked task group into global lists before initializing
generic cgroup stuff.

The LTP testcase (the third in cgroup_regression_test), written to test a
similar race in kernels 2.6.26-2.6.28, easily triggers this oops:

  BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
  IP: kernfs_path_from_node_locked+0x260/0x320
  CPU: 1 PID: 30346 Comm: cat Not tainted 4.10.0-rc5-test #4
  Call Trace:
  ? kernfs_path_from_node+0x4f/0x60
  kernfs_path_from_node+0x3e/0x60
  print_rt_rq+0x44/0x2b0
  print_rt_stats+0x7a/0xd0
  print_cpu+0x2fc/0xe80
  ? __might_sleep+0x4a/0x80
  sched_debug_show+0x17/0x30
  seq_read+0xf2/0x3b0
  proc_reg_read+0x42/0x70
  __vfs_read+0x28/0x130
  ? security_file_permission+0x9b/0xc0
  ? rw_verify_area+0x4e/0xb0
  vfs_read+0xa5/0x170
  SyS_read+0x46/0xa0
  entry_SYSCALL_64_fastpath+0x1e/0xad

Here the task group is already linked into the global RCU-protected 'task_groups'
list, but the css->cgroup pointer is still NULL.

This patch reverts this chunk and moves online back to css_online().

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 2f5177f0fd7e ("sched/cgroup: Fix/cleanup cgroup teardown/init")
Link: http://lkml.kernel.org/r/148655324740.424917.5302984537258726349.stgit@buzz
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-08-06 19:19:42 -07:00
Chris Redpath
2e13f308a9 sched/fair: Add a backup_cpu to find_best_target
Sometimes we find a target cpu but then do not use it, because the
energy_diff indicates that we would increase energy usage or not save
anything. To offer an additional option for those cases, also return a
second choice: the CPU we would have selected if the target CPU had not
been found. This gives us another chance to try to save some energy.
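
A sketch of the intended use in the wakeup path (would_save_energy() is
a hypothetical stand-in for the energy_diff() check):

  int backup_cpu = -1;
  int target_cpu = find_best_target(p, &backup_cpu);

  /* If moving to target_cpu saves no energy, fall back to the CPU
   * we would have selected had the target not been found. */
  if (target_cpu >= 0 && !would_save_energy(p, target_cpu))
          target_cpu = backup_cpu;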

Change-Id: I42c4f20aba10e4cf65b51ac4153e2e00e534c8c7
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
2017-07-25 16:31:00 +01:00
Chris Redpath
6cb8fcccb2 sched/fair: Try to estimate possible idle states.
In the current EAS group energy calculations, we only use
the idle state of the group as it is right now. This means
that there are times when EAS cannot see that we are about to
remove all utilization from a group, which would likely allow
that entire group to idle.

This is an attempt to detect that situation and at least
allow the energy calculation to include savings in that
scenario, regardless of what we might be able to actually
achieve in the real world. If a cluster or cpu looks like
it will have some idle time available to it, we try to
map the utilization onto an idle state.

Change-Id: I8fcb1e507f65ae6a2c5647eeef75a4bf28c7a0c0
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2017-07-25 16:31:00 +01:00
Brendan Jackman
d96e404728 sched/fair: Sync task util before EAS wakeup
Before using a task's util_avg signal in EAS, we need to ensure that
it has been synced up to the last_update_time of prev_cpu's root
cfs_rq.

We previously relied on the side effect of wake_cap to do that,
however that does not happen when the waking CPU has the same
capacity as the prev_cpu. Therefore just explicitly call
sync_entity_load_avg. This may result in calling that function twice
within the same select_task_rq_fair, but since last_update_time hasn't
changed, the second call will bail out very quickly.
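
A sketch of the explicit call at the start of the EAS wakeup path
(placement illustrative):

  /* Age p's util_avg up to prev_cpu's cfs_rq clock before it feeds
   * energy calculations. If wake_cap() already did this, this second
   * call sees an unchanged last_update_time and bails out early. */
  sync_entity_load_avg(&p->se);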

Change-Id: I91f1fcd71dfeb96b7f5b73418f1cf9ac311d4655
Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
2017-07-25 16:31:00 +01:00
Brendan Jackman
e76348ec5f Revert "sched/fair: ensure utilization signals are synchronized before use"
This reverts commit 83f462daa3.

Change-Id: I37ba36da61df2beb3a005557d9b673027f446916
Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
2017-07-25 16:31:00 +01:00
Leo Yan
ebc28671a5 sched/fair: kick nohz idle balance for misfit task
If there is a misfit task on one CPU, the current code does not handle
this situation for nohz idle balance. As a result, we can see the
misfit task stay running on a little core for a long time.

So this patch checks whether the CPU has a misfit task. If it does,
kick the nohz idle balance so that an active balance can finally be
executed.
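
A sketch of the trigger, assuming a per-rq misfit flag along the lines
of the EAS patches (names illustrative):

  static inline bool nohz_kick_needed(struct rq *rq)
  {
          /* A misfit task is stuck on a CPU too small for it: kick
           * nohz idle balance so active balance can migrate it. */
          if (rq->misfit_task)
                  return true;

          /* ... existing kick conditions ... */
          return false;
  }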

Change-Id: I117d3b7404296f8de11cb960a87a6b9a54a9f348
Signed-off-by: Leo Yan <leo.yan at linaro.org>
[taken from https://lists.linaro.org/pipermail/eas-dev/2016-September/000551.html]
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
2017-07-25 16:31:00 +01:00
Chris Redpath
7b63e1ff52 sched/fair: Update signals of nohz cpus if we are going idle
Stale cpu utilization signals can cause havoc for energy-aware systems,
and they are caused by no updates being performed for cpus which have
no tick running. There is open debate about the correct time to update
these cpus, and general recognition that something needs to be done.

This is an attempt to do something useful.

When we are looking for a task to pull for a newly-idle cpu, we have
an opportunity to update the stats for any cpu which has no tick running
without causing too much disturbance to the system or waking it up.

Change-Id: I0280104ea9c53e56c26f1c56a62bacab5d3e951b
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
2017-07-25 16:31:00 +01:00
Patrick Bellasi
bf6cd4d156 events: add tracepoint for find_best_target
Change-Id: I4c245ffacb207d7ea826c5763a426efe5399e0a2
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
2017-07-25 16:31:00 +01:00
Patrick Bellasi
5680f23f20 sched/fair: streamline find_best_target heuristics
The find_best_target() code has evolved over time to integrate different
micro-optimizations, to the point that it is now quite difficult to
follow exactly what it's doing.

This patch refactors the existing code to make it more readable and
easier to maintain. It does that by properly identifying the three main
use-cases and addressing them in priority order:
 A) latency sensitive tasks
 B) non latency sensitive tasks on IDLE CPUs
 C) non latency sensitive tasks on ACTIVE CPUs

The original behaviors are preserved. Some tests comparing
power/performance before and after this patch have been done using
Jankbench and YouTube, and we did not notice significant differences.

The only difference with respect to the original code is a small update
to favor lower-capacity idle CPUs in case B. The same preference is not
enforced in case A, since that can lead to selecting a non-reserved
CPU for TOP_APP tasks, which ultimately can lead to undesirable
co-scheduling side-effects.
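
The resulting priority order as a control-flow sketch (all names
hypothetical; the real function also tracks capacity and utilization
per candidate):

  static int pick_in_priority_order(bool latency_sensitive,
                                    int case_a_cpu, int best_idle_cpu,
                                    int best_active_cpu)
  {
          if (latency_sensitive)
                  return case_a_cpu;      /* case A */
          if (best_idle_cpu >= 0)
                  return best_idle_cpu;   /* case B: lower-capacity
                                             idle CPUs are favored */
          return best_active_cpu;         /* case C */
  }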

Change-Id: I871e5d95af89176217e4e239b64d44a420baabe8
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
(removed checkpatch whitespace error)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2017-07-25 16:31:00 +01:00
Greg Kroah-Hartman
59ff2e15be This is the 4.4.78 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAllxlOsACgkQONu9yGCS
 aT4naw//VrWIuIf523IP6egcS+iprNM1lt8HZIzTQ0b2x0qv82UFA9FSs3e97dbG
 LJAUEmHW+oBbm6AekAUrNFF62qaMM5HYxVgGYeiniA1dRvLCDG/OxzxPc0hv6Kmm
 VsbZBlPQEj2B5tqUOkpvQNcqbKpIVglBsK4tjeHE/mRziUkIoPXWKS6Tt7VKGtBa
 gIGY7VASyKhPtxGCdbKzHsO/IGi3cmpwAyhqQDRBR/5Dxy3p9xlQ4gTpobeL+x8z
 A7WHqVNbgfdz21LBii5xE8+GUwiRjlYhxeFJTjhM6wo2/XCtw1FJgc7EMmsbVtbZ
 xYu1/tkYaZGoxOQ7sH5jNgVjE8IcyVimDa5eJ7p7fbh3AzsyXzAJIRc8cS7cL40F
 jLDWrYDnm5t7ziITrAf0uoMmRZLHqS2Bv5sqaoCxR3D51r6LVaNdavD6hYN1CLRA
 fjlDvnvwxbLPI4YPr7PZNu4oSJiawKx9jEBOTFSYu9L1XLvdvdcVU6ULAdpShnLn
 80a+YJmYsNg54im7sxFUw6z87AScznzirIpXEJoUO+Hs0SN9Rq/BQx8cKvVeaMEI
 z53c4Hci45Go/ozZVqCTxGbkQ6tKWIKBSo6kl3xchwms4lmijpWuwbAkDUzECNvv
 0RgyQvNx4d2cRJ8hIVcdVFzp+h+kgNqVQQ2nDld7/1QSZHLPhFo=
 =vrgw
 -----END PGP SIGNATURE-----

Merge 4.4.78 into android-4.4

Changes in 4.4.78
	net_sched: fix error recovery at qdisc creation
	net: sched: Fix one possible panic when no destroy callback
	net/phy: micrel: configure intterupts after autoneg workaround
	ipv6: avoid unregistering inet6_dev for loopback
	net: dp83640: Avoid NULL pointer dereference.
	tcp: reset sk_rx_dst in tcp_disconnect()
	net: prevent sign extension in dev_get_stats()
	bpf: prevent leaking pointer via xadd on unpriviledged
	net: handle NAPI_GRO_FREE_STOLEN_HEAD case also in napi_frags_finish()
	ipv6: dad: don't remove dynamic addresses if link is down
	net: ipv6: Compare lwstate in detecting duplicate nexthops
	vrf: fix bug_on triggered by rx when destroying a vrf
	rds: tcp: use sock_create_lite() to create the accept socket
	brcmfmac: fix possible buffer overflow in brcmf_cfg80211_mgmt_tx()
	cfg80211: Define nla_policy for NL80211_ATTR_LOCAL_MESH_POWER_MODE
	cfg80211: Validate frequencies nested in NL80211_ATTR_SCAN_FREQUENCIES
	cfg80211: Check if PMKID attribute is of expected size
	irqchip/gic-v3: Fix out-of-bound access in gic_set_affinity
	parisc: Report SIGSEGV instead of SIGBUS when running out of stack
	parisc: use compat_sys_keyctl()
	parisc: DMA API: return error instead of BUG_ON for dma ops on non dma devs
	parisc/mm: Ensure IRQs are off in switch_mm()
	tools/lib/lockdep: Reduce MAX_LOCK_DEPTH to avoid overflowing lock_chain/: Depth
	kernel/extable.c: mark core_kernel_text notrace
	mm/list_lru.c: fix list_lru_count_node() to be race free
	fs/dcache.c: fix spin lockup issue on nlru->lock
	checkpatch: silence perl 5.26.0 unescaped left brace warnings
	binfmt_elf: use ELF_ET_DYN_BASE only for PIE
	arm: move ELF_ET_DYN_BASE to 4MB
	arm64: move ELF_ET_DYN_BASE to 4GB / 4MB
	powerpc: move ELF_ET_DYN_BASE to 4GB / 4MB
	s390: reduce ELF_ET_DYN_BASE
	exec: Limit arg stack to at most 75% of _STK_LIM
	vt: fix unchecked __put_user() in tioclinux ioctls
	mnt: In umount propagation reparent in a separate pass
	mnt: In propgate_umount handle visiting mounts in any order
	mnt: Make propagate_umount less slow for overlapping mount propagation trees
	selftests/capabilities: Fix the test_execve test
	tpm: Get rid of chip->pdev
	tpm: Provide strong locking for device removal
	Add "shutdown" to "struct class".
	tpm: Issue a TPM2_Shutdown for TPM2 devices.
	mm: fix overflow check in expand_upwards()
	crypto: talitos - Extend max key length for SHA384/512-HMAC and AEAD
	crypto: atmel - only treat EBUSY as transient if backlog
	crypto: sha1-ssse3 - Disable avx2
	crypto: caam - fix signals handling
	sched/topology: Fix overlapping sched_group_mask
	sched/topology: Optimize build_group_mask()
	PM / wakeirq: Convert to SRCU
	PM / QoS: return -EINVAL for bogus strings
	tracing: Use SOFTIRQ_OFFSET for softirq dectection for more accurate results
	KVM: x86: disable MPX if host did not enable MPX XSAVE features
	kvm: vmx: Do not disable intercepts for BNDCFGS
	kvm: x86: Guest BNDCFGS requires guest MPX support
	kvm: vmx: Check value written to IA32_BNDCFGS
	kvm: vmx: allow host to access guest MSR_IA32_BNDCFGS
	Linux 4.4.78

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2017-07-21 09:14:57 +02:00
Lauro Ramos Venancio
988067ec96 sched/topology: Optimize build_group_mask()
commit f32d782e31bf079f600dcec126ed117b0577e85c upstream.

The group mask is always used in intersection with the group CPUs. So,
when building the group mask, we don't have to care about CPUs that are
not part of the group.

Signed-off-by: Lauro Ramos Venancio <lvenanci@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: lwang@redhat.com
Cc: riel@redhat.com
Link: http://lkml.kernel.org/r/1492717903-5195-2-git-send-email-lvenanci@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-21 07:44:59 +02:00
Peter Zijlstra
5c34f49776 sched/topology: Fix overlapping sched_group_mask
commit 73bb059f9b8a00c5e1bf2f7ca83138c05d05e600 upstream.

The point of sched_group_mask is to select those CPUs from
sched_group_cpus that can actually arrive at this balance domain.

The current code gets it wrong, as can be readily demonstrated with a
topology like:

  node   0   1   2   3
    0:  10  20  30  20
    1:  20  10  20  30
    2:  30  20  10  20
    3:  20  30  20  10

Where (for example) domain 1 on CPU1 ends up with a mask that includes
CPU0:

  [] CPU1 attaching sched-domain:
  []  domain 0: span 0-2 level NUMA
  []   groups: 1 (mask: 1), 2, 0
  []   domain 1: span 0-3 level NUMA
  []    groups: 0-2 (mask: 0-2) (cpu_capacity: 3072), 0,2-3 (cpu_capacity: 3072)

This causes sched_balance_cpu() to compute the wrong CPU and
consequently should_we_balance() will terminate early resulting in
missed load-balance opportunities.

The fixed topology looks like:

  [] CPU1 attaching sched-domain:
  []  domain 0: span 0-2 level NUMA
  []   groups: 1 (mask: 1), 2, 0
  []   domain 1: span 0-3 level NUMA
  []    groups: 0-2 (mask: 1) (cpu_capacity: 3072), 0,2-3 (cpu_capacity: 3072)

(note: this relies on OVERLAP domains to always have children, this is
 true because the regular topology domains are still here -- this is
 before degenerate trimming)

Debugged-by: Lauro Ramos Venancio <lvenanci@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Fixes: e3589f6c81 ("sched: Allow for overlapping sched_domain spans")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-21 07:44:59 +02:00
Chris Redpath
2ee9941b0b UPSTREAM: cpufreq: schedutil: Trace frequency only if it has changed
sugov_update_commit() calls trace_cpu_frequency() to record the
current CPU frequency if it has not changed in the fast switch case
to prevent utilities from getting confused (they may report that the
CPU is idle if the frequency has not been recorded for too long, for
example).

However, that may cause the tracepoint to be triggered quite often
for no real reason (if the frequency doesn't change, we will not
modify the last update time stamp and governor computations may
run again shortly when that happens), so don't do that (arguably, it
is done to work around a utilities bug anyway).

That allows code duplication in sugov_update_commit() to be reduced
somewhat too.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
(cherry picked from commit 38d4ea229d25d30be6bf41bcd6cd663a587866ca)
(conflicts with sugov_up_down_rate_limit resolved)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Change-Id: Ia019dda29b8c1c4cf3553da75c88d066eb5674e9
2017-07-18 18:18:53 +00:00
Chris Redpath
537d19226a UPSTREAM: cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely
The way the schedutil governor uses the PELT metric causes it to
underestimate the CPU utilization in some cases.

That can be easily demonstrated by running kernel compilation on
a Sandy Bridge Intel processor, running turbostat in parallel with
it and looking at the values written to the MSR_IA32_PERF_CTL
register.  Namely, the expected result would be that when all CPUs
were 100% busy, all of them would be requested to run in the maximum
P-state, but observation shows that this clearly isn't the case.
The CPUs run in the maximum P-state for a while and then are
requested to run slower and go back to the maximum P-state after
a while again.  That causes the actual frequency of the processor to
visibly oscillate below the sustainable maximum in a jittery fashion
which clearly is not desirable.

That has been attributed to CPU utilization metric updates on task
migration that cause the total utilization value for the CPU to be
reduced by the utilization of the migrated task.  If that happens,
the schedutil governor may see a CPU utilization reduction and will
attempt to reduce the CPU frequency accordingly right away.  That
may be premature, though, for example if the system is generally
busy and there are other runnable tasks waiting to be run on that
CPU already.

This is unlikely to be an issue on systems where cpufreq policies are
shared between multiple CPUs, because in those cases the policy
utilization is computed as the maximum of the CPU utilization values
over the whole policy and if that turns out to be low, reducing the
frequency for the policy most likely is a good idea anyway.  On
systems with one CPU per policy, however, it may affect performance
adversely and even lead to increased energy consumption in some cases.

On those systems it may be addressed by taking another utilization
metric into consideration, like whether or not the CPU whose
frequency is about to be reduced has been idle recently, because if
that's not the case, the CPU is likely to be busy in the near future
and its frequency should not be reduced.

To that end, use the counter of idle calls in the timekeeping code.
Namely, make the schedutil governor look at that counter for the
current CPU every time before its frequency is about to be reduced.
If the counter has not changed since the previous iteration of the
governor computations for that CPU, the CPU has been busy for all
that time and its frequency should not be decreased, so if the new
frequency would be lower than the one set previously, the governor
will skip the frequency update.
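
A sketch of the busy check and its use, following the description above
(close to the upstream helper, modulo names):

  static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu)
  {
          unsigned long idle_calls = tick_nohz_get_idle_calls();
          bool busy = idle_calls == sg_cpu->saved_idle_calls;

          sg_cpu->saved_idle_calls = idle_calls;
          return busy;    /* no new idle calls: CPU never went idle */
  }

  /* ... and at the use site, skip reductions while busy: */
  if (sugov_cpu_is_busy(sg_cpu) && next_f < sg_policy->next_freq)
          next_f = sg_policy->next_freq;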

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Joel Fernandes <joelaf@google.com>
(cherry picked from commit b7eaf1aab9f8bd2e49fceed77ebc66c1b5800718)
(simple CPUFREQ_RT_DL vs CPUFREQ_DL usage conflicts)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Change-Id: I531ec02c052944ee07a904dc2a25c59948ee762b
2017-07-18 18:18:46 +00:00
Chris Redpath
a8a200d83b UPSTREAM: cpufreq: schedutil: Refactor sugov_next_freq_shared()
The loop in sugov_next_freq_shared() contains an if block to skip the
loop for the current CPU. This turns out to be an unnecessary
conditional in the scheduler's hot-path for every CPU in the policy.

It would be better to drop the conditional and make the loop treat all
the CPUs in the same way. That would eliminate the need of calling
sugov_iowait_boost() at the top of the routine.

To keep the code optimized to return early if the current CPU has RT/DL
flags set, move the flags check to sugov_update_shared() instead in
order to avoid the function call entirely.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit cba1dfb57b94c234728b689d9b00d4267fa1a879)
(modified for SCHED_CPUFREQ_DL vs SCHED_CPUFREQ_RT)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Change-Id: Ie046fdc8eda46821356750edd0fb6f7d077af363
2017-07-18 18:18:40 +00:00
Chris Redpath
7378c38a80 UPSTREAM: cpufreq: schedutil: Fix per-CPU structure initialization in sugov_start()
sugov_start() only initializes struct sugov_cpu per-CPU structures
for shared policies, but it should do that for single-CPU policies too.

That in particular makes the IO-wait boost mechanism work in the
cases when cpufreq policies correspond to individual CPUs.

Fixes: 21ca6d2c52f8 (cpufreq: schedutil: Add iowait boosting)
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: 4.9+ <stable@vger.kernel.org> # 4.9+
(cherry picked from commit 4296f23ed49a15d36949458adcc66ff993dee2a8)
(we use SCHED_CPUFREQ_DL instead of SCHED_CPUFREQ_RT in cpu->flags)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Change-Id: I5b837a0ee4432115d85caa1a9808ea61e1e1b07f
2017-07-18 18:18:33 +00:00
Viresh Kumar
cbaccedead UPSTREAM: cpufreq: schedutil: Pass sg_policy to get_next_freq()
get_next_freq() uses sg_cpu only to get sg_policy, which the callers of
get_next_freq() already have. Pass sg_policy instead of sg_cpu to
get_next_freq(), to make it more efficient.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit 655cb1ebff4b7918fc560502c3297af2d3c7d114)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Change-Id: Ia210058da32930a6cdb18258aa679cd1a44a747e
2017-07-18 18:18:27 +00:00
Chris Redpath
0646dd3592 UPSTREAM: cpufreq: schedutil: move cached_raw_freq to struct sugov_policy
cached_raw_freq applies to the entire cpufreq policy and not individual
CPUs. Apart from wasting per-cpu memory, it is actually wrong to keep it
in struct sugov_cpu as we may end up comparing next_freq with a stale
cached_raw_freq of a random CPU.

Move cached_raw_freq to struct sugov_policy.

Fixes: 5cbea46984d6 (cpufreq: schedutil: map raw required frequency to driver frequency)
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry-picked from 6c4f0fa643cb9e775dcc976e3db00d649468ff1d)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Change-Id: Ie91420f710819b383947f9031da9be1f3bb7f636
2017-07-18 18:18:20 +00:00
Viresh Kumar
69fc75780d UPSTREAM: cpufreq: schedutil: Rectify comment in sugov_irq_work() function
This patch rectifies a comment present in sugov_irq_work() function to
follow proper grammar.

Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit d06e622d3d9206e6a2cc45a0f9a3256da8773ff4)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Change-Id: Iaf996445d411725639d511432cc424086892a146
2017-07-18 18:18:13 +00:00
Chris Redpath
d9e7d036e7 UPSTREAM: cpufreq: schedutil: irq-work and mutex are only used in slow path
Execute the irq-work specific initialization/exit code only when the
fast path isn't available.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit 21ef57297b15a49b0c4dd4e7135c1a08e9a29a1c)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Change-Id: Icfd68f455ef71846d799fcd2d8ec6aa1bf59573e
2017-07-18 18:18:03 +00:00
Chris Redpath
ceed1eb2b4 UPSTREAM: cpufreq: schedutil: enable fast switch earlier
The fast_switch_enabled flag will be used by both sugov_policy_alloc()
and sugov_policy_free() with a later patch.

Prepare for that by moving the calls to enable and disable it to the
beginning of sugov_init() and end of sugov_exit().

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit 4a71ce4348bb61740d411822357061f8bf870f4c)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Change-Id: Ia174f423ca02d59360657ac2e77a5098ce5cf99c
2017-07-18 18:13:58 +00:00
Chris Redpath
bab9c2fbe4 UPSTREAM: cpufreq: schedutil: Avoid indented labels
Switch to the more common practice of writing labels.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit 8e2ddb03643eb9d0bc4926946d7ce0d308eef0a5)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Change-Id: Ida75c99cf3dff5cae24d3866454c83bcdb3385b9
2017-07-18 18:09:20 +00:00
Joonwoo Park
d368c6faa1 sched: walt: fix window misalignment when HZ=300
Due to rounding error, the hrtimer tick interval becomes 3333333 ns when
HZ=300.  Consequently, the tick time stamp nearest to WALT's default
window size of 20ms will be 19999998 ns (3333333 * 6).
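
The arithmetic, for reference (constant names are illustrative):

  /* 10^9 / 300 = 3333333.33... ns, truncated to 3333333 ns per tick.
   * Six ticks: 6 * 3333333 = 19999998 ns, i.e. 2 ns short of the
   * 20000000 ns (20 ms) default WALT window. */
  #define HZ300_TICK_NS           3333333ULL
  #define WALT_WINDOW_ALIGNED_NS  (6 * HZ300_TICK_NS)  /* 19999998 */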

Change-Id: I08f9bd2dbecccbb683e4490d06d8b0da703d3ab2
Suggested-by: Joel Fernandes <joelaf@google.com>
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2017-07-12 21:01:07 +00:00
Greg Kroah-Hartman
64a73ff728 This is the 4.4.76 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAllc3f0ACgkQONu9yGCS
 aT4fmA/+OHeYbhpaMRKqrUpsxB3NpROr2Z47ow6vaVjYZzd0irrODLlfIfDQ6EEo
 N3v28povu16VeYXk+4h8bsAP2K2j6/BlRaSi2hB6dmnY8GDMaXEfRojPYAlzVz50
 qnK/6152siDDarUx1h5Zc8GcmX/tEl6h3bOOxDcwLR+RvyIcWxenuR+uqRM/AV6o
 BPEiOuMu7P6LjID7KYgBTFNajVBMLrDXt4SCWdzOZmlNt0QXgKB9yw68vTcc+edC
 ZcXqa0M6nEWSDvwobbwBZhFL8H2dJjzweyjeFBgxnxgmOrRh6kvZG2wsz2c8O3/P
 g8TuMxU7siu+I3lFwKy+dgZ/1REz+6Q3oFBqXsuddrcPYu23rV6mz/GxqWy4cerb
 M4eTWz6L9vA2GoYpvBaWi0tKC9tkNM49g48Y24a6CW1O4dJWlz3RrpTiZmequbNF
 mo8EKomSXn4kYAm1xT03DGljQkK/i2JtyI5sk2hLEqqxKvZ/3q9xxLLKOVx8dPvs
 PIbfpapfYMXXMWgR6e+UKueNLgevfWE12X/OU4SgvSY4n/07/mH40XEd3zd82IsZ
 1Mw0qj3JnqCAFDBBMsDYa+OvABaGD1dHARuiv+aeqW8tqoBglFHxWqF+SQVNXLIE
 qTLiKz78vjQpH0zGpkA3HEOh/h4L7a0y3qRMECsk5SUxXsgu1gg=
 =bwNU
 -----END PGP SIGNATURE-----

Merge 4.4.76 into android-4.4

Changes in 4.4.76
	ipv6: release dst on error in ip6_dst_lookup_tail
	net: don't call strlen on non-terminated string in dev_set_alias()
	decnet: dn_rtmsg: Improve input length sanitization in dnrmg_receive_user_skb
	net: Zero ifla_vf_info in rtnl_fill_vfinfo()
	af_unix: Add sockaddr length checks before accessing sa_family in bind and connect handlers
	Fix an intermittent pr_emerg warning about lo becoming free.
	net: caif: Fix a sleep-in-atomic bug in cfpkt_create_pfx
	igmp: acquire pmc lock for ip_mc_clear_src()
	igmp: add a missing spin_lock_init()
	ipv6: fix calling in6_ifa_hold incorrectly for dad work
	net/mlx5: Wait for FW readiness before initializing command interface
	decnet: always not take dst->__refcnt when inserting dst into hash table
	net: 8021q: Fix one possible panic caused by BUG_ON in free_netdev
	sfc: provide dummy definitions of vswitch functions
	ipv6: Do not leak throw route references
	rtnetlink: add IFLA_GROUP to ifla_policy
	netfilter: xt_TCPMSS: add more sanity tests on tcph->doff
	netfilter: synproxy: fix conntrackd interaction
	NFSv4: fix a reference leak caused WARNING messages
	drm/ast: Handle configuration without P2A bridge
	mm, swap_cgroup: reschedule when neeed in swap_cgroup_swapoff()
	MIPS: Avoid accidental raw backtrace
	MIPS: pm-cps: Drop manual cache-line alignment of ready_count
	MIPS: Fix IRQ tracing & lockdep when rescheduling
	ALSA: hda - Fix endless loop of codec configure
	ALSA: hda - set input_path bitmap to zero after moving it to new place
	drm/vmwgfx: Free hash table allocated by cmdbuf managed res mgr
	usb: gadget: f_fs: Fix possibe deadlock
	sysctl: enable strict writes
	block: fix module reference leak on put_disk() call for cgroups throttle
	mm: numa: avoid waiting on freed migrated pages
	KVM: x86: fix fixing of hypercalls
	scsi: sd: Fix wrong DPOFUA disable in sd_read_cache_type
	scsi: lpfc: Set elsiocb contexts to NULL after freeing it
	qla2xxx: Fix erroneous invalid handle message
	ARM: dts: BCM5301X: Correct GIC_PPI interrupt flags
	net: mvneta: Fix for_each_present_cpu usage
	MIPS: ath79: fix regression in PCI window initialization
	net: korina: Fix NAPI versus resources freeing
	MIPS: ralink: MT7688 pinmux fixes
	MIPS: ralink: fix USB frequency scaling
	MIPS: ralink: Fix invalid assignment of SoC type
	MIPS: ralink: fix MT7628 pinmux typos
	MIPS: ralink: fix MT7628 wled_an pinmux gpio
	mtd: bcm47xxpart: limit scanned flash area on BCM47XX (MIPS) only
	bgmac: fix a missing check for build_skb
	mtd: bcm47xxpart: don't fail because of bit-flips
	bgmac: Fix reversed test of build_skb() return value.
	net: bgmac: Fix SOF bit checking
	net: bgmac: Start transmit queue in bgmac_open
	net: bgmac: Remove superflous netif_carrier_on()
	powerpc/eeh: Enable IO path on permanent error
	gianfar: Do not reuse pages from emergency reserve
	Btrfs: fix truncate down when no_holes feature is enabled
	virtio_console: fix a crash in config_work_handler
	swiotlb-xen: update dev_addr after swapping pages
	xen-netfront: Fix Rx stall during network stress and OOM
	scsi: virtio_scsi: Reject commands when virtqueue is broken
	platform/x86: ideapad-laptop: handle ACPI event 1
	amd-xgbe: Check xgbe_init() return code
	net: dsa: Check return value of phy_connect_direct()
	drm/amdgpu: check ring being ready before using
	vfio/spapr: fail tce_iommu_attach_group() when iommu_data is null
	virtio_net: fix PAGE_SIZE > 64k
	vxlan: do not age static remote mac entries
	ibmveth: Add a proper check for the availability of the checksum features
	kernel/panic.c: add missing \n
	HID: i2c-hid: Add sleep between POWER ON and RESET
	scsi: lpfc: avoid double free of resource identifiers
	spi: davinci: use dma_mapping_error()
	mac80211: initialize SMPS field in HT capabilities
	x86/mpx: Use compatible types in comparison to fix sparse error
	coredump: Ensure proper size of sparse core files
	swiotlb: ensure that page-sized mappings are page-aligned
	s390/ctl_reg: make __ctl_load a full memory barrier
	be2net: fix status check in be_cmd_pmac_add()
	perf probe: Fix to show correct locations for events on modules
	net/mlx4_core: Eliminate warning messages for SRQ_LIMIT under SRIOV
	sctp: check af before verify address in sctp_addr_id2transport
	ravb: Fix use-after-free on `ifconfig eth0 down`
	jump label: fix passing kbuild_cflags when checking for asm goto support
	xfrm: fix stack access out of bounds with CONFIG_XFRM_SUB_POLICY
	xfrm: NULL dereference on allocation failure
	xfrm: Oops on error in pfkey_msg2xfrm_state()
	watchdog: bcm281xx: Fix use of uninitialized spinlock.
	sched/loadavg: Avoid loadavg spikes caused by delayed NO_HZ accounting
	ARM64/ACPI: Fix BAD_MADT_GICC_ENTRY() macro implementation
	ARM: 8685/1: ensure memblock-limit is pmd-aligned
	x86/mpx: Correctly report do_mpx_bt_fault() failures to user-space
	x86/mm: Fix flush_tlb_page() on Xen
	ocfs2: o2hb: revert hb threshold to keep compatible
	iommu/vt-d: Don't over-free page table directories
	iommu: Handle default domain attach failure
	iommu/amd: Fix incorrect error handling in amd_iommu_bind_pasid()
	cpufreq: s3c2416: double free on driver init error path
	KVM: x86: fix emulation of RSM and IRET instructions
	KVM: x86/vPMU: fix undefined shift in intel_pmu_refresh()
	KVM: x86: zero base3 of unusable segments
	KVM: nVMX: Fix exception injection
	Linux 4.4.76

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2017-07-05 16:16:58 +02:00
Matt Fleming
6ca11db55f sched/loadavg: Avoid loadavg spikes caused by delayed NO_HZ accounting
commit 6e5f32f7a43f45ee55c401c0b9585eb01f9629a8 upstream.

If we crossed a sample window while in NO_HZ we will add LOAD_FREQ to
the pending sample window time on exit, setting the next update not
one window into the future, but two.

This situation on exiting NO_HZ is described by:

  this_rq->calc_load_update < jiffies < calc_load_update

In this scenario, what we should be doing is:

  this_rq->calc_load_update = calc_load_update		     [ next window ]

But what we actually do is:

  this_rq->calc_load_update = calc_load_update + LOAD_FREQ   [ next+1 window ]

This has the effect of delaying load average updates for potentially
up to ~9 seconds.

This can result in huge spikes in the load average values due to
per-cpu uninterruptible task counts being out of sync when accumulated
across all CPUs.

It's safe to update the per-cpu active count if we wake between sample
windows because any load that we left in 'calc_load_idle' will have
been zero'd when the idle load was folded in calc_global_load().

This issue is easy to reproduce before,

  commit 9d89c257df ("sched/fair: Rewrite runnable load and utilization average tracking")

just by forking short-lived process pipelines built from ps(1) and
grep(1) in a loop. I'm unable to reproduce the spikes after that
commit, but the bug still seems to be present from code review.

Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Fixes: commit 5167e8d ("sched/nohz: Rewrite and fix load-avg computation -- again")
Link: http://lkml.kernel.org/r/20170217120731.11868-2-matt@codeblueprint.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-05 14:37:21 +02:00
Chris Redpath
fce0ecf04a schedstats/eas: guard properly to avoid breaking non-smp schedstats users
Add appropriate #ifdef guards to ensure the SMP-only eas stats structs
are not used when SMP is not enabled. Arnd got a report from buildbot,
analysed it, and pointed out exactly what the issue was.

Reported-by: "Arnd Bergmann" <arnd@arndb.de>
Suggested-by: "Arnd Bergmann" <arnd@arndb.de>
Fixes: 4b85765a3d ("sched/fair: Add eas (& cas)
 specific rq, sd and task stats")
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Change-Id: I60554dea20137f6774db3f59b4afd40a06554cfc
2017-06-03 15:03:03 +01:00
Chris Redpath
c47d00b57b sched/tune: don't use schedtune before it is ready
When EAS is enabled during boot, we have to be careful not to use
schedtune from fair.c before it is ready; otherwise it will warn us and
we'll get a backtrace in the console.

Change-Id: I1a5cf29b18af626545c636c51219f9ed497c19fa
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2017-06-02 08:01:55 -07:00
Patrick Bellasi
9e3c04bef7 sched/fair: use SCHED_CAPACITY_SCALE for energy normalization
Change-Id: I686d26975f4a7dd830ff8441ff986e35461a7d55
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Srinath Sridharan <srinathsr@google.com>
2017-06-02 08:01:55 -07:00
Patrick Bellasi
7b8577d94c sched/{fair,tune}: use reciprocal_value to compute boost margin
Change-Id: I493b07360c46eee0b72c2a046dab9ec6cb3427ef
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Srinath Sridharan <srinathsr@google.com>
2017-06-02 08:01:55 -07:00
Srinath Sridharan
41d9288e3e sched/tune: Initialize raw_spin_lock in boosted_groups
bug: 32668852
Change-Id: Ice96230d88939d5973b1b6310085d1b3df9c47d9
Signed-off-by: Srinath Sridharan <srinathsr@google.com>
2017-06-02 08:01:55 -07:00
Patrick Bellasi
3757f95741 sched/tune: report when SchedTune has not been initialized
Change-Id: Iba4e5e3d220451f04272d555e6b8e0af83a7f09d
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Srinath Sridharan <srinathsr@google.com>
2017-06-02 08:01:55 -07:00
Chris Redpath
f9b83b3e6e sched/tune: fix sched_energy_diff tracepoint
The sched_energy_diff tracepoint is in a place where it can never trace
payoff or nrg.delta. If CONFIG_SCHED_TUNE is enabled, put it in a place
where those values exist. If it is not enabled, trace from the current
location.

Change-Id: Id5442f2b34ec76625491d27c0f4285433ca12699
Reported-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2017-06-02 08:01:55 -07:00
Chris Redpath
2e829cf17f sched/tune: increase group count to 5
We use 5 groups everywhere else; this should default to the same.

Change-Id: I05a20bdcf8046ea90a2e36979940cef11246e735
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
2017-06-02 08:01:55 -07:00
Chris Redpath
4c031f0e6f cpufreq/schedutil: use boosted_cpu_util for PELT to match WALT
When using WALT we always used boosted cpu util for OPP selection.
This is the primary purpose for boosted cpu util, but we hadn't
changed the PELT utilization check to do the same thing.

Fix that here.

Change-Id: Id5ffb26eac23b25fe754255221f6d21b8cededfd
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2017-06-02 08:01:55 -07:00
Morten Rasmussen
fc969e3bfa sched/fair: Fix sched_group_energy() to support per-cpu capacity states
sched_group_energy() was supposed to support per-cpu capacity states
(DVFS), however, while fixing a hotplug issue this was broken as we bail
out if there is no SD_SHARE_CAP_STATES flag set.

This patch implements the hotplug race check differently and should
therefore reinstate support for per-cpu capacity states.

Change-Id: I5b865666c9ce833dcfa6514c574580d75aa0a195
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
2017-06-02 08:01:55 -07:00
Valentin Schneider
fef0112a63 sched/fair: discount task contribution to find CPU with lowest utilization
In some cases, the new_util of a task can be the same on several
CPUs. This causes an issue because the target_util is only updated
if the current new_util is strictly smaller than target_util.

To fix that, the cpu_util_wake() return value is used alongside the
new_util value. If two CPUs compute the same new_util value,
we'll now also look at their cpu_util_wake() return value. In this
case, the CPU that last ran the task will be preferred.
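
A sketch of the extended comparison inside the candidate loop (variable
names follow the commit text; surrounding bookkeeping elided):

  /* On equal new_util, a lower cpu_util_wake() means this CPU still
   * carries the task's own (discounted) contribution, i.e. the task
   * ran here most recently. */
  if (new_util < best_util ||
      (new_util == best_util && wake_util < best_wake_util)) {
          best_util = new_util;
          best_wake_util = wake_util;
          target_cpu = i;
  }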

Change-Id: Ia1ea2c4b3ec39621372c2f748862317d5b497723
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
2017-06-02 08:01:54 -07:00
Chris Redpath
83f462daa3 sched/fair: ensure utilization signals are synchronized before use
wake_cap performs task and cpu utilization synchronization which is
what allows us to subtract current task util from prev_cpu util and
have a sensible number to work with.

It looks as though if wake_wide returns 0, we could potentially not
execute wake_cap, which would result in unsynced signals we then use
for energy calculations.

This is not necessarily an issue we've seen in traces, but it looks
as though it should be changed.

Change-Id: Ic54a3cba2a10d946ea20113a04371dea04115e82
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2017-06-02 08:01:54 -07:00
Chris Redpath
8865f07600 sched/fair: remove task util from own cpu when placing waking task
When we place a waking task with find_best_target, we calculate the
existing and new utilisation of each candidate cpu. However, we do
not remove any blocked load resulting from the waking task on the
previous cpu, which might cause unnecessary migrations.

Switch to using cpu_util_wake which does this for us, which requires
moving cpu_util_wake a few functions earlier.

Also, we have multiple potential cpu utilization signals here, so
update the necessary bits to allow WALT to work properly (including
not subtracting task util for WALT).

When WALT is in use, cpu utilization is the utilization
in the previous completed window, whilst the task utilization
ignores fully idle windows. There seems to be no way to have a
decently accurate estimate of how much (if any) utilization from
this task remains on the prev cpu.

Instead, just return cpu_util when we're using WALT.
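
A sketch of cpu_util_wake() as described (simplified; the real function
also handles WALT being compiled in but disabled at runtime):

  static unsigned long cpu_util_wake(int cpu, struct task_struct *p)
  {
          unsigned long util;

  #ifdef CONFIG_SCHED_WALT
          /* No reliable estimate of the task's residue remains, so
           * just report the previous window's utilization. */
          return cpu_util(cpu);
  #endif
          /* Only discount p if it last ran here and isn't currently
           * being migrated. */
          if (cpu != task_cpu(p) || !p->se.avg.last_update_time)
                  return cpu_util(cpu);

          util = cpu_rq(cpu)->cfs.avg.util_avg;
          return max_t(long, util - task_util(p), 0);
  }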

Change-Id: I448203ab98ffb5c020dfb6b218581eef1f5601f7
Reported-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2017-06-02 08:01:54 -07:00
Chris Redpath
8ac52cbaf4 trace:sched: Make util_avg in load_avg trace reflect PELT/WALT as used
With the ability to choose between WALT and PELT for utilisation tracking
we can have the situation where we're using WALT to make all the
decisions and reporting PELT figures in the sched_load_avg_(cpu|task)
trace points. This is not too much of an issue, but when analysing trace
it is nice to see numbers representing what the scheduler is using rather
than needing to add in additional sched_walt_* traces to figure it out.

Add reporting for both types, and make the util_avg member reflect what
will be seen from cpu or task_util functions in the scheduler.

Change-Id: I2abbd2c5fa70822096d0f3372b4c12b1c6af1590
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2017-06-02 08:01:54 -07:00
Dietmar Eggemann
4b85765a3d sched/fair: Add eas (& cas) specific rq, sd and task stats
The statistics counters are placed in the eas (& cas) wakeup path. Each
of them has one representation for the runqueue (rq), the sched_domain
(sd) and the task.
A task counter is always incremented. An rq counter is always
incremented for the rq the scheduler is currently running on. An sd
counter is only incremented if a relation to an sd exists.

The counters are exposed:

(1) In /proc/schedstat for rq's and sd's:

$ cat /proc/schedstat
...
cpu0 71422 0 2321254 ...
eas  44144 0 0 19446 0 24698 568435 51621 156932 133 222011 17459 120279 516814 83 0 156962 359235 176439 139981
  <- runqueue for cpu0
...
domain0 3 42430 42331 ...
eas 0 0 0 14200 0 0 0 0 0 0 0 0 0 0 0 0 0 0 66355 0  <- MC sched domain for cpu0
...

The per-cpu eas vector has the following elements:

sis_attempts  sis_idle   sis_cache_affine sis_suff_cap    sis_idle_cpu    sis_count               ||
secb_attempts secb_sync  secb_idle_bt     secb_insuff_cap secb_no_nrg_sav secb_nrg_sav secb_count ||
fbt_attempts  fbt_no_cpu fbt_no_sd        fbt_pref_idle   fbt_count                               ||
cas_attempts  cas_count

The following relations exist between these counters (from cpu0 eas
vector above):

sis_attempts = sis_idle + sis_cache_affine + sis_suff_cap + sis_idle_cpu + sis_count

44144        = 0        + 0                + 19446        + 0            + 24698

secb_attempts = secb_sync + secb_idle_bt + secb_insuff_cap + secb_no_nrg_sav + secb_nrg_sav + secb_count

568435        = 51621     + 156932       + 133             + 222011          + 17459        + 120279

fbt_attempts = fbt_no_cpu + fbt_no_sd + fbt_pref_idle + fbt_count + (return -1)

516814       = 83         + 0         + 156962        + 359235    + (534)

cas_attempts = cas_count + (return -1 or smp_processor_id())

176439       = 139981    + (36458)

(2) In /proc/$PROCESS_PID/task/$TASK_PID/sched for a task.

example: main thread of system_server

$ cat /proc/1083/task/1083/sched

...
se.statistics.nr_wakeups_sis_attempts        :                  945
se.statistics.nr_wakeups_sis_idle            :                    0
se.statistics.nr_wakeups_sis_cache_affine    :                    0
se.statistics.nr_wakeups_sis_suff_cap        :                  219
se.statistics.nr_wakeups_sis_idle_cpu        :                    0
se.statistics.nr_wakeups_sis_count           :                  726
se.statistics.nr_wakeups_secb_attempts       :                10376
se.statistics.nr_wakeups_secb_sync           :                 1462
se.statistics.nr_wakeups_secb_idle_bt        :                 6984
se.statistics.nr_wakeups_secb_insuff_cap     :                    3
se.statistics.nr_wakeups_secb_no_nrg_sav     :                  927
se.statistics.nr_wakeups_secb_nrg_sav        :                  206
se.statistics.nr_wakeups_secb_count          :                  794
se.statistics.nr_wakeups_fbt_attempts        :                 8914
se.statistics.nr_wakeups_fbt_no_cpu          :                    0
se.statistics.nr_wakeups_fbt_no_sd           :                    0
se.statistics.nr_wakeups_fbt_pref_idle       :                 6987
se.statistics.nr_wakeups_fbt_count           :                 1554
se.statistics.nr_wakeups_cas_attempts        :                 3107
se.statistics.nr_wakeups_cas_count           :                 1195
...

The same relation between the counters as in the per-cpu case apply.

Change-Id: Ie7d01267c78a3f41f60a3ef52917d5a5d463f195
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2017-06-02 08:01:54 -07:00
Andres Oportus
aa8882923a sched/core: Fix PELT jump to max OPP upon util increase
Change-Id: Ic80b588ec466ef707f658dcea039fd0d6b384b63
Signed-off-by: Andres Oportus <andresoportus@google.com>
2017-06-02 08:01:54 -07:00
Dietmar Eggemann
55af384815 sched: EAS & 'single cpu per cluster'/cpu hotplug interoperability
For Energy-Aware Scheduling (EAS) to work properly, even in the
case that there is only one cpu per cluster or that cpus are hot-plugged
out, the Energy Model (EM) data on all energy-aware sched domains (sd)
has to be present for all online cpus.

Mainline sd hierarchy setup code will remove sd's which are not useful
for task scheduling e.g. in the following situations:

1. Only 1 cpu is/remains in one cluster of a multi cluster system.

   This remaining cpu only has DIE and no MC sd.

2. A complete cluster in a two cluster system is hot-plugged out.

   The cpus of the remaining cluster only have MC and no DIE sd.

To make sure that all online cpus keep all their energy-aware sd's,
the sd degenerate functionality has been changed to not free a sd if
its first sched group (sg) contains EM data in case:

1. There is only 1 cpu left in the sd.

2. There have to be at least 2 sg's if certain sd flags are set.

Instead of freeing such a sd it now clears only its SD_LOAD_BALANCE
flag. This will make sure that the EAS functionality will always see
all energy-aware sd's for all online cpus.

It will introduce a tiny performance degradation for operations on
affected cpus since the hot-path macro for_each_domain() has to deal
with sd's not contributing to task scheduling at all now.

In most cases the existing code makes sure that task scheduling is not
invoked on a sd with !SD_LOAD_BALANCE.

However, a small change is necessary in update_sd_lb_stats() to make
sure that sd->parent is only initialized to !NULL in case the parent sd
contains more than 1 sg.

The handling of newidle decay values before the SD_LOAD_BALANCE check in
rebalance_domains() stays unchanged.

Test (w/ CONFIG_SCHED_DEBUG):

JUNO r0 default system:

$ cat /proc/cpuinfo | grep "^CPU part"
CPU part        : 0xd03
CPU part        : 0xd07
CPU part        : 0xd07
CPU part        : 0xd03
CPU part        : 0xd03
CPU part        : 0xd03

SD names and flags:

$ cat /proc/sys/kernel/sched_domain/cpu*/domain*/name
MC
DIE
MC
DIE
MC
DIE
MC
DIE
MC
DIE
MC
DIE

$ printf "%x\n" `cat /proc/sys/kernel/sched_domain/cpu*/domain*/flags`
832f
102f
832f
102f
832f
102f
832f
102f
832f
102f
832f
102f

Test 1: Hotplug-out one A57 (CPU part 0xd07) cpu:

$ echo 0 > /sys/devices/system/cpu/cpu1/online

$ cat /proc/cpuinfo | grep "^CPU part"
CPU part        : 0xd03
CPU part        : 0xd07
CPU part        : 0xd03
CPU part        : 0xd03
CPU part        : 0xd03

SD names and flags for remaining A57 (cpu2) cpu:

$ cat /proc/sys/kernel/sched_domain/cpu2/domain*/name
MC
DIE

$ printf "%x\n" `cat /proc/sys/kernel/sched_domain/cpu2/domain*/flags`
832e <-- MC SD with !SD_LOAD_BALANCE
102f

Test 2: Hotplug-out the entire A57 cluster:

$ echo 0 > /sys/devices/system/cpu/cpu1/online
$ echo 0 > /sys/devices/system/cpu/cpu2/online

$ cat /proc/cpuinfo | grep "^CPU part"
CPU part        : 0xd03
CPU part        : 0xd03
CPU part        : 0xd03
CPU part        : 0xd03

SD names and flags for the remaining A53 (CPU part 0xd03) cluster:

$ cat /proc/sys/kernel/sched_domain/cpu*/domain*/name
MC
DIE
MC
DIE
MC
DIE
MC
DIE

$ printf "%x\n" `cat /proc/sys/kernel/sched_domain/cpu*/domain*/flags`
832f
102e <-- DIE SD with !SD_LOAD_BALANCE
832f
102e
832f
102e
832f
102e

Change-Id: If24aa2b2628f334abbf0207d39e2a86168d9d673
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
2017-06-02 08:01:54 -07:00