in_interrupt() semantics are confusing and wrong for most users as it
also returns true when bh is disabled. Thus we open coded a proper
check for interrupts in __sanitizer_cov_trace_pc() with a lengthy
explanatory comment.
Use the new in_task() predicate instead.
Link: http://lkml.kernel.org/r/20170321091026.139655-1-dvyukov@google.com
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: James Morse <james.morse@arm.com>
Cc: Alexander Popov <alex.popov@linux.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Bug: 64145065
(cherry-picked from f61e869d519c0c11a8d80a503cfdfb4897df855a)
Change-Id: Ice260535314238c8f82ddc578ecaeea6177d28fc
Signed-off-by: Paul Lawrence <paullawrence@google.com>
It is fragile that some definitions acquired via transitive
dependencies, as shown in below:
atomic_* (<linux/atomic.h>)
ENOMEM/EN* (<linux/errno.h>)
EXPORT_SYMBOL (<linux/export.h>)
device_initcall (<linux/init.h>)
preempt_* (<linux/preempt.h>)
Include them to prevent possible issues.
Link: http://lkml.kernel.org/r/1481163221-40170-1-git-send-email-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Suggested-by: Mark Rutland <mark.rutland@arm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: James Morse <james.morse@arm.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Bug: 64145065
(cherry-picked from db862358a4a96f52d3b0c713c703828f90d97de9)
Change-Id: Ia529631d2072cc795c46ae0276e51592318cd40f
Signed-off-by: Paul Lawrence <paullawrence@google.com>
In __sanitizer_cov_trace_pc we use task_struct and fields within it, but
as we haven't included <linux/sched.h>, it is not guaranteed to be
defined. While we usually happen to acquire the definition through a
transitive include, this is fragile (and hasn't been true in the past,
causing issues with backports).
Include <linux/sched.h> to avoid any fragility.
[mark.rutland@arm.com: rewrote changelog]
Link: http://lkml.kernel.org/r/1481007384-27529-1-git-send-email-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: James Morse <james.morse@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Bug: 64145065
(cherry-picked from 166ad0e1e2132ff0cda08b94af8301655fcabbcd)
Change-Id: Id5a06b927687ace8788f623fc91cc8305fde5f2d
Signed-off-by: Paul Lawrence <paullawrence@google.com>
in_interrupt() returns a nonzero value when we are either in an
interrupt or have bh disabled via local_bh_disable(). Since we are
interested in only ignoring coverage from actual interrupts, do a proper
check instead of just calling in_interrupt().
As a result of this change, kcov will start to collect coverage from
within local_bh_disable()/local_bh_enable() sections.
Link: http://lkml.kernel.org/r/1476115803-20712-1-git-send-email-andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Acked-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Nicolai Stange <nicstange@gmail.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: James Morse <james.morse@arm.com>
Cc: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Bug: 64145065
(cherry-picked from b274c0bb394c6a69ac12feac7c2db81f5aff5a55)
Change-Id: I91364ef699f9af0a57959caf65372a6559d7a6b0
Signed-off-by: Paul Lawrence <paullawrence@google.com>
Kcov causes the compiler to add a call to __sanitizer_cov_trace_pc() in
every basic block. Ftrace patches in a call to _mcount() to each
function it has annotated.
Letting these mechanisms annotate each other is a bad thing. Break the
loop by adding 'notrace' to __sanitizer_cov_trace_pc() so that ftrace
won't try to patch this code.
This patch lets arm64 with KCOV and STACK_TRACER boot.
Signed-off-by: James Morse <james.morse@arm.com>
Acked-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Alexander Potapenko <glider@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Bug: 64145065
(cherry-picked from bdab42dfc974d15303afbf259f340f374a453974)
Change-Id: If7708ca761f81e0645d709b263e61493fc016e01
Signed-off-by: Paul Lawrence <paullawrence@google.com>
kcov provides code coverage collection for coverage-guided fuzzing
(randomized testing). Coverage-guided fuzzing is a testing technique
that uses coverage feedback to determine new interesting inputs to a
system. A notable user-space example is AFL
(http://lcamtuf.coredump.cx/afl/). However, this technique is not
widely used for kernel testing due to missing compiler and kernel
support.
kcov does not aim to collect as much coverage as possible. It aims to
collect more or less stable coverage that is function of syscall inputs.
To achieve this goal it does not collect coverage in soft/hard
interrupts and instrumentation of some inherently non-deterministic or
non-interesting parts of kernel is disbled (e.g. scheduler, locking).
Currently there is a single coverage collection mode (tracing), but the
API anticipates additional collection modes. Initially I also
implemented a second mode which exposes coverage in a fixed-size hash
table of counters (what Quentin used in his original patch). I've
dropped the second mode for simplicity.
This patch adds the necessary support on kernel side. The complimentary
compiler support was added in gcc revision 231296.
We've used this support to build syzkaller system call fuzzer, which has
found 90 kernel bugs in just 2 months:
https://github.com/google/syzkaller/wiki/Found-Bugs
We've also found 30+ bugs in our internal systems with syzkaller.
Another (yet unexplored) direction where kcov coverage would greatly
help is more traditional "blob mutation". For example, mounting a
random blob as a filesystem, or receiving a random blob over wire.
Why not gcov. Typical fuzzing loop looks as follows: (1) reset
coverage, (2) execute a bit of code, (3) collect coverage, repeat. A
typical coverage can be just a dozen of basic blocks (e.g. an invalid
input). In such context gcov becomes prohibitively expensive as
reset/collect coverage steps depend on total number of basic
blocks/edges in program (in case of kernel it is about 2M). Cost of
kcov depends only on number of executed basic blocks/edges. On top of
that, kernel requires per-thread coverage because there are always
background threads and unrelated processes that also produce coverage.
With inlined gcov instrumentation per-thread coverage is not possible.
kcov exposes kernel PCs and control flow to user-space which is
insecure. But debugfs should not be mapped as user accessible.
Based on a patch by Quentin Casasnovas.
[akpm@linux-foundation.org: make task_struct.kcov_mode have type `enum kcov_mode']
[akpm@linux-foundation.org: unbreak allmodconfig]
[akpm@linux-foundation.org: follow x86 Makefile layout standards]
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: syzkaller <syzkaller@googlegroups.com>
Cc: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Tavis Ormandy <taviso@google.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Cc: Kostya Serebryany <kcc@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Kees Cook <keescook@google.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: David Drysdale <drysdale@google.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Bug: 64145065
(cherry-picked from 5c9a8750a6409c63a0f01d51a9024861022f6593)
Change-Id: I17b5e04f6e89b241924e78ec32ead79c38b860ce
Signed-off-by: Paul Lawrence <paullawrence@google.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlo06IYACgkQONu9yGCS
aT6M4hAAhACzW/fsu/NDmfsx8qroVSfugMaZd2kWd1Hne6lx4SXK/Fy61UFRLC04
oImmBfzkkDekMg3wserA+pQmUaB1ZZl3wowh7J1M9wgfNdaNvPe5mN/9tU+LRGKH
wOjZT1UWZ9Vf4a2JavsyujIL+H7QiOrsvZMaOKdUjD+chg3wexIQFoYg3NdE+wPZ
/Rhztxvuj+yBG6zZl3Ws9y55suq2NATcltpiW4bbVZf5i2cMA3en/ugsGpWuB/UO
IF2cnqzgernOpkkzVGFbXd0ePH8MhLxEiMMm+cVoE5xDGM0M7HMCePiPc66yOyYy
4axU5KiVRRe1y0a0QDWGOO9MNPX1q0AE2Gy6B6p3nlOVvA5LO9mW1mI9gGY1yH5/
Cfr9GqE9N/SmHQdLVGq8SFMKDdrOfxqyaFTOdTzMxa3TQX3qNYhoUWxcWmDVeMGY
hNCqS1wTQ8Pp3ZH7VREm/kGpLFmcIe7vaERzhZYyXGU9cE+o2REWIJzx4W5pSH3D
qaw9V+vN7aiep9TzP7G8TibXszW3j07+I7K4Ua3wBAfnbJR4hUcsExROBr/oV1+m
klzq/xoj5L1m6x4Jf5avvaW5ykbnzKIeX3urALrW4qqnd3nyrir0w9Ja1YeBymMz
56uGu8vqb02TZySPky7sSRnAyctEBP4SUL4vuudDRxIm+mbNors=
=ZyVC
-----END PGP SIGNATURE-----
Merge 4.4.106 into android-4.4
Changes in 4.4.106
can: ti_hecc: Fix napi poll return value for repoll
can: kvaser_usb: free buf in error paths
can: kvaser_usb: Fix comparison bug in kvaser_usb_read_bulk_callback()
can: kvaser_usb: ratelimit errors if incomplete messages are received
can: kvaser_usb: cancel urb on -EPIPE and -EPROTO
can: ems_usb: cancel urb on -EPIPE and -EPROTO
can: esd_usb2: cancel urb on -EPIPE and -EPROTO
can: usb_8dev: cancel urb on -EPIPE and -EPROTO
virtio: release virtio index when fail to device_register
hv: kvp: Avoid reading past allocated blocks from KVP file
isa: Prevent NULL dereference in isa_bus driver callbacks
scsi: libsas: align sata_device's rps_resp on a cacheline
efi: Move some sysfs files to be read-only by root
ASN.1: fix out-of-bounds read when parsing indefinite length item
ASN.1: check for error from ASN1_OP_END__ACT actions
X.509: reject invalid BIT STRING for subjectPublicKey
x86/PCI: Make broadcom_postcore_init() check acpi_disabled
ALSA: pcm: prevent UAF in snd_pcm_info
ALSA: seq: Remove spurious WARN_ON() at timer check
ALSA: usb-audio: Fix out-of-bound error
ALSA: usb-audio: Add check return value for usb_string()
iommu/vt-d: Fix scatterlist offset handling
s390: fix compat system call table
kdb: Fix handling of kallsyms_symbol_next() return value
drm: extra printk() wrapper macros
drm/exynos: gem: Drop NONCONTIG flag for buffers allocated without IOMMU
media: dvb: i2c transfers over usb cannot be done from stack
arm64: KVM: fix VTTBR_BADDR_MASK BUG_ON off-by-one
KVM: VMX: remove I/O port 0x80 bypass on Intel hosts
arm64: fpsimd: Prevent registers leaking from dead tasks
ARM: BUG if jumping to usermode address in kernel mode
ARM: avoid faulting on qemu
scsi: storvsc: Workaround for virtual DVD SCSI version
thp: reduce indentation level in change_huge_pmd()
thp: fix MADV_DONTNEED vs. numa balancing race
mm: drop unused pmdp_huge_get_and_clear_notify()
Revert "drm/armada: Fix compile fail"
Revert "spi: SPI_FSL_DSPI should depend on HAS_DMA"
Revert "s390/kbuild: enable modversions for symbols exported from asm"
vti6: Don't report path MTU below IPV6_MIN_MTU.
ARM: OMAP2+: gpmc-onenand: propagate error on initialization failure
x86/hpet: Prevent might sleep splat on resume
selftest/powerpc: Fix false failures for skipped tests
module: set __jump_table alignment to 8
ARM: OMAP2+: Fix device node reference counts
ARM: OMAP2+: Release device node after it is no longer needed.
gpio: altera: Use handle_level_irq when configured as a level_high
HID: chicony: Add support for another ASUS Zen AiO keyboard
usb: gadget: configs: plug memory leak
USB: gadgetfs: Fix a potential memory leak in 'dev_config()'
kvm: nVMX: VMCLEAR should not cause the vCPU to shut down
libata: drop WARN from protocol error in ata_sff_qc_issue()
workqueue: trigger WARN if queue_delayed_work() is called with NULL @wq
scsi: lpfc: Fix crash during Hardware error recovery on SLI3 adapters
irqchip/crossbar: Fix incorrect type of register size
KVM: nVMX: reset nested_run_pending if the vCPU is going to be reset
arm: KVM: Survive unknown traps from guests
arm64: KVM: Survive unknown traps from guests
spi_ks8995: fix "BUG: key accdaa28 not in .data!"
bnx2x: prevent crash when accessing PTP with interface down
bnx2x: fix possible overrun of VFPF multicast addresses array
bnx2x: do not rollback VF MAC/VLAN filters we did not configure
ipv6: reorder icmpv6_init() and ip6_mr_init()
crypto: s5p-sss - Fix completing crypto request in IRQ handler
i2c: riic: fix restart condition
zram: set physical queue limits to avoid array out of bounds accesses
netfilter: don't track fragmented packets
axonram: Fix gendisk handling
drm/amd/amdgpu: fix console deadlock if late init failed
powerpc/powernv/ioda2: Gracefully fail if too many TCE levels requested
EDAC, i5000, i5400: Fix use of MTR_DRAM_WIDTH macro
EDAC, i5000, i5400: Fix definition of NRECMEMB register
kbuild: pkg: use --transform option to prefix paths in tar
mac80211_hwsim: Fix memory leak in hwsim_new_radio_nl()
route: also update fnhe_genid when updating a route cache
route: update fnhe_expires for redirect when the fnhe exists
lib/genalloc.c: make the avail variable an atomic_long_t
dynamic-debug-howto: fix optional/omitted ending line number to be LARGE instead of 0
NFS: Fix a typo in nfs_rename()
sunrpc: Fix rpc_task_begin trace point
block: wake up all tasks blocked in get_request()
sparc64/mm: set fields in deferred pages
sctp: do not free asoc when it is already dead in sctp_sendmsg
sctp: use the right sk after waking up from wait_buf sleep
atm: horizon: Fix irq release error
jump_label: Invoke jump_label_test() via early_initcall()
xfrm: Copy policy family in clone_policy
IB/mlx4: Increase maximal message size under UD QP
IB/mlx5: Assign send CQ and recv CQ of UMR QP
afs: Connect up the CB.ProbeUuid
ipvlan: fix ipv6 outbound device
audit: ensure that 'audit=1' actually enables audit for PID 1
ipmi: Stop timers before cleaning up the module
s390: always save and restore all registers on context switch
more bio_map_user_iov() leak fixes
tipc: fix memory leak in tipc_accept_from_sock()
rds: Fix NULL pointer dereference in __rds_rdma_map
sit: update frag_off info
packet: fix crash in fanout_demux_rollover()
net/packet: fix a race in packet_bind() and packet_notifier()
Revert "x86/efi: Build our own page table structures"
Revert "x86/efi: Hoist page table switching code into efi_call_virt()"
Revert "x86/mm/pat: Ensure cpa->pfn only contains page frame numbers"
arm: KVM: Fix VTTBR_BADDR_MASK BUG_ON off-by-one
usb: gadget: ffs: Forbid usb_ep_alloc_request from sleeping
Linux 4.4.106
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
[ Upstream commit 173743dd99a49c956b124a74c8aacb0384739a4c ]
Prior to this patch we enabled audit in audit_init(), which is too
late for PID 1 as the standard initcalls are run after the PID 1 task
is forked. This means that we never allocate an audit_context (see
audit_alloc()) for PID 1 and therefore miss a lot of audit events
generated by PID 1.
This patch enables audit as early as possible to help ensure that when
PID 1 is forked it can allocate an audit_context if required.
Reviewed-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 92ee46efeb505ead3ab06d3c5ce695637ed5f152 ]
Fengguang Wu reported that running the rcuperf test during boot can cause
the jump_label_test() to hit a WARN_ON(). The issue is that the core jump
label code relies on kernel_text_address() to detect when it can no longer
update branches that may be contained in __init sections. The
kernel_text_address() in turn assumes that if the system_state variable is
greter than or equal to SYSTEM_RUNNING then __init sections are no longer
valid (since the assumption is that they have been freed). However, when
rcuperf is setup to run in early boot it can call kernel_power_off() which
sets the system_state to SYSTEM_POWER_OFF.
Since rcuperf initialization is invoked via a module_init(), we can make
the dependency of jump_label_test() needing to complete before rcuperf
explicit by calling it via early_initcall().
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Jason Baron <jbaron@akamai.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1510609727-2238-1-git-send-email-jbaron@akamai.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 637fdbae60d6cb9f6e963c1079d7e0445c86ff7d ]
If queue_delayed_work() gets called with NULL @wq, the kernel will
oops asynchronuosly on timer expiration which isn't too helpful in
tracking down the offender. This actually happened with smc.
__queue_delayed_work() already does several input sanity checks
synchronously. Add NULL @wq check.
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Link: http://lkml.kernel.org/r/20170227171439.jshx3qplflyrgcv7@codemonkey.org.uk
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit c07d35338081d107e57cf37572d8cc931a8e32e2 upstream.
kallsyms_symbol_next() returns a boolean (true on success). Currently
kdb_read() tests the return value with an inequality that
unconditionally evaluates to true.
This is fixed in the obvious way and, since the conditional branch is
supposed to be unreachable, we also add a WARN_ON().
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
KASAN needs to know whether the allocation happens in an IRQ handler.
This lets us strip everything below the IRQ entry point to reduce the
number of unique stack traces needed to be stored.
Move the definition of __irq_entry to <linux/interrupt.h> so that the
users don't need to pull in <linux/ftrace.h>. Also introduce the
__softirq_entry macro which is similar to __irq_entry, but puts the
corresponding functions to the .softirqentry.text section.
Signed-off-by: Alexander Potapenko <glider@google.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrey Konovalov <adech.fo@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Konstantin Serebryany <kcc@google.com>
Cc: Dmitry Chernenkov <dmitryc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Bug: 64145065
(cherry-picked from be7635e7287e0e8013af3c89a6354a9e0182594c)
Change-Id: Ib321eb9c2b76ef4785cf3fd522169f524348bd9a
Signed-off-by: Paul Lawrence <paullawrence@google.com>
For upmigrating misfit running task case, the currently running
task's util has been counted into cpu_util(). Thus currently
__cpu_overutilized() which add task's uitl twice is overestimated.
Signed-off-by: Ke Wang <ke.wang@spreadtrum.com>
'cached_raw_freq' is used to get the next frequency quickly but should
always be in sync with sg_policy->next_freq. There are cases where it is
not and in such cases it should be reset to avoid switching to incorrect
frequencies.
Consider this case for example:
- policy->cur is 1.2 GHz (Max)
- New request comes for 780 MHz and we store that in cached_raw_freq.
- Based on 780 MHz, we calculate the effective frequency as 800 MHz.
- We then decide not to update the frequency as
sugov_up_down_rate_limit() return true.
- Here cached_raw_freq is 780 MHz and sg_policy->next_freq is 1.2 GHz.
- Now if the utilization doesn't change in next request, then the next
target frequency will still be 780 MHz and it will match with
cached_raw_freq and so we will directly return 1.2 GHz instead of 800
MHz.
BACKPORT of upstream commit 07458f6a5171 ("cpufreq: schedutil: Reset
cached_raw_freq when not in sync with next_freq").
This also updates sugov_update_commit() for handling up/down tunables, which
aren't present in mainline.
Change-Id: I70bca2c5dfdb545a0471d1c9e4c5addb30ab5494
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlofw0sACgkQONu9yGCS
aT4MPBAAo85uk2d6CXKRkNl3qKWtiStKXUet+NJFVr4GotOeg6ul9yul5jcs4pvl
BJYnBh2LE77oDCOUKaSKI/0nDOHJs9n5m8GxjvG6cAvfn9RdgNm6kCCxNQFEhpNT
IrmRrmCMd3aKPNrdz2Cbd4qHzNr0JuIv/bykNHDA/rw+PkQeLzZgiGIw9ftg1yHJ
npzNLCjfVDPRy4qUCDYSS7+p83oHpWq3tHfha7M1S5HphsjVWjG79ABIKkN8w86z
5KnY3dqt5tqO4w0gZzKXv0gg4IJS62YqeJbF/dSefASvnBkINIzxBOEu0+xOFQ5t
ezKkukpe8ivX4eUP2ruF9jAjVLCPYCm6UaWbYQZBAAf04KHC09uXDjB4wdGCINt6
tdOgfm60OsPHUFjx9KBn8M81Iabq8DYNubp+naG2U/j7lGzh3+mvyAlzQKetXMct
b69skOxrjfT+2cCYeqz0UupHJigi5VLjX8hjpraXJA9oEwdS5gr9CfckEN3aUysu
YmQ2LtgGuglUdV3Lc4QptFxRDoKna3E/Gx6rzMDPtRdV1L6dn9CULRz+Pw4T+nWl
m6Ly9QXJVmC+d6fPW7cOEytPKRIqAUHSXQZxcPNPEcaPxD9CPWGO6TJLanc0BNYS
g7u9kLA2fWmWnAkvEosP8lxJlQvgorhkXdCpEWuL+mAbnaImpts=
=2wPT
-----END PGP SIGNATURE-----
Merge 4.4.103 into android-4.4
Changes in 4.4.103
s390: fix transactional execution control register handling
s390/runtime instrumention: fix possible memory corruption
s390/disassembler: add missing end marker for e7 table
s390/disassembler: increase show_code buffer size
ipv6: only call ip6_route_dev_notify() once for NETDEV_UNREGISTER
AF_VSOCK: Shrink the area influenced by prepare_to_wait
vsock: use new wait API for vsock_stream_sendmsg()
sched: Make resched_cpu() unconditional
lib/mpi: call cond_resched() from mpi_powm() loop
x86/decoder: Add new TEST instruction pattern
ARM: 8722/1: mm: make STRICT_KERNEL_RWX effective for LPAE
ARM: 8721/1: mm: dump: check hardware RO bit for LPAE
MIPS: ralink: Fix MT7628 pinmux
MIPS: ralink: Fix typo in mt7628 pinmux function
ALSA: hda: Add Raven PCI ID
dm bufio: fix integer overflow when limiting maximum cache size
dm: fix race between dm_get_from_kobject() and __dm_destroy()
MIPS: Fix an n32 core file generation regset support regression
MIPS: BCM47XX: Fix LED inversion for WRT54GSv1
autofs: don't fail mount for transient error
nilfs2: fix race condition that causes file system corruption
eCryptfs: use after free in ecryptfs_release_messaging()
bcache: check ca->alloc_thread initialized before wake up it
isofs: fix timestamps beyond 2027
NFS: Fix typo in nomigration mount option
nfs: Fix ugly referral attributes
nfsd: deal with revoked delegations appropriately
rtlwifi: rtl8192ee: Fix memory leak when loading firmware
rtlwifi: fix uninitialized rtlhal->last_suspend_sec time
ata: fixes kernel crash while tracing ata_eh_link_autopsy event
ext4: fix interaction between i_size, fallocate, and delalloc after a crash
ALSA: pcm: update tstamp only if audio_tstamp changed
ALSA: usb-audio: Add sanity checks to FE parser
ALSA: usb-audio: Fix potential out-of-bound access at parsing SU
ALSA: usb-audio: Add sanity checks in v2 clock parsers
ALSA: timer: Remove kernel warning at compat ioctl error paths
ALSA: hda/realtek - Fix ALC700 family no sound issue
fix a page leak in vhost_scsi_iov_to_sgl() error recovery
fs/9p: Compare qid.path in v9fs_test_inode
iscsi-target: Fix non-immediate TMR reference leak
target: Fix QUEUE_FULL + SCSI task attribute handling
KVM: nVMX: set IDTR and GDTR limits when loading L1 host state
KVM: SVM: obey guest PAT
SUNRPC: Fix tracepoint storage issues with svc_recv and svc_rqst_status
clk: ti: dra7-atl-clock: Fix of_node reference counting
clk: ti: dra7-atl-clock: fix child-node lookups
libnvdimm, namespace: fix label initialization to use valid seq numbers
libnvdimm, namespace: make 'resource' attribute only readable by root
IB/srpt: Do not accept invalid initiator port names
IB/srp: Avoid that a cable pull can trigger a kernel crash
NFC: fix device-allocation error return
i40e: Use smp_rmb rather than read_barrier_depends
igb: Use smp_rmb rather than read_barrier_depends
igbvf: Use smp_rmb rather than read_barrier_depends
ixgbevf: Use smp_rmb rather than read_barrier_depends
i40evf: Use smp_rmb rather than read_barrier_depends
fm10k: Use smp_rmb rather than read_barrier_depends
ixgbe: Fix skb list corruption on Power systems
parisc: Fix validity check of pointer size argument in new CAS implementation
powerpc/signal: Properly handle return value from uprobe_deny_signal()
media: Don't do DMA on stack for firmware upload in the AS102 driver
media: rc: check for integer overflow
cx231xx-cards: fix NULL-deref on missing association descriptor
media: v4l2-ctrl: Fix flags field on Control events
sched/rt: Simplify the IPI based RT balancing logic
fscrypt: lock mutex before checking for bounce page pool
net/9p: Switch to wait_event_killable()
PM / OPP: Add missing of_node_put(np)
e1000e: Fix error path in link detection
e1000e: Fix return value test
e1000e: Separate signaling for link check/link up
RDS: RDMA: return appropriate error on rdma map failures
PCI: Apply _HPX settings only to relevant devices
dmaengine: zx: set DMA_CYCLIC cap_mask bit
net: Allow IP_MULTICAST_IF to set index to L3 slave
net: 3com: typhoon: typhoon_init_one: make return values more specific
net: 3com: typhoon: typhoon_init_one: fix incorrect return values
drm/armada: Fix compile fail
ath10k: fix incorrect txpower set by P2P_DEVICE interface
ath10k: ignore configuring the incorrect board_id
ath10k: fix potential memory leak in ath10k_wmi_tlv_op_pull_fw_stats()
ath10k: set CTS protection VDEV param only if VDEV is up
ALSA: hda - Apply ALC269_FIXUP_NO_SHUTUP on HDA_FIXUP_ACT_PROBE
drm: Apply range restriction after color adjustment when allocation
mac80211: Remove invalid flag operations in mesh TSF synchronization
mac80211: Suppress NEW_PEER_CANDIDATE event if no room
iio: light: fix improper return value
staging: iio: cdc: fix improper return value
spi: SPI_FSL_DSPI should depend on HAS_DMA
netfilter: nft_queue: use raw_smp_processor_id()
netfilter: nf_tables: fix oob access
ASoC: rsnd: don't double free kctrl
btrfs: return the actual error value from from btrfs_uuid_tree_iterate
ASoC: wm_adsp: Don't overrun firmware file buffer when reading region data
s390/kbuild: enable modversions for symbols exported from asm
xen: xenbus driver must not accept invalid transaction ids
Revert "sctp: do not peel off an assoc from one netns to another one"
Linux 4.4.103
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit 4bdced5c9a2922521e325896a7bbbf0132c94e56 upstream.
When a CPU lowers its priority (schedules out a high priority task for a
lower priority one), a check is made to see if any other CPU has overloaded
RT tasks (more than one). It checks the rto_mask to determine this and if so
it will request to pull one of those tasks to itself if the non running RT
task is of higher priority than the new priority of the next task to run on
the current CPU.
When we deal with large number of CPUs, the original pull logic suffered
from large lock contention on a single CPU run queue, which caused a huge
latency across all CPUs. This was caused by only having one CPU having
overloaded RT tasks and a bunch of other CPUs lowering their priority. To
solve this issue, commit:
b6366f048e ("sched/rt: Use IPI to trigger RT task push migration instead of pulling")
changed the way to request a pull. Instead of grabbing the lock of the
overloaded CPU's runqueue, it simply sent an IPI to that CPU to do the work.
Although the IPI logic worked very well in removing the large latency build
up, it still could suffer from a large number of IPIs being sent to a single
CPU. On a 80 CPU box, I measured over 200us of processing IPIs. Worse yet,
when I tested this on a 120 CPU box, with a stress test that had lots of
RT tasks scheduling on all CPUs, it actually triggered the hard lockup
detector! One CPU had so many IPIs sent to it, and due to the restart
mechanism that is triggered when the source run queue has a priority status
change, the CPU spent minutes! processing the IPIs.
Thinking about this further, I realized there's no reason for each run queue
to send its own IPI. As all CPUs with overloaded tasks must be scanned
regardless if there's one or many CPUs lowering their priority, because
there's no current way to find the CPU with the highest priority task that
can schedule to one of these CPUs, there really only needs to be one IPI
being sent around at a time.
This greatly simplifies the code!
The new approach is to have each root domain have its own irq work, as the
rto_mask is per root domain. The root domain has the following fields
attached to it:
rto_push_work - the irq work to process each CPU set in rto_mask
rto_lock - the lock to protect some of the other rto fields
rto_loop_start - an atomic that keeps contention down on rto_lock
the first CPU scheduling in a lower priority task
is the one to kick off the process.
rto_loop_next - an atomic that gets incremented for each CPU that
schedules in a lower priority task.
rto_loop - a variable protected by rto_lock that is used to
compare against rto_loop_next
rto_cpu - The cpu to send the next IPI to, also protected by
the rto_lock.
When a CPU schedules in a lower priority task and wants to make sure
overloaded CPUs know about it. It increments the rto_loop_next. Then it
atomically sets rto_loop_start with a cmpxchg. If the old value is not "0",
then it is done, as another CPU is kicking off the IPI loop. If the old
value is "0", then it will take the rto_lock to synchronize with a possible
IPI being sent around to the overloaded CPUs.
If rto_cpu is greater than or equal to nr_cpu_ids, then there's either no
IPI being sent around, or one is about to finish. Then rto_cpu is set to the
first CPU in rto_mask and an IPI is sent to that CPU. If there's no CPUs set
in rto_mask, then there's nothing to be done.
When the CPU receives the IPI, it will first try to push any RT tasks that is
queued on the CPU but can't run because a higher priority RT task is
currently running on that CPU.
Then it takes the rto_lock and looks for the next CPU in the rto_mask. If it
finds one, it simply sends an IPI to that CPU and the process continues.
If there's no more CPUs in the rto_mask, then rto_loop is compared with
rto_loop_next. If they match, everything is done and the process is over. If
they do not match, then a CPU scheduled in a lower priority task as the IPI
was being passed around, and the process needs to start again. The first CPU
in rto_mask is sent the IPI.
This change removes this duplication of work in the IPI logic, and greatly
lowers the latency caused by the IPIs. This removed the lockup happening on
the 120 CPU machine. It also simplifies the code tremendously. What else
could anyone ask for?
Thanks to Peter Zijlstra for simplifying the rto_loop_start atomic logic and
supplying me with the rto_start_trylock() and rto_start_unlock() helper
functions.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Clark Williams <williams@redhat.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott Wood <swood@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170424114732.1aac6dc4@gandalf.local.home
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 7c2102e56a3f7d85b5d8f33efbd7aecc1f36fdd8 upstream.
The current implementation of synchronize_sched_expedited() incorrectly
assumes that resched_cpu() is unconditional, which it is not. This means
that synchronize_sched_expedited() can hang when resched_cpu()'s trylock
fails as follows (analysis by Neeraj Upadhyay):
o CPU1 is waiting for expedited wait to complete:
sync_rcu_exp_select_cpus
rdp->exp_dynticks_snap & 0x1 // returns 1 for CPU5
IPI sent to CPU5
synchronize_sched_expedited_wait
ret = swait_event_timeout(rsp->expedited_wq,
sync_rcu_preempt_exp_done(rnp_root),
jiffies_stall);
expmask = 0x20, CPU 5 in idle path (in cpuidle_enter())
o CPU5 handles IPI and fails to acquire rq lock.
Handles IPI
sync_sched_exp_handler
resched_cpu
returns while failing to try lock acquire rq->lock
need_resched is not set
o CPU5 calls rcu_idle_enter() and as need_resched is not set, goes to
idle (schedule() is not called).
o CPU 1 reports RCU stall.
Given that resched_cpu() is now used only by RCU, this commit fixes the
assumption by making resched_cpu() unconditional.
Reported-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Suggested-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry pick from commit fc6eead7c1e2e5376c25d2795d4539fdacbc0648)
Now that we fixed the sub-ns handling for CLOCK_MONOTONIC_RAW,
remove the duplicitive tk->raw_time.tv_nsec, which can be
stored in tk->tkr_raw.xtime_nsec (similarly to how its handled
for monotonic time).
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Miroslav Lichvar <mlichvar@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Stephen Boyd <stephen.boyd@linaro.org>
Cc: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Daniel Mentz <danielmentz@google.com>
Tested-by: Daniel Mentz <danielmentz@google.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Bug: 20045882
Bug: 63737556
Change-Id: I243827d21b08703a09d2d2fe738a9258be224582
(cherry pick from commit 3d88d56c5873f6eebe23e05c3da701960146b801)
Due to how the MONOTONIC_RAW accumulation logic was handled,
there is the potential for a 1ns discontinuity when we do
accumulations. This small discontinuity has for the most part
gone un-noticed, but since ARM64 enabled CLOCK_MONOTONIC_RAW
in their vDSO clock_gettime implementation, we've seen failures
with the inconsistency-check test in kselftest.
This patch addresses the issue by using the same sub-ns
accumulation handling that CLOCK_MONOTONIC uses, which avoids
the issue for in-kernel users.
Since the ARM64 vDSO implementation has its own clock_gettime
calculation logic, this patch reduces the frequency of errors,
but failures are still seen. The ARM64 vDSO will need to be
updated to include the sub-nanosecond xtime_nsec values in its
calculation for this issue to be completely fixed.
Signed-off-by: John Stultz <john.stultz@linaro.org>
Tested-by: Daniel Mentz <danielmentz@google.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Stephen Boyd <stephen.boyd@linaro.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: "stable #4 . 8+" <stable@vger.kernel.org>
Cc: Miroslav Lichvar <mlichvar@redhat.com>
Link: http://lkml.kernel.org/r/1496965462-20003-3-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Bug: 20045882
Bug: 63737556
Change-Id: I6c55dd7685f6bd212c6af9d09c527528e1dd5fa1
commit 0d0e57697f162da4aa218b5feafe614fb666db07 upstream.
The patch fixes two things at once:
1) It checks the env->allow_ptr_leaks and only prints the map address to
the log if we have the privileges to do so, otherwise it just dumps 0
as we would when kptr_restrict is enabled on %pK. Given the latter is
off by default and not every distro sets it, I don't want to rely on
this, hence the 0 by default for unprivileged.
2) Printing of ldimm64 in the verifier log is currently broken in that
we don't print the full immediate, but only the 32 bit part of the
first insn part for ldimm64. Thus, fix this up as well; it's okay to
access, since we verified all ldimm64 earlier already (including just
constants) through replace_map_fd_with_map_ptr().
Fixes: 1be7f75d16 ("bpf: enable non-root eBPF programs")
Fixes: cbd3570086 ("bpf: verifier (add ability to receive verification log)")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
[bwh: Backported to 4.4: s/bpf_verifier_env/verifier_env/]
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAloMZ5oACgkQONu9yGCS
aT7uNg//Yr5QbYa+KgrcZaDLRH0otZkXsgwu4pt0uSW7Cudc5j6s2/RBSDxM1qvH
9Dnj9PY7gmC9xSC9qzBf/p8VzNv1ZxxNRkrvgR4sSMiL2o4Up8Yhs7mvxAZoagmQ
N3fsXrwgrly8qn1oQh5Q5zT2FZMVpnIjvNGDW1Rf+AhjyUQ5hXdVft88O705q0Ok
yPlJ0b/Y9sMLiUe65jxwaZS+0RnpjWEM3V7n2gOSoiYXuVaqwMC5lUsasUYBNeU4
o9tf4DR/9+3SAzcSbwYdC+MCLNia/FCw0N+RJk28eaMoca5ifOcPQEXAhC9u6N/V
XsSx79cV4I+X9wdH/L1jqw6LZn5Zh57tH1s1izB8cuBP51OEhltKuIBPSqhVSqDe
YSuho1hU32Kx1peTixwYwNikAmPMJ7biTCkA6azmFM7QWkX0Pq5kSF6nOV99fcCt
LWDfvHrOYmQfr8tLM0+/OnE0DtVDga5z0Bx9cKGjX9EPqtzi+wbBFWV1ccqNlnx6
1T7JQ98fNtFKo++jBy7RmIpGVkDkF/sio2lamboWB0ite4HuLwQwmISaHfJv7OE/
4m+mmBmSng3kUv9yFyNHhjaKVebveSHsF7aG/KhGd1ZbKFGPqC1Vz3naXvHzNDiD
HJGbMUOz9uH3jgugNP2zI8kqOKPL81YMZ5xyAZ6JyuAa/GgMW/Q=
=DOmu
-----END PGP SIGNATURE-----
Merge 4.4.98 into android-4.4
Changes in 4.4.98
adv7604: Initialize drive strength to default when using DT
video: fbdev: pmag-ba-fb: Remove bad `__init' annotation
PCI: mvebu: Handle changes to the bridge windows while enabled
xen/netback: set default upper limit of tx/rx queues to 8
drm: drm_minor_register(): Clean up debugfs on failure
KVM: PPC: Book 3S: XICS: correct the real mode ICP rejecting counter
iommu/arm-smmu-v3: Clear prior settings when updating STEs
powerpc/corenet: explicitly disable the SDHC controller on kmcoge4
ARM: omap2plus_defconfig: Fix probe errors on UARTs 5 and 6
crypto: vmx - disable preemption to enable vsx in aes_ctr.c
iio: trigger: free trigger resource correctly
phy: increase size of MII_BUS_ID_SIZE and bus_id
serial: sh-sci: Fix register offsets for the IRDA serial port
usb: hcd: initialize hcd->flags to 0 when rm hcd
netfilter: nft_meta: deal with PACKET_LOOPBACK in netdev family
IPsec: do not ignore crypto err in ah4 input
Input: mpr121 - handle multiple bits change of status register
Input: mpr121 - set missing event capability
IB/ipoib: Change list_del to list_del_init in the tx object
s390/qeth: issue STARTLAN as first IPA command
net: dsa: select NET_SWITCHDEV
platform/x86: hp-wmi: Fix detection for dock and tablet mode
cdc_ncm: Set NTB format again after altsetting switch for Huawei devices
KEYS: trusted: sanitize all key material
KEYS: trusted: fix writing past end of buffer in trusted_read()
platform/x86: hp-wmi: Fix error value for hp_wmi_tablet_state
platform/x86: hp-wmi: Do not shadow error values
x86/uaccess, sched/preempt: Verify access_ok() context
workqueue: Fix NULL pointer dereference
crypto: x86/sha1-mb - fix panic due to unaligned access
KEYS: fix NULL pointer dereference during ASN.1 parsing [ver #2]
ARM: 8720/1: ensure dump_instr() checks addr_limit
ALSA: seq: Fix OSS sysex delivery in OSS emulation
ALSA: seq: Avoid invalid lockdep class warning
MIPS: microMIPS: Fix incorrect mask in insn_table_MM
MIPS: Fix CM region target definitions
MIPS: SMP: Use a completion event to signal CPU up
MIPS: Fix race on setting and getting cpu_online_mask
MIPS: SMP: Fix deadlock & online race
test: firmware_class: report errors properly on failure
selftests: firmware: add empty string and async tests
selftests: firmware: send expected errors to /dev/null
tools: firmware: check for distro fallback udev cancel rule
MIPS: AR7: Defer registration of GPIO
MIPS: AR7: Ensure that serial ports are properly set up
Input: elan_i2c - add ELAN060C to the ACPI table
drm/vmwgfx: Fix Ubuntu 17.10 Wayland black screen issue
rbd: use GFP_NOIO for parent stat and data requests
can: sun4i: handle overrun in RX FIFO
can: c_can: don't indicate triple sampling support for D_CAN
x86/oprofile/ppro: Do not use __this_cpu*() in preemptible context
PKCS#7: fix unitialized boolean 'want'
Linux 4.4.98
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit cef572ad9bd7f85035ba8272e5352040e8be0152 upstream.
When queue_work() is used in irq (not in task context), there is
a potential case that trigger NULL pointer dereference.
----------------------------------------------------------------
worker_thread()
|-spin_lock_irq()
|-process_one_work()
|-worker->current_pwq = pwq
|-spin_unlock_irq()
|-worker->current_func(work)
|-spin_lock_irq()
|-worker->current_pwq = NULL
|-spin_unlock_irq()
//interrupt here
|-irq_handler
|-__queue_work()
//assuming that the wq is draining
|-is_chained_work(wq)
|-current_wq_worker()
//Here, 'current' is the interrupted worker!
|-current->current_pwq is NULL here!
|-schedule()
----------------------------------------------------------------
Avoid it by checking for task context in current_wq_worker(), and
if not in task context, we shouldn't use the 'current' to check the
condition.
Reported-by: Xiaofei Tan <tanxiaofei@huawei.com>
Signed-off-by: Li Bin <huawei.libin@huawei.com>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 8d03ecfe47 ("workqueue: reimplement is_chained_work() using current_wq_worker()")
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
We all should be using (and improving) the schedutil governor now. Get
rid of the non-upstream governor.
Tested on Hikey.
Change-Id: Ic660756536e5da51952738c3c18b94e31f58cd57
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Boosted RT tasks can be deboosted quickly, this makes boost usless
for RT tasks and causes lots of glitching. Use timers to prevent
de-boost too soon and wait for long enough such that next enqueue
happens after a threshold.
While this can be solved in the governor, there are following
advantages:
- The approach used is governor-independent
- Reduces boost group lock contention for frequently sleepers/wakers
- Works with schedfreq without any other schedfreq hacks.
Bug: 30210506
Change-Id: I41788b235586988be446505deb7c0529758a9898
Signed-off-by: Joel Fernandes <joelaf@google.com>
This patch adds schedtune enqueue/dequeue to RT scheduling class.
Change-Id: If416e64319d62191f3aedd675d3e9a21fe2102fb
Signed-off-by: Joel Fernandes <joelaf@google.com>
util_delta becomes not zero in eenv_before, which will affect the
calculation of grp_util in group_idle_state(). Fix it under the
new condition.
Change-Id: Ic3853bb45876a8e388afcbe4e72d25fc42b1d7b0
Signed-off-by: Ke Wang <ke.wang@spreadtrum.com>
If no energy saving for target_cpu in the calculation of energy_diff(),
backup_cpu will be set as the new dst_cpu for the next calculation. At this
point, we also need update the new trg_cpu as backup_cpu to make sure the
subsequent calculation of energy_diff() is correct.
Signed-off-by: Ke Wang <ke.wang@spreadtrum.com>
Before commit 5f8b3a757d ("sched/fair: consider task utilization in
group_norm_util()"), eenv->util_delta is used to distinguish energy
before and energy after in sched_group_energy(). After that commit,
eenv->util_delta can not do that any more.
In this commit, use trg_cpu to distinguish energy before/after in
sched_group_energy().
Before apply this commit, cap_before/cap_delta is not correct:
<idle>-0 [001] 147504.608920: sched_energy_diff: pid=7 comm=rcu_preempt
src_cpu=1 dst_cpu=3 usage_delta=7 nrg_before=250 nrg_after=250 nrg_diff=0
cap_before=0 cap_after=528 cap_delta=1056 nrg_delta=0 nrg_payoff=0
After apply this commit, cap_before/cap_delta retrun to normal:
<idle>-0 [001] 220.494011: sched_energy_diff: pid=7 comm=rcu_preempt
src_cpu=1 dst_cpu=2 usage_delta=3 nrg_before=248 nrg_after=248 nrg_diff=0
cap_before=528 cap_after=528 cap_delta=0 nrg_delta=0 nrg_payoff=0
Signed-off-by: Ke Wang <ke.wang@spreadtrum.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAln62hMACgkQONu9yGCS
aT4KoRAAg9FasYL4oGtTiNglajWWP33eVv5NiwwdFdCfQGOEIzZPQy7ST/I/b1CB
Kql3D9oA7d8ZFtg05KyQb4csH+9WNjeLBr4G2NGZurEC0c6vbU/64ceADgzA4YKJ
RJ2iDWEQKq45RVH6BlrDyITu9H20TRgzcsQZ7fiswB3ZJsPTfdyvDlUlN3A4JhXY
2eTF3CNVPEGLnaT4PY5tuLZpLIZkQkZzw1xr9YCq9YA5aNYi2OywWbyQq27ruX2d
K238FiaYSN5LeUxU6JE2tk4CxrHJ0pOiw6kBiSgIv3MwDQa5iQypKVQA2tnAXHqL
rPb4cGAcDSQYzpCu4XimDlLEQhoAX2BceSakdYXoMu66AKewizSnopAljhPHp9uk
0GO6lSJv0f+NGoCpxOE2FDfMIwiPbLC9LfMDWqpFvPanMfMe156p6D+LL4GfTaus
x4oZZa61aPwjomobEM4hzZk5bp1AjkiDxKHCBvwpuVTOIFlxlVcuB4RyuY2VsuHN
4a/tw9iEHkyJYCt3tsePTltgrAws2j7KCWLx+F3LTXWzmZ9//9bFq63V6kIh0a2b
nPozkt0Xj7iygJwU1G2i5XAMTF5tPH8ELioGiakv0Rkj1ncMSXx1s2dO1uxR06a5
bx/MFLbo1AyZhE8Tk4LcT/rEHtjhj/24FX6sEq4xNjw/GvAzlp0=
=ScnL
-----END PGP SIGNATURE-----
Merge 4.4.96 into android-4.4
Changes in 4.4.96
workqueue: replace pool->manager_arb mutex with a flag
ALSA: hda/realtek - Add support for ALC236/ALC3204
ALSA: hda - fix headset mic problem for Dell machines with alc236
ceph: unlock dangling spinlock in try_flush_caps()
usb: xhci: Handle error condition in xhci_stop_device()
spi: uapi: spidev: add missing ioctl header
fuse: fix READDIRPLUS skipping an entry
xen/gntdev: avoid out of bounds access in case of partial gntdev_mmap()
Input: elan_i2c - add ELAN0611 to the ACPI table
Input: gtco - fix potential out-of-bound access
assoc_array: Fix a buggy node-splitting case
scsi: zfcp: fix erp_action use-before-initialize in REC action trace
scsi: sg: Re-fix off by one in sg_fill_request_table()
can: sun4i: fix loopback mode
can: kvaser_usb: Correct return value in printout
can: kvaser_usb: Ignore CMD_FLUSH_QUEUE_REPLY messages
regulator: fan53555: fix I2C device ids
x86/microcode/intel: Disable late loading on model 79
ecryptfs: fix dereference of NULL user_key_payload
Revert "drm: bridge: add DT bindings for TI ths8135"
Linux 4.4.96
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit 692b48258dda7c302e777d7d5f4217244478f1f6 upstream.
Josef reported a HARDIRQ-safe -> HARDIRQ-unsafe lock order detected by
lockdep:
[ 1270.472259] WARNING: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected
[ 1270.472783] 4.14.0-rc1-xfstests-12888-g76833e8 #110 Not tainted
[ 1270.473240] -----------------------------------------------------
[ 1270.473710] kworker/u5:2/5157 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
[ 1270.474239] (&(&lock->wait_lock)->rlock){+.+.}, at: [<ffffffff8da253d2>] __mutex_unlock_slowpath+0xa2/0x280
[ 1270.474994]
[ 1270.474994] and this task is already holding:
[ 1270.475440] (&pool->lock/1){-.-.}, at: [<ffffffff8d2992f6>] worker_thread+0x366/0x3c0
[ 1270.476046] which would create a new lock dependency:
[ 1270.476436] (&pool->lock/1){-.-.} -> (&(&lock->wait_lock)->rlock){+.+.}
[ 1270.476949]
[ 1270.476949] but this new dependency connects a HARDIRQ-irq-safe lock:
[ 1270.477553] (&pool->lock/1){-.-.}
...
[ 1270.488900] to a HARDIRQ-irq-unsafe lock:
[ 1270.489327] (&(&lock->wait_lock)->rlock){+.+.}
...
[ 1270.494735] Possible interrupt unsafe locking scenario:
[ 1270.494735]
[ 1270.495250] CPU0 CPU1
[ 1270.495600] ---- ----
[ 1270.495947] lock(&(&lock->wait_lock)->rlock);
[ 1270.496295] local_irq_disable();
[ 1270.496753] lock(&pool->lock/1);
[ 1270.497205] lock(&(&lock->wait_lock)->rlock);
[ 1270.497744] <Interrupt>
[ 1270.497948] lock(&pool->lock/1);
, which will cause a irq inversion deadlock if the above lock scenario
happens.
The root cause of this safe -> unsafe lock order is the
mutex_unlock(pool->manager_arb) in manage_workers() with pool->lock
held.
Unlocking mutex while holding an irq spinlock was never safe and this
problem has been around forever but it never got noticed because the
only time the mutex is usually trylocked while holding irqlock making
actual failures very unlikely and lockdep annotation missed the
condition until the recent b9c16a0e1f73 ("locking/mutex: Fix
lockdep_assert_held() fail").
Using mutex for pool->manager_arb has always been a bit of stretch.
It primarily is an mechanism to arbitrate managership between workers
which can easily be done with a pool flag. The only reason it became
a mutex is that pool destruction path wants to exclude parallel
managing operations.
This patch replaces the mutex with a new pool flag POOL_MANAGER_ACTIVE
and make the destruction path wait for the current manager on a wait
queue.
v2: Drop unnecessary flag clearing before pool destruction as
suggested by Boqun.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Upmigrate misfit current task upon scheduler tick with stopper.
We can kick an random (not necessarily big CPU) NOHZ idle CPU when a
CPU bound task is in need of upmigration. But it's not efficient as that
way needs following unnecessary wakeups:
1. Busy little CPU A to kick idle B
2. B runs idle balancer and enqueue migration/A
3. B goes idle
4. A runs migration/A, enqueues busy task on B.
5. B wakes up again.
This change makes active upmigration more efficiently by doing:
1. Busy little CPU A find target CPU B upon tick.
2. CPU A enqueues migration/A.
Change-Id: Ie865738054ea3296f28e6ba01710635efa7193c0
[joonwoop: The original version had logic to reserve CPU. The logic is
omitted in this version.]
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
Signed-off-by: Vikram Mulukutla <markivx@codeaurora.org>
Currently active_load_balance_cpu_stop is run by cpu stopper and it
pushes running tasks off the busiest CPU onto idle target CPU. But
there is no check to see whether target cpu is offline or not before
pushing the tasks. With the introduction of active migration in the
scheduler tick path (see check_for_migration()) there have been
instances of attempts to migrate tasks to offline CPUs.
Add a check as to whether the target cpu is online or not to prevent
scheduling on offline CPUs.
Change-Id: Ib8ac7f8aeabd3ca7365f3eae977075952dab4f21
Signed-off-by: Prasad Sodagudi <psodagud@codeaurora.org>
[rameezmustafa@codeaurora.org]: Port to msm-3.18]
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
Signed-off-by: Vikram Mulukutla <markivx@codeaurora.org>
Active balance currently picks one task to migrate from busy cpu to
a chosen cpu (push_cpu). This patch extends active load balance to
recognize a particular task ('push_task') that needs to be migrated to
'push_cpu'. This capability will be leveraged by HMP-aware task
placement in a subsequent patch.
Change-Id: If31320111e6cc7044e617b5c3fd6d8e0c0e16952
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
[rameezmustafa@codeaurora.org]: Port to msm-3.18]
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlny7PcACgkQONu9yGCS
aT7UAw//Zqfv7J8fAyfeTnVsbBg5GcD7SW7clzILMclb2nCGcS1g6ZFZAOoWRZV3
eSnYKLBvOxfnF+TunD/aNagtKMIEuPqA/9Srd3uL2sdso1FtMsL8PR07Ml1o3Qf3
eUeKt/JfchpD5r+1ca4CWqRDvFyuQwP8RMsbXUqjsDEy/5USeGOzPGa9jdQ5JI/O
zvXlHyWuXryvDWHlM52H67CSdZC2KSUNeybji8EqrJlK2yijJQ8CvN7jrBevSVVy
2XpMx9RoeAbOAH36VuW5R3/tpUPW70L3Tw8zTo8dfs5w/TEMgddzDd4aWz8PLLqo
mhD6V3bgkqzkbXEaLvmmht/IMg497K6HcAFzfDu/M8X1wSaNfNmJ8IBKmDTBeuBO
t5Ha2mqDxiCTrEXq/USEgiBW28PJXKv3C5MhdSlYPWo/QGho0boTRirmbbmYupf/
T02LwRWo8MQT9l+sX5zFx+/Tw/f5/f1bWSWVJ5ns+lFbQ5smnA2nxvcPLg6LuGEe
tXZ7R+7v5yFp5quyUPTz6eY8Tau4mswuzm4avob0QLr6ZDQXTt8WbD8kmaQRoDq4
U0k58ZZbdPnOyf0zxFTURQoPk+MI/EV9tclQcWEsR+AXWVzqA71rSUMCcSl7Gad2
/vSkgDcSQ1xt0UQ7Bqf7PdNSVLtfH/n/jXJGJXwYW/Q0r/zRTYY=
=vcfF
-----END PGP SIGNATURE-----
Merge 4.4.95 into android-4.4
Changes in 4.4.95
USB: devio: Revert "USB: devio: Don't corrupt user memory"
USB: core: fix out-of-bounds access bug in usb_get_bos_descriptor()
USB: serial: metro-usb: add MS7820 device id
usb: cdc_acm: Add quirk for Elatec TWN3
usb: quirks: add quirk for WORLDE MINI MIDI keyboard
usb: hub: Allow reset retry for USB2 devices on connect bounce
ALSA: usb-audio: Add native DSD support for Pro-Ject Pre Box S2 Digital
can: gs_usb: fix busy loop if no more TX context is available
usb: musb: sunxi: Explicitly release USB PHY on exit
usb: musb: Check for host-mode using is_host_active() on reset interrupt
can: esd_usb2: Fix can_dlc value for received RTR, frames
drm/nouveau/bsp/g92: disable by default
drm/nouveau/mmu: flush tlbs before deleting page tables
ALSA: seq: Enable 'use' locking in all configurations
ALSA: hda: Remove superfluous '-' added by printk conversion
i2c: ismt: Separate I2C block read from SMBus block read
brcmsmac: make some local variables 'static const' to reduce stack size
bus: mbus: fix window size calculation for 4GB windows
clockevents/drivers/cs5535: Improve resilience to spurious interrupts
rtlwifi: rtl8821ae: Fix connection lost problem
KEYS: encrypted: fix dereference of NULL user_key_payload
lib/digsig: fix dereference of NULL user_key_payload
KEYS: don't let add_key() update an uninstantiated key
pkcs7: Prevent NULL pointer dereference, since sinfo is not always set.
parisc: Avoid trashing sr2 and sr3 in LWS code
parisc: Fix double-word compare and exchange in LWS code on 32-bit kernels
sched/autogroup: Fix autogroup_move_group() to never skip sched_move_task()
f2fs crypto: replace some BUG_ON()'s with error checks
f2fs crypto: add missing locking for keyring_key access
fscrypt: fix dereference of NULL user_key_payload
KEYS: Fix race between updating and finding a negative key
fscrypto: require write access to mount to set encryption policy
FS-Cache: fix dereference of NULL user_key_payload
Linux 4.4.95
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
have_sched_energy_data is defined only for CONFIG_SMP, so declare it
only with CONFIG_SMP.
Fixes warning from intel bot:
tree: https://android.googlesource.com/kernel/msm android-4.4
head: a21299785a
commit: a21299785a [5/5] sched/core: Warn
if ENERGY_AWARE is enabled but data is missing
config: i386-randconfig-x002-201743 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
git checkout a21299785a
# save the attached .config to linux build tree
make ARCH=i386
All warnings (new ones prefixed by >>):
>> kernel//sched/core.c:94:13: warning: 'have_sched_energy_data' used
but never defined
static bool have_sched_energy_data(void);
^~~~~~~~~~~~~~~~~~~~~~
vim +/have_sched_energy_data +94 kernel//sched/core.c
93
> 94 static bool have_sched_energy_data(void);
95
Change-Id: I266b63ece6fb31d2b5b11821a8244e147ba6d3a4
Signed-off-by: Joel Fernandes <joelaf@google.com>
If the EAS energy model is missing or incomplete, i.e. sd_scs is NULL, then
sched_group_energy will return -EINVAL on the assumption that it raced with a
CPU hotplug event. In that case, energy_diff will return 0 and the energy-aware
wake path will silently fail to trigger any migrations.
This case can be triggered by disabling CONFIG_SCHED_MC on existing platforms,
so that there are no sched_groups with the SD_SHARE_CAP_STATES flag, so that
sd_scs is NULL.
Add checks so that a warning is printed if EAS is ever enabled while the
necessary data is not present.
Change-Id: Id233a510b5ad8b7fcecac0b1d789e730bbfc7c4a
Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
It is preferable that WALT window rollover occurs just
before a tick, since the tick is an opportune moment
to record a complete window's statistics, as well as report
those stats to the cpu frequency governor. When CONFIG_HZ
results in a TICK_NSEC that isn't a integral number, this
requirement may be violated. Account for this by reducing
the WALT window size to the nearest multiple of TICK_NSEC.
Commit d368c6faa1 ("sched: walt: fix window misalignment
when HZ=300") attempted to do this but WALT isn't using
MIN_SCHED_RAVG_WINDOW as the window size and the patch was
doing nothing.
Also, change the type of 'walt_disabled' to bool and warn
if an invalid window size causes WALT to be disabled.
Change-Id: Ie3dcfc21a3df4408254ca1165a355bbe391ed5c7
Signed-off-by: Vikram Mulukutla <markivx@codeaurora.org>
(from https://patchwork.kernel.org/patch/9895261/)
This patch adds a parameter to select_task_rq, sibling_count_hint
allowing the caller, where it has this information, to inform the
sched_class the number of tasks that are being woken up as part of
the same event.
The wake_q mechanism is one case where this information is available.
select_task_rq_fair can then use the information to detect that it
needs to widen the search space for task placement in order to avoid
overloading the last-level cache domain's CPUs.
* * *
The reason I am investigating this change is the following use case
on ARM big.LITTLE (asymmetrical CPU capacity): 1 task per CPU, which
all repeatedly do X amount of work then
pthread_barrier_wait (i.e. sleep until the last task finishes its X
and hits the barrier). On big.LITTLE, the tasks which get a "big" CPU
finish faster, and then those CPUs pull over the tasks that are still
running:
v CPU v ->time->
-------------
0 (big) 11111 /333
-------------
1 (big) 22222 /444|
-------------
2 (LITTLE) 333333/
-------------
3 (LITTLE) 444444/
-------------
Now when task 4 hits the barrier (at |) and wakes the others up,
there are 4 tasks with prev_cpu=<big> and 0 tasks with
prev_cpu=<little>. want_affine therefore means that we'll only look
in CPUs 0 and 1 (sd_llc), so tasks will be unnecessarily coscheduled
on the bigs until the next load balance, something like this:
v CPU v ->time->
------------------------
0 (big) 11111 /333 31313\33333
------------------------
1 (big) 22222 /444|424\4444444
------------------------
2 (LITTLE) 333333/ \222222
------------------------
3 (LITTLE) 444444/ \1111
------------------------
^^^
underutilization
So, I'm trying to get want_affine = 0 for these tasks.
I don't _think_ any incarnation of the wakee_flips mechanism can help
us here because which task is waker and which tasks are wakees
generally changes with each iteration.
However pthread_barrier_wait (or more accurately FUTEX_WAKE) has the
nice property that we know exactly how many tasks are being woken, so
we can cheat.
It might be a disadvantage that we "widen" _every_ task that's woken in
an event, while select_idle_sibling would work fine for the first
sd_llc_size - 1 tasks.
IIUC, if wake_affine() behaves correctly this trick wouldn't be
necessary on SMP systems, so it might be best guarded by the presence
of SD_ASYM_CPUCAPACITY?
* * *
Final note..
In order to observe "perfect" behaviour for this use case, I also had
to disable the TTWU_QUEUE sched feature. Suppose during the wakeup
above we are working through the work queue and have placed tasks 3
and 2, and are about to place task 1:
v CPU v ->time->
--------------
0 (big) 11111 /333 3
--------------
1 (big) 22222 /444|4
--------------
2 (LITTLE) 333333/ 2
--------------
3 (LITTLE) 444444/ <- Task 1 should go here
--------------
If TTWU_QUEUE is enabled, we will not yet have enqueued task
2 (having instead sent a reschedule IPI) or attached its load to CPU
2. So we are likely to also place task 1 on cpu 2. Disabling
TTWU_QUEUE means that we enqueue task 2 before placing task 1,
solving this issue. TTWU_QUEUE is there to minimise rq lock
contention, and I guess that this contention is less of an issue on
big.LITTLE systems since they have relatively few CPUs, which
suggests the trade-off makes sense here.
Change-Id: I2080302839a263e0841a89efea8589ea53bbda9c
Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Energy cost estimation has been a long lasting challenge for WALT
because WALT guides CPU frequency based on the CPU utilization of
previous window. Consequently it's not possible to know newly
waking-up task's energy cost until WALT's end of the current window.
The WALT already tracks 'Previous Runnable Sum' (prev_runnable_sum)
and 'Cumulative Runnable Average' (cr_avg). They are designed for
CPU frequency guidance and task placement but unfortunately both
are not suitable for the energy cost estimation.
It's because using prev_runnable_sum for energy cost calculation would
make us to account CPU and task's energy solely based on activity in the
previous window so for example, any task didn't have an activity in the
previous window will be accounted as a 'zero energy cost' task.
Energy estimation with cr_avg is what energy_diff() relies on at present.
However cr_avg can only represent instantaneous picture of energy cost
thus for example, if a CPU was fully occupied for an entire WALT window
and became idle just before window boundary, and if there is a wake-up,
energy_diff() accounts that CPU is a 'zero energy cost' CPU.
As a result, introduce a new accounting unit 'Cumulative Window Demand'.
The cumulative window demand tracks all the tasks' demands have seen in
current window which is neither instantaneous nor actual execution time.
Because task demand represents estimated scaled execution time when the
task runs a full window, accumulation of all the demands represents
predicted CPU load at the end of window.
Thus we can estimate CPU's frequency at the end of current WALT window
with the cumulative window demand.
The use of prev_runnable_sum for the CPU frequency guidance and cr_avg
for the task placement have not changed and these are going to be used
for both purpose while this patch aims to add an additional statistics.
Change-Id: I9908c77ead9973a26dea2b36c001c2baf944d4f5
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
Patch 5680f23f20 ("sched/fair: streamline find_best_target
heuristics") has reworked function find_best_target, as result the
variable "target_util" is useless now. So remove it.
Change-Id: I5447062419e5828a49115119984fac6cd37db034
Signed-off-by: Leo Yan <leo.yan@linaro.org>
schedtune_initialized is protected by CONFIG_CGROUP_SCHEDTUNE, but
is being used without CONFIG_CGROUP_SCHEDTUNE being defined. Add
appropriate ifdefs around the usage of schedtune_initialized to
avoid a compilation error when CONFIG_CGROUP_SCHEDTUNE is not
defined.
Change-Id: Iab79bf053d74db3eeb84c09d71d43b4e39746ed2
Signed-off-by: Russ Weight <russell.h.weight@intel.com>
Signed-off-by: Fei Yang <fei.yang@intel.com>
The group_max_util() function is used to compute the maximum utilization
across the CPUs of a certain energy_env configuration.
Its main client is the energy_diff function when it needs to compute the
SG capacity for one of the before/after scheduling candidates.
Currently, the energy_diff function sets util_delta = 0 when it wants to
compute the energy corresponding to the scheduling candidate where the
task runs in the previous CPU. This implies that, for the task waking up
in the previous CPU we consider only its blocked load tracked by the CPU
RQ. However, in case of a medium-big task which is waking up on a long
time idle CPU, this blocked load can be already completely decayed.
More in general, the current approach is biased towards under-estimating
the capacity requirements for the "before" scheduling candidate.
This patch fixes this by:
- always use the cpu_util_wake() to properly get the utilization of a CPU
without any (partially decayed) contribution of the waking up task
- adding the task utilization to the cpu_util_wake just for the target
cpu
The "target CPU" is defined by the energy_env to be either the src_cpu or
the dst_cpu, depending on which scheduling candidate we are considering.
Finally, since this update removes the last usage of calc_util_delta()
this function is now safely removed.
Change-Id: I20ee1bcf40cee6bf6e265fb2d32ef79061ad6ced
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
The group_norm_util() function is used to compute the normalized
utilization of a SG given a certain energy_env configuration.
The main client of this function is the energy_diff function when it
comes to compute the SG energy for one of the before/after scheduling
candidates.
Currently, the energy_diff function sets util_delta = 0 when it wants to
compute the energy corresponding to the scheduling candidate where the
task runs in the previous CPU. This implies that, for the task waking up
in the previous CPU we consider only its blocked load tracked by the CPU
RQ. However, in case of a medium-big task which is waking up on a long
time idle CPU, this blocked load can be already completely decayed.
More in general, the current approach is biased towards under-estimating
the energy consumption for the "before" scheduling candidate.
This patch fixes this by:
- always use the cpu_util_wake() to properly get the utilization of a CPU
without any (partially decayed) contribution of the waking up task
- adding the task utilization to the cpu_util_wake just for the
target cpu
The "target CPU" is defined by the energy_env to be either the src_cpu
or the dst_cpu, depending on which scheduling candidate we are
considering.
This patch update also the definition of __cpu_norm_util(), which is
currently called just by the group_norm_util() function. This allows to
simplify the code by using this function just to normalize a specified
utilization with respect to a given capacity.
This update allows to completely remove any dependency of
group_norm_util() from calc_util_delta().
Change-Id: I3b6ec50ce8decb1521faae660e326ab3319d3c82
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
For non latency sensitive tasks the goal is to optimize for energy efficiency.
Thus, we should try our best to avoid moving a task on a CPU which is then
going to be marked as overutilized.
Let's use the capacity_margin metric to verify if a candidate target CPU
should be considered without risking to bail out of EAS mode.
Change-Id: Ib3697106f4073aedf4a6c6ce42bd5d000fa8c007
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
The find_best_target can sometimes not return a valid backup CPU, either
because it cannot find one or just becasue it returns prev_cpu as a backup.
In these cases we should skip the energy_diff evaluation for the backup CPU.
Change-Id: I3787dbdfe74122348dd7a7485b88c4679051bd32
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>