commit 41c25707d21716826e3c1f60967f5550610ec1c9 upstream.
In most cases, a cgroup controller doesn't care about the lifetimes of
cgroups. For the controller, a css becomes online when ->css_online()
is called on it and offline when ->css_offline() is called.
However, cpuset is special in that the user interface it exposes cares
whether certain cgroups exist or not. Combined with the RCU delay
between cgroup removal and css offlining, this can lead to user
visible behavior oddities where operations which should succeed after
cgroup removals fail for some time period. The effects of cgroup
removals are delayed when seen from userland.
This patch adds css_is_dying() which tests whether offline is pending
and updates is_cpuset_online() so that the function returns false also
while offline is pending. This gets rid of the userland visible
delays.
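For reference, a minimal sketch of the shape of the change (helper and flag
names are recalled from the upstream cgroup code and should be treated as
assumptions):

  /* include/linux/cgroup.h: a css whose refcnt is being killed is dying */
  static inline bool css_is_dying(struct cgroup_subsys_state *css)
  {
          return !(css->flags & CSS_NO_REF) &&
                 percpu_ref_is_dying(&css->refcnt);
  }

  /* kernel/cpuset.c: a dying cpuset is reported as offline to userland */
  static inline bool is_cpuset_online(struct cpuset *cs)
  {
          return test_bit(CS_ONLINE, &cs->flags) && !css_is_dying(&cs->css);
  }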
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Link: http://lkml.kernel.org/r/327ca1f5-7957-fbb9-9e5f-9ba149d40ba2@oracle.com
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 5ea30e4e58040cfd6434c2f33dc3ea76e2c15b05 upstream.
The stack canary is an 'unsigned long' and should be fully initialized to
random data rather than only 32 bits of random data.
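The fix itself is a one-liner; a hedged sketch of the resulting hunk in
dup_task_struct() (kernel/fork.c in contemporaneous trees; the exact location
is an assumption):

  #ifdef CONFIG_CC_STACKPROTECTOR
          /* seed the whole unsigned long, not just the low 32 bits */
          tsk->stack_canary = get_random_long();
  #endif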
Signed-off-by: Daniel Micay <danielmicay@gmail.com>
Acked-by: Arjan van de Ven <arjan@linux.intel.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Arjan van Ven <arjan@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kernel-hardening@lists.openwall.com
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/20170504133209.3053-1-danielmicay@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit c70d9d809fdeecedb96972457ee45c49a232d97f upstream.
When I introduced ptracer_cred I failed to consider the weirdness of
fork where the task_struct copies the old value by default. This
winds up leaving ptracer_cred set even when a process forks and
the child process does not wind up being ptraced.
Because ptracer_cred winds up set on non-ptraced processes whose
parents were ptraced, this has broken the ability of the enlightenment
window manager to start setuid children.
Fix this by properly initializing ptracer_cred in ptrace_init_task().
This must be done with a little bit of care to preserve the current value
of ptracer_cred when ptrace carries through fork. Re-reading the
ptracer_cred from the ptracing process at this point is inconsistent
with how PT_PTRACE_CAP has been maintained all of these years.
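A hedged sketch of ptrace_init_task() after the fix (include/linux/ptrace.h;
the shape is recalled, not verified against this tree):

  static inline void ptrace_init_task(struct task_struct *child, bool ptrace)
  {
          INIT_LIST_HEAD(&child->ptrace_entry);
          INIT_LIST_HEAD(&child->ptraced);
          child->jobctl = 0;
          child->ptrace = 0;
          child->parent = child->real_parent;

          if (unlikely(ptrace) && current->ptrace) {
                  child->ptrace = current->ptrace;
                  __ptrace_link(child, current->parent, current->ptracer_cred);

                  if (child->ptrace & PT_SEIZED)
                          task_set_jobctl_pending(child, JOBCTL_TRAP_STOP);
                  else
                          sigaddset(&child->pending.signal, SIGSTOP);
          } else
                  child->ptracer_cred = NULL;   /* don't inherit a stale ref */
  }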
Tested-by: Takashi Iwai <tiwai@suse.de>
Fixes: 64b875f7ac8a ("ptrace: Capture the ptracer's creds not PT_PTRACE_CAP")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The scheduling change (bug 31501544) to avoid putting RT threads on cores that
are handling softints was catching cases where there was no reason
to believe the softint would take a long time, resulting in unnecessary
migration overhead. This patch reduces the migration to cases where
the core has a softint that is actually likely to take a long time,
as opposed to the RCU, SCHED, and TIMER softints that are rather quick.
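A hedged sketch of the idea; the exact set of softirqs classified as
long-running and the helper name are assumptions here, not the verbatim patch:

  /* Softirqs that may run long enough to justify migrating an RT task;
   * RCU, SCHED and TIMER are deliberately excluded since they are quick. */
  #define LONG_SOFTIRQ_MASK ((1 << NET_TX_SOFTIRQ) | \
                             (1 << NET_RX_SOFTIRQ) | \
                             (1 << BLOCK_SOFTIRQ)  | \
                             (1 << TASKLET_SOFTIRQ))

  static inline bool softirq_load_is_long(u32 pending)
  {
          return pending & LONG_SOFTIRQ_MASK;
  }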
Bug: 31752786
Change-Id: Ib4e179f1e15c736b2fdba31070494e357e9fbbe2
Git-commit: ce05770bd37b8065b61ef650108ecef2b97b148b
Git-repo: https://android.googlesource.com/kernel/msm
[pkondeti@codeaurora.org: resolved minor merge conflicts]
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
select_best_cpu() has previous CPU's cluster bias which overrides
the best_cpu with best_sibling_cpu when the power cost is same.
When the power table is configured incorrectly or static_cpu_pwr_cost/
static_cluster_pwr_cost tunables are set to a large value, the
power_cost() for all candidate CPUs can return INT_MAX. So the
stats.min_cost is never changed from its initial value, i.e. INT_MAX.
In the above scenario, we find stats.best_cpu >= 0 && stats.min_cost ==
stats.best_sibling_cpu_cost == INT_MAX && stats.best_sibling_cpu == -1,
and so best_cpu is replaced with best_sibling_cpu, i.e. -1.
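A hedged sketch of the guard this change effectively adds (identifiers follow
the description above; the real HMP code may be structured differently):

  /* Only let the sibling-cluster bias win when a valid sibling cpu was
   * actually found at the same (finite) power cost. */
  if (stats.best_sibling_cpu >= 0 &&
      stats.best_sibling_cpu_cost == stats.min_cost)
          stats.best_cpu = stats.best_sibling_cpu;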
Change-Id: I09829e278e41daaaff959428ff50927aba29104c
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
* refs/heads/tmp-9bc4622:
Linux 4.4.70
drivers: char: mem: Check for address space wraparound with mmap()
nfsd: encoders mustn't use unitialized values in error cases
drm/edid: Add 10 bpc quirk for LGD 764 panel in HP zBook 17 G2
PCI: Freeze PME scan before suspending devices
PCI: Fix pci_mmap_fits() for HAVE_PCI_RESOURCE_TO_USER platforms
tracing/kprobes: Enforce kprobes teardown after testing
osf_wait4(): fix infoleak
genirq: Fix chained interrupt data ordering
uwb: fix device quirk on big-endian hosts
metag/uaccess: Check access_ok in strncpy_from_user
metag/uaccess: Fix access_ok()
iommu/vt-d: Flush the IOTLB to get rid of the initial kdump mappings
staging: rtl8192e: rtl92e_get_eeprom_size Fix read size of EPROM_CMD.
staging: rtl8192e: fix 2 byte alignment of register BSSIDR.
mm/huge_memory.c: respect FOLL_FORCE/FOLL_COW for thp
xc2028: Fix use-after-free bug properly
arm64: documentation: document tagged pointer stack constraints
arm64: uaccess: ensure extension of access_ok() addr
arm64: xchg: hazard against entire exchange variable
ARM: dts: at91: sama5d3_xplained: not all ADC channels are available
ARM: dts: at91: sama5d3_xplained: fix ADC vref
powerpc/64e: Fix hang when debugging programs with relocated kernel
powerpc/pseries: Fix of_node_put() underflow during DLPAR remove
powerpc/book3s/mce: Move add_taint() later in virtual mode
cx231xx-cards: fix NULL-deref at probe
cx231xx-audio: fix NULL-deref at probe
cx231xx-audio: fix init error path
dvb-frontends/cxd2841er: define symbol_rate_min/max in T/C fe-ops
zr364xx: enforce minimum size when reading header
dib0700: fix NULL-deref at probe
s5p-mfc: Fix unbalanced call to clock management
gspca: konica: add missing endpoint sanity check
ceph: fix recursion between ceph_set_acl() and __ceph_setattr()
iio: proximity: as3935: fix as3935_write
ipx: call ipxitf_put() in ioctl error path
USB: hub: fix non-SS hub-descriptor handling
USB: hub: fix SS hub-descriptor handling
USB: serial: io_ti: fix div-by-zero in set_termios
USB: serial: mct_u232: fix big-endian baud-rate handling
USB: serial: qcserial: add more Lenovo EM74xx device IDs
usb: serial: option: add Telit ME910 support
USB: iowarrior: fix info ioctl on big-endian hosts
usb: musb: tusb6010_omap: Do not reset the other direction's packet size
ttusb2: limit messages to buffer size
mceusb: fix NULL-deref at probe
usbvision: fix NULL-deref at probe
net: irda: irda-usb: fix firmware name on big-endian hosts
usb: host: xhci-mem: allocate zeroed Scratchpad Buffer
xhci: apply PME_STUCK_QUIRK and MISSING_CAS quirk for Denverton
usb: host: xhci-plat: propagate return value of platform_get_irq()
sched/fair: Initialize throttle_count for new task-groups lazily
sched/fair: Do not announce throttled next buddy in dequeue_task_fair()
fscrypt: avoid collisions when presenting long encrypted filenames
f2fs: check entire encrypted bigname when finding a dentry
fscrypt: fix context consistency check when key(s) unavailable
net: qmi_wwan: Add SIMCom 7230E
ext4 crypto: fix some error handling
ext4 crypto: don't let data integrity writebacks fail with ENOMEM
USB: serial: ftdi_sio: add Olimex ARM-USB-TINY(H) PIDs
USB: serial: ftdi_sio: fix setting latency for unprivileged users
pid_ns: Fix race between setns'ed fork() and zap_pid_ns_processes()
pid_ns: Sleep in TASK_INTERRUPTIBLE in zap_pid_ns_processes
iio: dac: ad7303: fix channel description
of: fix sparse warning in of_pci_range_parser_one
proc: Fix unbalanced hard link numbers
cdc-acm: fix possible invalid access when processing notification
drm/nouveau/tmr: handle races with hw when updating the next alarm time
drm/nouveau/tmr: avoid processing completed alarms when adding a new one
drm/nouveau/tmr: fix corruption of the pending list when rescheduling an alarm
drm/nouveau/tmr: ack interrupt before processing alarms
drm/nouveau/therm: remove ineffective workarounds for alarm bugs
drm/amdgpu: Make display watermark calculations more accurate
drm/amdgpu: Avoid overflows/divide-by-zero in latency_watermark calculations.
ath9k_htc: fix NULL-deref at probe
ath9k_htc: Add support of AirTies 1eda:2315 AR9271 device
s390/cputime: fix incorrect system time
s390/kdump: Add final note
regulator: tps65023: Fix inverted core enable logic.
KVM: X86: Fix read out-of-bounds vulnerability in kvm pio emulation
KVM: x86: Fix load damaged SSEx MXCSR register
ima: accept previously set IMA_NEW_FILE
mwifiex: pcie: fix cmd_buf use-after-free in remove/reset
rtlwifi: rtl8821ae: setup 8812ae RFE according to device type
md: update slab_cache before releasing new stripes when stripes resizing
dm space map disk: fix some book keeping in the disk space map
dm thin metadata: call precommit before saving the roots
dm bufio: make the parameter "retain_bytes" unsigned long
dm cache metadata: fail operations if fail_io mode has been established
dm bufio: check new buffer allocation watermark every 30 seconds
dm bufio: avoid a possible ABBA deadlock
dm raid: select the Kconfig option CONFIG_MD_RAID0
dm btree: fix for dm_btree_find_lowest_key()
infiniband: call ipv6 route lookup via the stub interface
tpm_crb: check for bad response size
ARM: tegra: paz00: Mark panel regulator as enabled on boot
USB: core: replace %p with %pK
char: lp: fix possible integer overflow in lp_setup()
watchdog: pcwd_usb: fix NULL-deref at probe
USB: ene_usb6250: fix DMA to the stack
usb: misc: legousbtower: Fix memory leak
usb: misc: legousbtower: Fix buffers on stack
ANDROID: uid_sys_stats: defer io stats calulation for dead tasks
ANDROID: AVB: Fix linter errors.
ANDROID: AVB: Fix invalidate_vbmeta_submit().
ANDROID: sdcardfs: Check for NULL in revalidate
Linux 4.4.69
ipmi: Fix kernel panic at ipmi_ssif_thread()
wlcore: Add RX_BA_WIN_SIZE_CHANGE_EVENT event
wlcore: Pass win_size taken from ieee80211_sta to FW
mac80211: RX BA support for sta max_rx_aggregation_subframes
mac80211: pass block ack session timeout to to driver
mac80211: pass RX aggregation window size to driver
Bluetooth: hci_intel: add missing tty-device sanity check
Bluetooth: hci_bcm: add missing tty-device sanity check
Bluetooth: Fix user channel for 32bit userspace on 64bit kernel
tty: pty: Fix ldisc flush after userspace become aware of the data already
serial: omap: suspend device on probe errors
serial: omap: fix runtime-pm handling on unbind
serial: samsung: Use right device for DMA-mapping calls
arm64: KVM: Fix decoding of Rt/Rt2 when trapping AArch32 CP accesses
padata: free correct variable
CIFS: add misssing SFM mapping for doublequote
cifs: fix CIFS_IOC_GET_MNT_INFO oops
CIFS: fix mapping of SFM_SPACE and SFM_PERIOD
SMB3: Work around mount failure when using SMB3 dialect to Macs
Set unicode flag on cifs echo request to avoid Mac error
fs/block_dev: always invalidate cleancache in invalidate_bdev()
ceph: fix memory leak in __ceph_setxattr()
fs/xattr.c: zero out memory copied to userspace in getxattr
ext4: evict inline data when writing to memory map
IB/mlx4: Reduce SRIOV multicast cleanup warning message to debug level
IB/mlx4: Fix ib device initialization error flow
IB/IPoIB: ibX: failed to create mcg debug file
IB/core: Fix sysfs registration error flow
vfio/type1: Remove locked page accounting workqueue
dm era: save spacemap metadata root after the pre-commit
crypto: algif_aead - Require setkey before accept(2)
block: fix blk_integrity_register to use template's interval_exp if not 0
KVM: arm/arm64: fix races in kvm_psci_vcpu_on
KVM: x86: fix user triggerable warning in kvm_apic_accept_events()
um: Fix PTRACE_POKEUSER on x86_64
x86, pmem: Fix cache flushing for iovec write < 8 bytes
selftests/x86/ldt_gdt_32: Work around a glibc sigaction() bug
x86/boot: Fix BSS corruption/overwrite bug in early x86 kernel startup
usb: hub: Do not attempt to autosuspend disconnected devices
usb: hub: Fix error loop seen after hub communication errors
usb: Make sure usb/phy/of gets built-in
usb: misc: add missing continue in switch
staging: comedi: jr3_pci: cope with jiffies wraparound
staging: comedi: jr3_pci: fix possible null pointer dereference
staging: gdm724x: gdm_mux: fix use-after-free on module unload
staging: vt6656: use off stack for out buffer USB transfers.
staging: vt6656: use off stack for in buffer USB transfers.
USB: Proper handling of Race Condition when two USB class drivers try to call init_usb_class simultaneously
USB: serial: ftdi_sio: add device ID for Microsemi/Arrow SF2PLUS Dev Kit
usb: host: xhci: print correct command ring address
iscsi-target: Set session_fall_back_to_erl0 when forcing reinstatement
target: Convert ACL change queue_depth se_session reference usage
target/fileio: Fix zero-length READ and WRITE handling
target: Fix compare_and_write_callback handling for non GOOD status
xen: adjust early dom0 p2m handling to xen hypervisor behavior
ANDROID: AVB: Only invalidate vbmeta when told to do so.
ANDROID: sdcardfs: Move top to its own struct
ANDROID: lowmemorykiller: account for unevictable pages
ANDROID: usb: gadget: fix NULL pointer issue in mtp_read()
ANDROID: usb: f_mtp: return error code if transfer error in receive_file_work function
Signed-off-by: Blagovest Kolenichev <bkolenichev@codeaurora.org>
Conflicts:
drivers/usb/gadget/function/f_mtp.c
fs/ext4/page-io.c
net/mac80211/agg-rx.c
Change-Id: Id65e75bf3bcee4114eb5d00730a9ef2444ad58eb
Signed-off-by: Blagovest Kolenichev <bkolenichev@codeaurora.org>
Add appropriate #ifdef guards to ensure the SMP-only eas_stats structs
are not used when SMP is not enabled. Arnd got a report from buildbot,
analysed it, and pointed out exactly what the issue was.
Reported-by: "Arnd Bergmann" <arnd@arndb.de>
Suggested-by: "Arnd Bergmann" <arnd@arndb.de>
Fixes: 4b85765a3d ("sched/fair: Add eas (& cas) specific rq, sd and task stats")
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Change-Id: I60554dea20137f6774db3f59b4afd40a06554cfc
When EAS is enabled during boot, we have to be careful not to use
schedtune from fair.c before it is ready or it will warn us and we'll
get a backtrace on the console.
Change-Id: I1a5cf29b18af626545c636c51219f9ed497c19fa
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
The sched_energy_diff tracepoint is in a place where it can never trace
payoff or nrg.delta. If CONFIG_SCHED_TUNE is enabled, put it in
a place where those values exist. If it is not enabled, trace from
the current location.
Change-Id: Id5442f2b34ec76625491d27c0f4285433ca12699
Reported-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
We use 5 groups everywhere else, so this should default to the same.
Change-Id: I05a20bdcf8046ea90a2e36979940cef11246e735
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
When using WALT we always use boosted cpu util for OPP selection.
This is the primary purpose for boosted cpu util, but we hadn't
changed the PELT utilization check to do the same thing.
Fix that here.
Change-Id: Id5ffb26eac23b25fe754255221f6d21b8cededfd
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
sched_group_energy() was supposed to support per-cpu capacity states
(DVFS), however, while fixing a hotplug issue this was broken as we bail
out if there is no SD_SHARE_CAP_STATES flag set.
This patch implements the hotplug race check differently and should
therefore reinstate support for per-cpu capacity states.
Change-Id: I5b865666c9ce833dcfa6514c574580d75aa0a195
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
In some cases, the new_util of a task can be the same on several
CPUs. This causes an issue because the target_util is only updated
if the current new_util is strictly smaller than target_util.
To fix that, the cpu_util_wake() return value is used alongside the
new_util value. If two CPUs compute the same new_util value,
we'll now also look at their cpu_util_wake() return value. In this
case, the CPU that last ran the task will be preferred.
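A hedged sketch of the tie-break inside the candidate loop of
find_best_target() (variable names here are illustrative rather than the
exact ones in the tree):

  /* Prefer a strictly smaller new_util; on a tie, prefer the cpu whose
   * wake util is lower, i.e. the one that last ran the task. */
  if (new_util < target_util ||
      (new_util == target_util &&
       cpu_util_wake(i, p) < min_wake_util)) {
          min_wake_util = cpu_util_wake(i, p);
          target_util = new_util;
          target_cpu = i;
  }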
Change-Id: Ia1ea2c4b3ec39621372c2f748862317d5b497723
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
wake_cap performs task and cpu utilization synchronization which is
what allows us to subtract current task util from prev_cpu util and
have a sensible number to work with.
It looks as though if wake_wide returns 0, we could potentially not
execute wake_cap, which would result in unsynced signals we then use
for energy calculations.
This is not necessarily an issue we've seen in traces, but it looks
as though it should be changed.
Change-Id: Ic54a3cba2a10d946ea20113a04371dea04115e82
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
When we place a waking task with find_best_target, we calculate the
existing and new utilisation of each candidate cpu. However, we do
not remove any blocked load resulting from the waking task on the
previous cpu which might cause unnecessary migrations.
Switch to using cpu_util_wake which does this for us, which requires
moving cpu_util_wake a few functions earlier.
Also, we have multiple potential cpu utilization signals here, so
update the necessary bits to allow WALT to work properly (including
not subtracting task util for WALT).
When WALT is in use, cpu utilization is the utilization
in the previous completed window, whilst the task utilization
ignores fully idle windows. There seems to be no way to have a
decently accurate estimate of how much (if any) utilization from
this task remains on the prev cpu.
Instead, just return cpu_util when we're using WALT.
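For reference, a sketch of the cpu_util_wake() shape being relied on here,
including the WALT special case just described (approximate; field and sysctl
names are as recalled from the android-4.4 EAS tree):

  static unsigned long cpu_util_wake(int cpu, struct task_struct *p)
  {
          unsigned long util, capacity;

  #ifdef CONFIG_SCHED_WALT
          /* No sensible way to split this task's share out of the previous
           * window, so just report the plain cpu utilization. */
          if (!walt_disabled && sysctl_sched_use_walt_cpu_util)
                  return cpu_util(cpu);
  #endif
          /* Task has no contribution to this cpu, or is brand new */
          if (cpu != task_cpu(p) || !p->se.avg.last_update_time)
                  return cpu_util(cpu);

          capacity = capacity_orig_of(cpu);
          util = max_t(long, cpu_util(cpu) - task_util(p), 0);

          return (util >= capacity) ? capacity : util;
  }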
Change-Id: I448203ab98ffb5c020dfb6b218581eef1f5601f7
Reported-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
With the ability to choose between WALT and PELT for utilisation tracking
we can have the situation where we're using WALT to make all the
decisions and reporting PELT figures in the sched_load_avg_(cpu|task)
trace points. This is not too much of an issue, but when analysing traces
it is nice to see numbers representing what the scheduler is using rather
than needing to add in additional sched_walt_* traces to figure it out.
Add reporting for both types, and make the util_avg member reflect what
will be seen from cpu or task_util functions in the scheduler.
Change-Id: I2abbd2c5fa70822096d0f3372b4c12b1c6af1590
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
For Energy-Aware Scheduling (EAS) to work properly, even in the
case that there is only one cpu per cluster or that cpus are hot-plugged
out, the Energy Model (EM) data on all energy-aware sched domains (sd)
has to be present for all online cpus.
Mainline sd hierarchy setup code will remove sd's which are not useful
for task scheduling e.g. in the following situations:
1. Only 1 cpu is/remains in one cluster of a multi cluster system.
This remaining cpu only has DIE and no MC sd.
2. A complete cluster in a two cluster system is hot-plugged out.
The cpus of the remaining cluster only have MC and no DIE sd.
To make sure that all online cpus keep all their energy-aware sd's,
the sd degenerate functionality has been changed to not free a sd if
its first sched group (sg) contains EM data in case:
1. There is only 1 cpu left in the sd.
2. There have to be at least 2 sg's if certain sd flags are set.
Instead of freeing such a sd it now clears only its SD_LOAD_BALANCE
flag. This will make sure that the EAS functionality will always see
all energy-aware sd's for all online cpus.
It will introduce a tiny performance degradation for operations on
affected cpus since the hot-path macro for_each_domain() has to deal
with sd's not contributing to task scheduling at all now.
In most cases the existing code makes sure that task scheduling is not
invoked on a sd with !SD_LOAD_BALANCE.
However, a small change is necessary in update_sd_lb_stats() to make
sure that sd->parent is only initialized to !NULL in case the parent sd
contains more than 1 sg.
The handling of newidle decay values before the SD_LOAD_BALANCE check in
rebalance_domains() stays unchanged.
Test (w/ CONFIG_SCHED_DEBUG):
JUNO r0 default system:
$ cat /proc/cpuinfo | grep "^CPU part"
CPU part : 0xd03
CPU part : 0xd07
CPU part : 0xd07
CPU part : 0xd03
CPU part : 0xd03
CPU part : 0xd03
SD names and flags:
$ cat /proc/sys/kernel/sched_domain/cpu*/domain*/name
MC
DIE
MC
DIE
MC
DIE
MC
DIE
MC
DIE
MC
DIE
$ printf "%x\n" `cat /proc/sys/kernel/sched_domain/cpu*/domain*/flags`
832f
102f
832f
102f
832f
102f
832f
102f
832f
102f
832f
102f
Test 1: Hotplug-out one A57 (CPU part 0xd07) cpu:
$ echo 0 > /sys/devices/system/cpu/cpu1/online
$ cat /proc/cpuinfo | grep "^CPU part"
CPU part : 0xd03
CPU part : 0xd07
CPU part : 0xd03
CPU part : 0xd03
CPU part : 0xd03
SD names and flags for remaining A57 (cpu2) cpu:
$ cat /proc/sys/kernel/sched_domain/cpu2/domain*/name
MC
DIE
$ printf "%x\n" `cat /proc/sys/kernel/sched_domain/cpu2/domain*/flags`
832e <-- MC SD with !SD_LOAD_BALANCE
102f
Test 2: Hotplug-out the entire A57 cluster:
$ echo 0 > /sys/devices/system/cpu/cpu1/online
$ echo 0 > /sys/devices/system/cpu/cpu2/online
$ cat /proc/cpuinfo | grep "^CPU part"
CPU part : 0xd03
CPU part : 0xd03
CPU part : 0xd03
CPU part : 0xd03
SD names and flags for the remaining A53 (CPU part 0xd03) cluster:
$ cat /proc/sys/kernel/sched_domain/cpu*/domain*/name
MC
DIE
MC
DIE
MC
DIE
MC
DIE
$ printf "%x\n" `cat /proc/sys/kernel/sched_domain/cpu*/domain*/flags`
832f
102e <-- DIE SD with !SD_LOAD_BALANCE
832f
102e
832f
102e
832f
102e
Change-Id: If24aa2b2628f334abbf0207d39e2a86168d9d673
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
The update of the share of a cfs_rq is done when its load_avg is updated
but before the group_entity's load_avg has been updated for the past time
slot. This generates wrong load_avg accounting which can be significant
when small tasks are involved in the scheduling.
Let's take the example of a task "a" that is dequeued from its task group A:
      root
    (cfs_rq)
       \
      (se)
       A
    (cfs_rq)
       \
      (se)
       a
Task "a" was the only task in task group A which becomes idle when a is
dequeued.
We have the sequence:
- dequeue_entity a->se
    - update_load_avg(a->se)
    - dequeue_entity_load_avg(A->cfs_rq, a->se)
    - update_cfs_shares(A->cfs_rq)
        A->cfs_rq->load.weight == 0
        A->se->load.weight is updated with the new share (0 in this case)
- dequeue_entity A->se
    - update_load_avg(A->se) but its weight is now null so the last time
      slot (up to a tick) will be accounted with a weight of 0 instead of
      its real weight during the time slot. The last time slot will be
      accounted as an idle one whereas it was a running one.
If the running time of task a is short enough that no tick happens when it
runs, all running time of group entity A->se will be accounted as idle
time.
Instead, we should update the share of a cfs_rq (in fact the weight of its
group entity) only after having updated the load_avg of the group_entity.
update_cfs_shares() now takes the sched_entity as a parameter instead of the
cfs_rq, and the weight of the group_entity is updated only once its load_avg
has been synced with current time.
Change-Id: Id6ce3be1767b44b444ce2a77ed1ba063e57c0664
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: pjt@google.com
Link: http://lkml.kernel.org/r/1482335426-7664-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 89ee048f3cc796db6f26906c6bef4edf0bee70fd)
[minor cherry pick stuff]
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Commit:
fde7d22e01 ("sched/fair: Fix overly small weight for interactive group entities")
did something non-obvious but also did it buggy yet latent.
The problem was exposed for real by a later commit in the v4.7 merge window:
2159197d6677 ("sched/core: Enable increased load resolution on 64-bit kernels")
... after which tg->load_avg and cfs_rq->load.weight had different
units (10 bit fixed point and 20 bit fixed point resp.).
Add a comment to explain the use of cfs_rq->load.weight over the
'natural' cfs_rq->avg.load_avg and add scale_load_down() to correct
for the difference in unit.
Since this is (now, as per a previous commit) the only user of
calc_tg_weight(), collapse it.
The effects of this bug should be randomly inconsistent SMP-balancing
of cgroup workloads.
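A hedged sketch of the resulting calc_cfs_shares() core after collapsing
calc_tg_weight() into it (recalled from the upstream change; treat the details
as approximate):

  static long calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg)
  {
          long tg_weight, load, shares;

          /*
           * This really should be: cfs_rq->avg.load_avg, but instead we use
           * cfs_rq->load.weight, which is its upper bound.  scale_load_down()
           * brings it into the same (10-bit) unit as tg->load_avg.
           */
          load = scale_load_down(cfs_rq->load.weight);

          tg_weight = atomic_long_read(&tg->load_avg);

          /* Ensure tg_weight >= load */
          tg_weight -= cfs_rq->tg_load_avg_contrib;
          tg_weight += load;

          shares = (tg->shares * load);
          if (tg_weight)
                  shares /= tg_weight;

          if (shares < MIN_SHARES)
                  shares = MIN_SHARES;
          if (shares > tg->shares)
                  shares = tg->shares;

          return shares;
  }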
Change-Id: If1e565662ea163485edd94a12aef644d0e0dfe7a
Reported-by: Jirka Hladky <jhladky@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 2159197d6677 ("sched/core: Enable increased load resolution on 64-bit kernels")
Fixes: fde7d22e01 ("sched/fair: Fix overly small weight for interactive group entities")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit ea1dc6fc6242f991656e35e2ed3d90ec1cd13418)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
A scheduler performance regression has been reported by Joseph Salisbury,
which he bisected back to:
3d30544f0212 ("sched/fair: Apply more PELT fixes")
The regression triggers when several levels of task groups are involved
(read: SystemD) and cpu_possible_mask != cpu_present_mask.
The root cause is that group entity's load (tg_child->se[i]->avg.load_avg)
is initialized to scale_load_down(se->load.weight). During the creation of
a child task group, its group entities on possible CPUs are attached to
parent's cfs_rq (tg_parent) and their loads are added to the parent's load
(tg_parent->load_avg) with update_tg_load_avg().
But only the load on online CPUs will then be updated to reflect real load,
whereas load on other CPUs will stay at the initial value.
The result is a tg_parent->load_avg that is higher than the real load, the
weight of group entities (tg_parent->se[i]->load.weight) on online CPUs is
smaller than it should be, and the task group gets less running time than
it could expect.
( This situation can be detected with /proc/sched_debug. The ".tg_load_avg"
of the task group will be much higher than sum of ".tg_load_avg_contrib"
of online cfs_rqs of the task group. )
The load of group entities doesn't have to be initialized to anything other
than 0 because their load will increase when an entity is attached.
Change-Id: Ie55021ff98ba49016adfddb2444e9c9709939226
Reported-by: Joseph Salisbury <joseph.salisbury@canonical.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: <stable@vger.kernel.org> # 4.8.x
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: joonwoop@codeaurora.org
Fixes: 3d30544f0212 ("sched/fair: Apply more PELT fixes")
Link: http://lkml.kernel.org/r/1476881123-10159-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit b5a9b340789b2b24c6896bcf7a065c31a4db671c)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Starting with the following commit:
fde7d22e01 ("sched/fair: Fix overly small weight for interactive group entities")
calc_tg_weight() doesn't compute the right value as expected by effective_load().
The difference is in the 'correction' term. In order to ensure \Sum
rw_j >= rw_i we cannot use tg->load_avg directly, since that might be
lagging a correction on the current cfs_rq->avg.load_avg value.
Therefore we use tg->load_avg - cfs_rq->tg_load_avg_contrib +
cfs_rq->avg.load_avg.
Now, per the referenced commit, calc_tg_weight() doesn't use
cfs_rq->avg.load_avg, as is later used in @w, but uses
cfs_rq->load.weight instead.
So stop using calc_tg_weight() and do it explicitly.
The effects of this bug are wake_affine() making randomly
poor choices in cgroup-intense workloads.
Change-Id: I1c0058ff674650cf295c8dc3b88a5a3de4bddab0
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@vger.kernel.org> # v4.3+
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: fde7d22e01 ("sched/fair: Fix overly small weight for interactive group entities")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 7dd4912594daf769a46744848b05bd5bc6d62469)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
A task can be asynchronously detached from cfs_rq when migrating
between CPUs. The load of the migrated task is then removed from
source cfs_rq during its next update. We use this event to set
propagation flag.
During the load balance, we take advantage of the update of blocked
load to propagate any pending changes.
The propagation relies on patch:
"sched: Fix hierarchical order in rq->leaf_cfs_rq_list"
... which orders children and parents, to ensure that it's done in one pass.
Change-Id: I33782e35fc4711f5901e8c23d6aa7ec5f2ff7ee5
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: kernellwp@gmail.com
Cc: pjt@google.com
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1478598827-32372-6-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 4e5160766fcc9f41bbd38bac11f92dce993644aa)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
When a task moves from/to a cfs_rq, we set a flag which is then used to
propagate the change at parent level (sched_entity and cfs_rq) during
next update. If the cfs_rq is throttled, the flag will stay pending until
the cfs_rq is unthrottled.
For propagating the utilization, we copy the utilization of group cfs_rq to
the sched_entity.
For propagating the load, we have to take into account the load of the
whole task group in order to evaluate the load of the sched_entity.
Similarly to what was done before the rewrite of PELT, we add a correction
factor in case the task group's load is greater than its share so it will
contribute the same load of a task of equal weight.
Change-Id: Id34a9888484716961c9027299c0b4d82881a39d1
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: kernellwp@gmail.com
Cc: pjt@google.com
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1478598827-32372-5-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 09a43ace1f986b003c118fdf6ddf1fd685692d49)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Fix the insertion of cfs_rq in rq->leaf_cfs_rq_list to ensure that a
child will always be called before its parent.
The hierarchical order in shares update list has been introduced by
commit:
67e86250f8 ("sched: Introduce hierarchal order on shares update list")
With the current implementation a child can be still put after its
parent.
Let's take the example of:
     root
      \
       b
      /\
     c  d*
        |
        e*
with root -> b -> c already enqueued but not d -> e so the
leaf_cfs_rq_list looks like: head -> c -> b -> root -> tail
The branch d -> e will be added the first time that they are enqueued,
starting with e then d.
When e is added, its parent is not already on the list so e is put at
the tail : head -> c -> b -> root -> e -> tail
Then, d is added at the head because its parent is already on the
list: head -> d -> c -> b -> root -> e -> tail
e is not placed at the right position and will be called last
whereas it should be called first.
Because it follows the bottom-up enqueue sequence, we are sure that we
will finish by adding either a cfs_rq without a parent or a cfs_rq whose
parent is already on the list. We can use this event to detect
when we have finished adding a new branch. For the others, whose
parents are not already added, we have to ensure that they will be
added after their children that have just been inserted the steps
before, and after any potential parents that are already in the list.
The easiest way is to put the cfs_rq just after the last inserted one
and to keep track of it until the branch is fully added.
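A hedged sketch of the resulting insertion logic (the tmp_alone_branch cursor
name and the exact structure are recalled from the upstream patch and may
differ here):

  if (!cfs_rq->on_list) {
          if (cfs_rq->tg->parent &&
              cfs_rq->tg->parent->cfs_rq[cpu]->on_list) {
                  /* Parent already on the list: insert the child right
                   * before it and close the pending branch. */
                  list_add_tail_rcu(&cfs_rq->leaf_cfs_rq_list,
                          &(cfs_rq->tg->parent->cfs_rq[cpu]->leaf_cfs_rq_list));
                  rq->tmp_alone_branch = &rq->leaf_cfs_rq_list;
          } else if (!cfs_rq->tg->parent) {
                  /* Root cfs_rq: the whole branch is now connected. */
                  list_add_tail_rcu(&cfs_rq->leaf_cfs_rq_list,
                                    &rq->leaf_cfs_rq_list);
                  rq->tmp_alone_branch = &rq->leaf_cfs_rq_list;
          } else {
                  /* Parent not added yet: chain the cfs_rq right after the
                   * last one inserted and remember it until the branch is
                   * fully connected. */
                  list_add_rcu(&cfs_rq->leaf_cfs_rq_list,
                               rq->tmp_alone_branch);
                  rq->tmp_alone_branch = &cfs_rq->leaf_cfs_rq_list;
          }
          cfs_rq->on_list = 1;
  }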
Change-Id: I4fe0b8502ea628c13d14e8e5c5279bce67fb8845
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: kernellwp@gmail.com
Cc: pjt@google.com
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1478598827-32372-3-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 9c2791f936ef5fd04a118b5c284f2c9a95f4a647)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Every time we modify load/utilization of sched_entity, we start to
sync it with its cfs_rq. This update is done in different ways:
- when attaching/detaching a sched_entity, we update cfs_rq and then
we sync the entity with the cfs_rq.
- when enqueueing/dequeuing the sched_entity, we update both
sched_entity and cfs_rq metrics to now.
Use update_load_avg() every time we have to update and sync the cfs_rq and
sched_entity before changing the state of a sched_entity.
Change-Id: Ibde9a7e07ac80e9d5753bb4a0c30dfb3643cc666
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: kernellwp@gmail.com
Cc: pjt@google.com
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1478598827-32372-4-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
[backported FROMLIST]
Signed-off-by: Andres Oportus <andresoportus@google.com>
(cherry picked from commit d31b1a66cbe0931733583ad9d9e8c6cfd710907d)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Vincent noted that the update_tg_load_avg() usage in commit:
3d30544f0212 ("sched/fair: Apply more PELT fixes")
isn't entirely sufficient. We need to call this function every time
cfs_rq->avg.load changes; this includes when update_cfs_rq_load_avg()
returns true, but {attach,detach}_entity_load_avg() themselves also
change it. This means we need to unconditionally call
update_tg_load_avg().
Also, add more comments.
Change-Id: I7e55fceb587601f73c760c8b0d47a7ef2b777b9e
Reported-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 7c3edd2c300b7ef2005a69dc727692ee07434aa5)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
One additional 'rule' for using update_cfs_rq_load_avg() is that one
should call update_tg_load_avg() if it returns true.
Add a bunch of comments to hopefully clarify some of the rules:
o You need to update cfs_rq _before_ any entity attach/detach,
this is important, because while for mathematical consistency this
isn't strictly needed, it is required for the physical
interpretation of the model, you attach/detach _now_.
o When you modify the cfs_rq avg, you have to then call
update_tg_load_avg() in order to propagate changes upwards.
o (Fair) entities are always attached, switched_{to,from}_fair()
deal with !fair. This directly follows from the definition of the
cfs_rq averages, namely that they are a direct sum of all
(runnable or blocked) entities on that rq.
It is the second rule that this patch enforces, but it adds comments
pertaining to all of them.
Change-Id: Icdc906e98c67b84cb9582c893bc761a9886be57a
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 3d30544f02120b884bba2a9466c87dba980e3be5)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
A new task's util_avg is set to full utilization of a CPU (100% time
running). This accelerates a new task's utilization ramp-up, useful to
boost its execution in early time. However, it may result in
(insanely) high utilization for a transient time period when a flood
of tasks are spawned. Importantly, it violates the "fundamentally
bounded" CPU utilization, and its side effect is negative if we don't
take any measure to bound it.
This patch proposes an algorithm to address this issue. It has
two methods to approach a sensible initial util_avg:
(1) An expected (or average) util_avg based on its cfs_rq's util_avg:
util_avg = cfs_rq->util_avg / (cfs_rq->load_avg + 1) * se.load.weight
(2) A trajectory of how successive new tasks' util develops, which
gives 1/2 of the left utilization budget to a new task such that
the additional util is noticeably large (when overall util is low) or
unnoticeably small (when overall util is high enough). In the meantime,
the aggregate utilization is well bounded:
util_avg_cap = (1024 - cfs_rq->avg.util_avg) / 2^n
where n denotes the nth task.
If util_avg is larger than util_avg_cap, then the effective util is
clamped to the util_avg_cap.
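A hedged sketch of the resulting initialization, modeled on
post_init_entity_util_avg() as it went upstream (constants and field paths may
differ in this tree):

  void post_init_entity_util_avg(struct sched_entity *se)
  {
          struct cfs_rq *cfs_rq = cfs_rq_of(se);
          struct sched_avg *sa = &se->avg;
          long cap = (long)(SCHED_CAPACITY_SCALE - cfs_rq->avg.util_avg) / 2;

          if (cap > 0) {
                  if (cfs_rq->avg.util_avg != 0) {
                          /* (1) inherit the cfs_rq's average util, scaled
                           * by the new task's weight ... */
                          sa->util_avg  = cfs_rq->avg.util_avg * se->load.weight;
                          sa->util_avg /= (cfs_rq->avg.load_avg + 1);
                          /* ... but never exceed (2)'s remaining budget. */
                          if (sa->util_avg > cap)
                                  sa->util_avg = cap;
                  } else {
                          /* (2) half of the remaining utilization budget. */
                          sa->util_avg = cap;
                  }
                  sa->util_sum = sa->util_avg * LOAD_AVG_MAX;
          }
  }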
Change-Id: Idafe989b24d9e70911666f09800bf1d5a011e1f4
Reported-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Yuyang Du <yuyang.du@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: morten.rasmussen@arm.com
Cc: pjt@google.com
Cc: steve.muckle@linaro.org
Link: http://lkml.kernel.org/r/1459283456-21682-1-git-send-email-yuyang.du@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 2b8c41daba327c633228169e8bd8ec067ab443f8)
[integrate with schedfreq - schedfreq has a tuneable for init task util
but this commit removes the use of the tuneable since we have a new
algorithm for calculating an initial utilisation. I've left the tuneable
in place, but it is no longer used even when schedfreq is the CPUFreq
governor]
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Rename best_idle to best_idle_cpu so the same name is used like in
find_best_target().
Fix if (best_idle > 0) since best_idle_cpu = 0 is a valid target.
Use 'unsigned long' data type for best_idle_capacity.
Since we're looking for the shallowest best_idle_cstate, initialize
best_idle_cstate = INT_MAX. For cpus which are not idle (idle_idx = -1)
the condition 'if (idle_idx < best_idle_cstate && ...)' is never
executed.
Change-Id: Ic5b63d58478696b3d1ec6253cf739a69a574cf99
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
(cherry picked from commit 8bff5e9c0968108d465e1f2a4624fc5ec2f00849)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Simplify backup_capacity handling and use 'unsigned long'
data type for cpu capacity, simplify target_util handling,
simplify idle_idx handling & refactor min_util, new_util.
Also return first idle cpu for prefer_idle task immediately.
Change-Id: Ic89e140f7b369f3965703fdc8463013d16e9b94a
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
The schedtune task parameter 'boosted' is mapped into the cpu iteration
order. Currently for 'boosted' equal true the iteration starts at the
last cpu (NR_CPUS-1) whereas for 'boosted' equal false it starts at the
first cpu (0).
This only has the desired effect if the cpu topology ordering matches
the underlying assumption. This is e.g. the case for the
Qc Snapdragon 821 with its [L0 L1 b0 b1] cpu topology layout
(L=lower max freq, b=higher max freq). This results in cpus with higher
maximum capacity being given the highest logical cpu ids. However not
all big.LITTLE systems enumerate their cpus in the same way. For example,
the ARM Versatile Express Juno board has 6 cpus for which the default
configuration has topology [L0 b0 b1 L1 L2 L3].
To make this approach independent from the cpu topology layout it now
iterates over the cpus in the order of the sched_groups of the EAS
sched_domain (sd_ea). The order of cpu iteration is different for the
different cpu types in case the cpu is used to dereference sd_ea.
Considering the Qc Snapdragon 821 again, for cpu L0 and L1 the order is
'L0->L1->b0->b1' whereas for b0 and b1 the order is 'b0->b1->L0->L1'.
This approach does not allow the exact same iteration order as with the
currently used flat iteration over [0 .. NR_CPUS-1] but the cpus
are ordered by the original cpu capacity.
The cpu iteration is now done in the sd_ea sched_group order required by
the 'boosted' value ['L0->L1->b0->b1'/'b0->b1->L0->L1'] rather than
forward/backward over the flat cpu space ['L0->L1->b0->b1'/
'b1->b0->L1->L0'].
Change-Id: I8fbe2073dedd2ecb1c750620c6000c11a5ff4358
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
(cherry picked from commit a0c6a4272c3968c0ff50d3fed65f5865b72d777b)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
This will allow to start iterating from a cpu with max or min original
capacity in the wakeup path regardless on which cpu the scheduler is
currently running (smp_processor_id()) or the previous cpu of the task
(task_cpu(p)). This iteration has to happen on a sched_domain spanning
all cpus in the order of the sched_groups of this sched_domain seen by
the starting cpu.
In case of an SMP system, the first cpu with max orig capacity and the
one with min orig capacity are the same. This can temporarily happen
on a big.LITTLE system with hotplug as well.
E.g. the different order of cpu iteration can be used to map schedtune
task parameter 'boosted' into the cpu iteration order in
find_best_target().
Use of READ_ONCE()/WRITE_ONCE() to avoid load/store tearing.
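For illustration only (the rd->max/min_cap_orig_cpu field names are an
assumption), the accesses look like:

  /* writer, e.g. while (re)building the sched domains */
  WRITE_ONCE(rd->max_cap_orig_cpu, max_cpu);
  WRITE_ONCE(rd->min_cap_orig_cpu, min_cpu);

  /* reader, in the wakeup path */
  int start_cpu = boosted ? READ_ONCE(rd->max_cap_orig_cpu)
                          : READ_ONCE(rd->min_cap_orig_cpu);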
Change-Id: I812fbd9c7e5f506617e456c0eec3edcd2c016e92
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
(cherry picked from commit fd6e9543c1fd8971a5e2e68e39b2f6e591d46114)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Commit fd5c98da1a42 "WIP: sched: Store system-wide maximum cpu capacity
in root domain" was repalced by commit 8148bdfff4f5 "WIP: sched: Update
max cpu capacity in case of max frequency constraints" which didn't
remove all the now unused bits.
Change-Id: I067f6366431f43337cffa7a2a8e0de32dd33d2f9
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
(cherry picked from commit 6d284a607cec51bcafca313bc396bc3103b1e876)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
With the new wakeup approach this sysctl is not necessary any more.
Change-Id: I52114b3c918791f6a4f9f30f50002919ccbc1a9c
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
(cherry picked from commit 885c0d503bcdf0ef4e9b46822496f16b20aa3bbd)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>