* refs/heads/tmp-d6bbe8b
Linux 4.4.127
Revert "ip6_vti: adjust vti mtu according to mtu of lower device"
net: cavium: liquidio: fix up "Avoid dma_unmap_single on uninitialized ndata"
spi: davinci: fix up dma_mapping_error() incorrect patch
Revert "mtip32xx: use runtime tag to initialize command header"
Revert "cpufreq: Fix governor module removal race"
Revert "ARM: dts: omap3-n900: Fix the audio CODEC's reset pin"
Revert "ARM: dts: am335x-pepper: Fix the audio CODEC's reset pin"
Revert "PCI/MSI: Stop disabling MSI/MSI-X in pci_device_shutdown()"
nospec: Kill array_index_nospec_mask_check()
nospec: Move array_index_nospec() parameter checking into separate macro
net: hns: Fix ethtool private flags
md/raid10: reset the 'first' at the end of loop
ARM: dts: am57xx-beagle-x15-common: Add overide powerhold property
ARM: dts: dra7: Add power hold and power controller properties to palmas
Documentation: pinctrl: palmas: Add ti,palmas-powerhold-override property definition
vt: change SGR 21 to follow the standards
Input: i8042 - enable MUX on Sony VAIO VGN-CS series to fix touchpad
Input: i8042 - add Lenovo ThinkPad L460 to i8042 reset list
staging: comedi: ni_mio_common: ack ai fifo error interrupts.
fs/proc: Stop trying to report thread stacks
crypto: x86/cast5-avx - fix ECB encryption when long sg follows short one
crypto: ahash - Fix early termination in hash walk
parport_pc: Add support for WCH CH382L PCI-E single parallel port card.
media: usbtv: prevent double free in error case
mei: remove dev_err message on an unsupported ioctl
USB: serial: cp210x: add ELDAT Easywave RX09 id
USB: serial: ftdi_sio: add support for Harman FirmwareHubEmulator
USB: serial: ftdi_sio: add RT Systems VX-8 cable
usb: dwc2: Improve gadget state disconnection handling
scsi: virtio_scsi: always read VPD pages for multiqueue too
llist: clang: introduce member_address_is_nonnull()
Bluetooth: Fix missing encryption refresh on Security Request
netfilter: x_tables: add and use xt_check_proc_name
netfilter: bridge: ebt_among: add more missing match size checks
xfrm: Refuse to insert 32 bit userspace socket policies on 64 bit systems
net: xfrm: use preempt-safe this_cpu_read() in ipcomp_alloc_tfms()
RDMA/ucma: Introduce safer rdma_addr_size() variants
RDMA/ucma: Don't allow join attempts for unsupported AF family
RDMA/ucma: Check that device exists prior to accessing it
RDMA/ucma: Check that device is connected prior to access it
RDMA/ucma: Ensure that CM_ID exists prior to access it
RDMA/ucma: Fix use-after-free access in ucma_close
RDMA/ucma: Check AF family prior resolving address
xfrm_user: uncoditionally validate esn replay attribute struct
arm64: avoid overflow in VA_START and PAGE_OFFSET
selinux: Remove redundant check for unknown labeling behavior
netfilter: ctnetlink: Make some parameters integer to avoid enum mismatch
tty: provide tty_name() even without CONFIG_TTY
audit: add tty field to LOGIN event
frv: declare jiffies to be located in the .data section
jiffies.h: declare jiffies and jiffies_64 with ____cacheline_aligned_in_smp
fs: compat: Remove warning from COMPATIBLE_IOCTL
selinux: Remove unnecessary check of array base in selinux_set_mapping()
cpumask: Add helper cpumask_available()
genirq: Use cpumask_available() for check of cpumask variable
netfilter: nf_nat_h323: fix logical-not-parentheses warning
Input: mousedev - fix implicit conversion warning
dm ioctl: remove double parentheses
PCI: Make PCI_ROM_ADDRESS_MASK a 32-bit constant
writeback: fix the wrong congested state variable definition
ACPI, PCI, irq: remove redundant check for null string pointer
kprobes/x86: Fix to set RWX bits correctly before releasing trampoline
usb: gadget: f_hid: fix: Prevent accessing released memory
usb: gadget: align buffer size when allocating for OUT endpoint
usb: gadget: fix usb_ep_align_maybe endianness and new usb_ep_align
usb: gadget: change len to size_t on alloc_ep_req()
usb: gadget: define free_ep_req as universal function
partitions/msdos: Unable to mount UFS 44bsd partitions
perf/hwbp: Simplify the perf-hwbp code, fix documentation
ALSA: pcm: potential uninitialized return values
ALSA: pcm: Use dma_bytes as size parameter in dma_mmap_coherent()
mtd: jedec_probe: Fix crash in jedec_read_mfr()
Replace #define with enum for better compilation errors.
Add missing include to drivers/tty/goldfish.c
Fix whitespace in drivers/tty/goldfish.c
ANDROID: fuse: Add null terminator to path in canonical path to avoid issue
ANDROID: sdcardfs: Fix sdcardfs to stop creating cases-sensitive duplicate entries.
ANDROID: add missing include to pdev_bus
ANDROID: pdev_bus: replace writel with gf_write_ptr
ANDROID: Cleanup type casting in goldfish.h
ANDROID: Include missing headers in goldfish.h
ANDROID: cpufreq: times: skip printing invalid frequencies
ANDROID: xt_qtaguid: Remove unnecessary null checks to device's name
ANDROID: xt_qtaguid: Remove unnecessary null checks to ifa_label
ANDROID: cpufreq: times: allocate enough space for a uid_entry
Linux 4.4.126
net: systemport: Rewrite __bcm_sysport_tx_reclaim()
net: fec: Fix unbalanced PM runtime calls
ieee802154: 6lowpan: fix possible NULL deref in lowpan_device_event()
s390/qeth: on channel error, reject further cmd requests
s390/qeth: lock read device while queueing next buffer
s390/qeth: when thread completes, wake up all waiters
s390/qeth: free netdevice when removing a card
team: Fix double free in error path
skbuff: Fix not waking applications when errors are enqueued
net: Only honor ifindex in IP_PKTINFO if non-0
netlink: avoid a double skb free in genlmsg_mcast()
net/iucv: Free memory obtained by kzalloc
net: ethernet: ti: cpsw: add check for in-band mode setting with RGMII PHY interface
net: ethernet: arc: Fix a potential memory leak if an optional regulator is deferred
l2tp: do not accept arbitrary sockets
ipv6: fix access to non-linear packet in ndisc_fill_redirect_hdr_option()
dccp: check sk for closed state in dccp_sendmsg()
net: Fix hlist corruptions in inet_evict_bucket()
Revert "genirq: Use irqd_get_trigger_type to compare the trigger type for shared IRQs"
scsi: sg: don't return bogus Sg_requests
Revert "genirq: Use irqd_get_trigger_type to compare the trigger type for shared IRQs"
UPSTREAM: drm: virtio-gpu: set atomic flag
UPSTREAM: drm: virtio-gpu: transfer dumb buffers to host on plane update
UPSTREAM: drm: virtio-gpu: ensure plane is flushed to host on atomic update
UPSTREAM: drm: virtio-gpu: get the fb from the plane state for atomic updates
Linux 4.4.125
bpf, x64: increase number of passes
bpf: skip unnecessary capability check
kbuild: disable clang's default use of -fmerge-all-constants
staging: lustre: ptlrpc: kfree used instead of kvfree
perf/x86/intel: Don't accidentally clear high bits in bdw_limit_period()
x86/entry/64: Don't use IST entry for #BP stack
x86/boot/64: Verify alignment of the LOAD segment
x86/build/64: Force the linker to use 2MB page size
kvm/x86: fix icebp instruction handling
tty: vt: fix up tabstops properly
can: cc770: Fix use after free in cc770_tx_interrupt()
can: cc770: Fix queue stall & dropped RTR reply
can: cc770: Fix stalls on rt-linux, remove redundant IRQ ack
staging: ncpfs: memory corruption in ncp_read_kernel()
mtd: nand: fsl_ifc: Fix nand waitfunc return value
tracing: probeevent: Fix to support minus offset from symbol
rtlwifi: rtl8723be: Fix loss of signal
brcmfmac: fix P2P_DEVICE ethernet address generation
acpi, numa: fix pxm to online numa node associations
drm: udl: Properly check framebuffer mmap offsets
drm/radeon: Don't turn off DP sink when disconnected
drm/vmwgfx: Fix a destoy-while-held mutex problem.
x86/mm: implement free pmd/pte page interfaces
mm/vmalloc: add interfaces to free unmapped page table
libata: Modify quirks for MX100 to limit NCQ_TRIM quirk to MU01 version
libata: Make Crucial BX100 500GB LPM quirk apply to all firmware versions
libata: Apply NOLPM quirk to Crucial M500 480 and 960GB SSDs
libata: Enable queued TRIM for Samsung SSD 860
libata: disable LPM for Crucial BX100 SSD 500GB drive
libata: Apply NOLPM quirk to Crucial MX100 512GB SSDs
libata: remove WARN() for DMA or PIO command without data
libata: fix length validation of ATAPI-relayed SCSI commands
Bluetooth: btusb: Fix quirk for Atheros 1525/QCA6174
clk: bcm2835: Protect sections updating shared registers
ahci: Add PCI-id for the Highpoint Rocketraid 644L card
PCI: Add function 1 DMA alias quirk for Highpoint RocketRAID 644L
mmc: dw_mmc: fix falling from idmac to PIO mode when dw_mci_reset occurs
ALSA: hda/realtek - Always immediately update mute LED with pin VREF
ALSA: aloop: Fix access to not-yet-ready substream via cable
ALSA: aloop: Sync stale timer before release
ALSA: usb-audio: Fix parsing descriptor of UAC2 processing unit
iio: st_pressure: st_accel: pass correct platform data to init
MIPS: ralink: Remove ralink_halt()
ANDROID: cpufreq: times: fix proc_time_in_state_show
dtc: turn off dtc unit address warnings by default
Linux 4.4.124
RDMA/ucma: Fix access to non-initialized CM_ID object
dmaengine: ti-dma-crossbar: Fix event mapping for TPCC_EVT_MUX_60_63
clk: si5351: Rename internal plls to avoid name collisions
nfsd4: permit layoutget of executable-only files
RDMA/ocrdma: Fix permissions for OCRDMA_RESET_STATS
ip6_vti: adjust vti mtu according to mtu of lower device
iommu/vt-d: clean up pr_irq if request_threaded_irq fails
pinctrl: Really force states during suspend/resume
coresight: Fix disabling of CoreSight TPIU
pty: cancel pty slave port buf's work in tty_release
drm/omap: DMM: Check for DMM readiness after successful transaction commit
vgacon: Set VGA struct resource types
IB/umem: Fix use of npages/nmap fields
RDMA/cma: Use correct size when writing netlink stats
IB/ipoib: Avoid memory leak if the SA returns a different DGID
mmc: avoid removing non-removable hosts during suspend
platform/chrome: Use proper protocol transfer function
cros_ec: fix nul-termination for firmware build info
media: [RESEND] media: dvb-frontends: Add delay to Si2168 restart
media: bt8xx: Fix err 'bt878_probe()'
rtlwifi: rtl_pci: Fix the bug when inactiveps is enabled.
RDMA/iwpm: Fix uninitialized error code in iwpm_send_mapinfo()
drm/msm: fix leak in failed get_pages
media: c8sectpfe: fix potential NULL pointer dereference in c8sectpfe_timer_interrupt
Bluetooth: hci_qca: Avoid setup failure on missing rampatch
perf tests kmod-path: Don't fail if compressed modules aren't supported
rtc: ds1374: wdt: Fix stop/start ioctl always returning -EINVAL
rtc: ds1374: wdt: Fix issue with timeout scaling from secs to wdt ticks
cifs: small underflow in cnvrtDosUnixTm()
net: hns: fix ethtool_get_strings overflow in hns driver
sm501fb: don't return zero on failure path in sm501fb_start()
video: fbdev: udlfb: Fix buffer on stack
tcm_fileio: Prevent information leak for short reads
ia64: fix module loading for gcc-5.4
md/raid10: skip spare disk as 'first' disk
Input: twl4030-pwrbutton - use correct device for irq request
power: supply: pda_power: move from timer to delayed_work
bnx2x: Align RX buffers
drm/nouveau/kms: Increase max retries in scanout position queries.
ACPI / PMIC: xpower: Fix power_table addresses
ipmi/watchdog: fix wdog hang on panic waiting for ipmi response
ARM: DRA7: clockdomain: Change the CLKTRCTRL of CM_PCIE_CLKSTCTRL to SW_WKUP
mmc: sdhci-of-esdhc: limit SD clock for ls1012a/ls1046a
staging: wilc1000: fix unchecked return value
staging: unisys: visorhba: fix s-Par to boot with option CONFIG_VMAP_STACK set to y
mtip32xx: use runtime tag to initialize command header
mfd: palmas: Reset the POWERHOLD mux during power off
mac80211: don't parse encrypted management frames in ieee80211_frame_acked
Btrfs: send, fix file hole not being preserved due to inline extent
rndis_wlan: add return value validation
mt7601u: check return value of alloc_skb
iio: st_pressure: st_accel: Initialise sensor platform data properly
NFS: don't try to cross a mountpount when there isn't one there.
infiniband/uverbs: Fix integer overflows
scsi: mac_esp: Replace bogus memory barrier with spinlock
qlcnic: fix unchecked return value
wan: pc300too: abort path on failure
mmc: host: omap_hsmmc: checking for NULL instead of IS_ERR()
openvswitch: Delete conntrack entry clashing with an expectation.
netfilter: xt_CT: fix refcnt leak on error path
Fix driver usage of 128B WQEs when WQ_CREATE is V1.
ASoC: Intel: Skylake: Uninitialized variable in probe_codec()
IB/mlx4: Change vma from shared to private
IB/mlx4: Take write semaphore when changing the vma struct
HSI: ssi_protocol: double free in ssip_pn_xmit()
IB/ipoib: Update broadcast object if PKey value was changed in index 0
IB/ipoib: Fix deadlock between ipoib_stop and mcast join flow
ALSA: hda - Fix headset microphone detection for ASUS N551 and N751
e1000e: fix timing for 82579 Gigabit Ethernet controller
tcp: remove poll() flakes with FastOpen
NFS: Fix missing pg_cleanup after nfs_pageio_cond_complete()
md/raid10: wait up frozen array in handle_write_completed
iommu/omap: Register driver before setting IOMMU ops
ARM: 8668/1: ftrace: Fix dynamic ftrace with DEBUG_RODATA and !FRAME_POINTER
KVM: PPC: Book3S PR: Exit KVM on failed mapping
scsi: virtio_scsi: Always try to read VPD pages
clk: ns2: Correct SDIO bits
ath: Fix updating radar flags for coutry code India
spi: dw: Disable clock after unregistering the host
media/dvb-core: Race condition when writing to CAM
net: ipv6: send unsolicited NA on admin up
i2c: i2c-scmi: add a MS HID
genirq: Use irqd_get_trigger_type to compare the trigger type for shared IRQs
cpufreq/sh: Replace racy task affinity logic
ACPI/processor: Replace racy task affinity logic
ACPI/processor: Fix error handling in __acpi_processor_start()
time: Change posix clocks ops interfaces to use timespec64
Input: ar1021_i2c - fix too long name in driver's device table
rtc: cmos: Do not assume irq 8 for rtc when there are no legacy irqs
x86: i8259: export legacy_pic symbol
regulator: anatop: set default voltage selector for pcie
platform/x86: asus-nb-wmi: Add wapf4 quirk for the X302UA
staging: android: ashmem: Fix possible deadlock in ashmem_ioctl
CIFS: Enable encryption during session setup phase
SMB3: Validate negotiate request must always be signed
tpm_tis: fix potential buffer overruns caused by bit glitches on the bus
tpm: fix potential buffer overruns caused by bit glitches on the bus
BACKPORT, FROMLIST: crypto: arm64/speck - add NEON-accelerated implementation of Speck-XTS
Linux 4.4.123
bpf: fix incorrect sign extension in check_alu_op()
usb: gadget: bdc: 64-bit pointer capability check
USB: gadget: udc: Add missing platform_device_put() on error in bdc_pci_probe()
btrfs: Fix use-after-free when cleaning up fs_devs with a single stale device
btrfs: alloc_chunk: fix DUP stripe size handling
ARM: dts: LogicPD Torpedo: Fix I2C1 pinmux
scsi: sg: only check for dxfer_len greater than 256M
scsi: sg: fix static checker warning in sg_is_valid_dxfer
scsi: sg: fix SG_DXFER_FROM_DEV transfers
irqchip/gic-v3-its: Ensure nr_ites >= nr_lpis
fs/aio: Use RCU accessors for kioctx_table->table[]
fs/aio: Add explicit RCU grace period when freeing kioctx
lock_parent() needs to recheck if dentry got __dentry_kill'ed under it
fs: Teach path_connected to handle nfs filesystems with multiple roots.
drm/amdgpu/dce: Don't turn off DP sink when disconnected
ALSA: seq: Clear client entry before deleting else at closing
ALSA: seq: Fix possible UAF in snd_seq_check_queue()
ALSA: hda - Revert power_save option default value
ALSA: pcm: Fix UAF in snd_pcm_oss_get_formats()
x86/mm: Fix vmalloc_fault to use pXd_large
x86/vm86/32: Fix POPF emulation
selftests/x86/entry_from_vm86: Add test cases for POPF
selftests/x86: Add tests for the STR and SLDT instructions
selftests/x86: Add tests for User-Mode Instruction Prevention
selftests/x86/entry_from_vm86: Exit with 1 if we fail
ima: relax requiring a file signature for new files with zero length
rcutorture/configinit: Fix build directory error message
ipvlan: add L2 check for packets arriving via virtual devices
ASoC: nuc900: Fix a loop timeout test
mac80211: remove BUG() when interface type is invalid
mac80211_hwsim: enforce PS_MANUAL_POLL to be set after PS_ENABLED
agp/intel: Flush all chipset writes after updating the GGTT
drm/amdkfd: Fix memory leaks in kfd topology
veth: set peer GSO values
media: cpia2: Fix a couple off by one bugs
scsi: dh: add new rdac devices
scsi: devinfo: apply to HP XP the same flags as Hitachi VSP
scsi: core: scsi_get_device_flags_keyed(): Always return device flags
spi: sun6i: disable/unprepare clocks on remove
tools/usbip: fixes build with musl libc toolchain
ath10k: fix invalid STS_CAP_OFFSET_MASK
clk: qcom: msm8916: fix mnd_width for codec_digcodec
cpufreq: Fix governor module removal race
ath10k: update tdls teardown state to target
ARM: dts: omap3-n900: Fix the audio CODEC's reset pin
ARM: dts: am335x-pepper: Fix the audio CODEC's reset pin
mtd: nand: fix interpretation of NAND_CMD_NONE in nand_command[_lp]()
net: xfrm: allow clearing socket xfrm policies.
test_firmware: fix setting old custom fw path back on exit
sched: Stop resched_cpu() from sending IPIs to offline CPUs
sched: Stop switched_to_rt() from sending IPIs to offline CPUs
ARM: dts: exynos: Correct Trats2 panel reset line
HID: elo: clear BTN_LEFT mapping
video/hdmi: Allow "empty" HDMI infoframes
drm/edid: set ELD connector type in drm_edid_to_eld()
wil6210: fix memory access violation in wil_memcpy_from/toio_32
pwm: tegra: Increase precision in PWM rate calculation
kprobes/x86: Set kprobes pages read-only
kprobes/x86: Fix kprobe-booster not to boost far call instructions
scsi: sg: close race condition in sg_remove_sfp_usercontext()
scsi: sg: check for valid direction before starting the request
perf session: Don't rely on evlist in pipe mode
perf inject: Copy events when reordering events in pipe mode
drivers/perf: arm_pmu: handle no platform_device
usb: gadget: dummy_hcd: Fix wrong power status bit clear/reset in dummy_hub_control()
usb: dwc2: Make sure we disconnect the gadget state
md/raid6: Fix anomily when recovering a single device in RAID6.
regulator: isl9305: fix array size
MIPS: r2-on-r6-emu: Clear BLTZALL and BGEZALL debugfs counters
MIPS: r2-on-r6-emu: Fix BLEZL and BGTZL identification
MIPS: BPF: Fix multiple problems in JIT skb access helpers.
MIPS: BPF: Quit clobbering callee saved registers in JIT code.
coresight: Fixes coresight DT parse to get correct output port ID.
drm/amdgpu: Fail fb creation from imported dma-bufs. (v2)
drm/radeon: Fail fb creation from imported dma-bufs.
video: ARM CLCD: fix dma allocation size
iommu/iova: Fix underflow bug in __alloc_and_insert_iova_range
apparmor: Make path_max parameter readonly
scsi: ses: don't get power status of SES device slot on probe
fm10k: correctly check if interface is removed
ALSA: firewire-digi00x: handle all MIDI messages on streaming packets
reiserfs: Make cancel_old_flush() reliable
ARM: dts: koelsch: Correct clock frequency of X2 DU clock input
net/faraday: Add missing include of of.h
powerpc: Avoid taking a data miss on every userspace instruction miss
ARM: dts: r8a7791: Correct parent of SSI[0-9] clocks
ARM: dts: r8a7790: Correct parent of SSI[0-9] clocks
NFC: nfcmrvl: double free on error path
NFC: nfcmrvl: Include unaligned.h instead of access_ok.h
vxlan: vxlan dev should inherit lowerdev's gso_max_size
drm/vmwgfx: Fixes to vmwgfx_fb
braille-console: Fix value returned by _braille_console_setup
bonding: refine bond_fold_stats() wrap detection
f2fs: relax node version check for victim data in gc
blk-throttle: make sure expire time isn't too big
mm: Fix false-positive VM_BUG_ON() in page_cache_{get,add}_speculative()
driver: (adm1275) set the m,b and R coefficients correctly for power
dmaengine: imx-sdma: add 1ms delay to ensure SDMA channel is stopped
tcp: sysctl: Fix a race to avoid unexpected 0 window from space
spi: omap2-mcspi: poll OMAP2_MCSPI_CHSTAT_RXS for PIO transfer
ASoC: rcar: ssi: don't set SSICR.CKDV = 000 with SSIWSR.CONT
sched: act_csum: don't mangle TCP and UDP GSO packets
Input: qt1070 - add OF device ID table
sysrq: Reset the watchdog timers while displaying high-resolution timers
timers, sched_clock: Update timeout for clock wrap
media: i2c/soc_camera: fix ov6650 sensor getting wrong clock
scsi: ipr: Fix missed EH wakeup
solo6x10: release vb2 buffers in solo_stop_streaming()
of: fix of_device_get_modalias returned length when truncating buffers
batman-adv: handle race condition for claims between gateways
ARM: dts: Adjust moxart IRQ controller and flags
net/8021q: create device with all possible features in wanted_features
HID: clamp input to logical range if no null state
perf probe: Return errno when not hitting any event
ath10k: disallow DFS simulation if DFS channel is not enabled
drm: Defer disabling the vblank IRQ until the next interrupt (for instant-off)
drivers: net: xgene: Fix hardware checksum setting
perf tools: Make perf_event__synthesize_mmap_events() scale
i40e: fix ethtool to get EEPROM data from X722 interface
i40e: Acquire NVM lock before reads on all devices
perf sort: Fix segfault with basic block 'cycles' sort dimension
selinux: check for address length in selinux_socket_bind()
PCI/MSI: Stop disabling MSI/MSI-X in pci_device_shutdown()
ath10k: fix a warning during channel switch with multiple vaps
drm: qxl: Don't alloc fbdev if emulation is not supported
HID: reject input outside logical range only if null state is set
staging: wilc1000: add check for kmalloc allocation failure.
staging: speakup: Replace BUG_ON() with WARN_ON().
Input: tsc2007 - check for presence and power down tsc2007 during probe
blkcg: fix double free of new_blkg in blkcg_init_queue
ANDROID: cpufreq: times: avoid prematurely freeing uid_entry
ANDROID: Use standard logging functions in goldfish_pipe
ANDROID: Fix whitespace in goldfish
staging: android: ashmem: Fix possible deadlock in ashmem_ioctl
llist: clang: introduce member_address_is_nonnull()
Linux 4.4.122
fixup: sctp: verify size of a new chunk in _sctp_make_chunk()
serial: 8250_pci: Add Brainboxes UC-260 4 port serial device
usb: gadget: f_fs: Fix use-after-free in ffs_fs_kill_sb()
usb: usbmon: Read text within supplied buffer size
USB: usbmon: remove assignment from IS_ERR argument
usb: quirks: add control message delay for 1b1c:1b20
USB: storage: Add JMicron bridge 152d:2567 to unusual_devs.h
staging: android: ashmem: Fix lockdep issue during llseek
staging: comedi: fix comedi_nsamples_left.
uas: fix comparison for error code
tty/serial: atmel: add new version check for usart
serial: sh-sci: prevent lockup on full TTY buffers
x86: Treat R_X86_64_PLT32 as R_X86_64_PC32
x86/module: Detect and skip invalid relocations
Revert "ARM: dts: LogicPD Torpedo: Fix I2C1 pinmux"
NFS: Fix an incorrect type in struct nfs_direct_req
scsi: qla2xxx: Replace fcport alloc with qla2x00_alloc_fcport
ubi: Fix race condition between ubi volume creation and udev
ext4: inplace xattr block update fails to deduplicate blocks
netfilter: x_tables: pack percpu counter allocations
netfilter: x_tables: pass xt_counters struct to counter allocator
netfilter: x_tables: pass xt_counters struct instead of packet counter
netfilter: use skb_to_full_sk in ip_route_me_harder
netfilter: ipv6: fix use-after-free Write in nf_nat_ipv6_manip_pkt
netfilter: bridge: ebt_among: add missing match size checks
netfilter: ebtables: CONFIG_COMPAT: don't trust userland offsets
netfilter: IDLETIMER: be syzkaller friendly
netfilter: nat: cope with negative port range
netfilter: x_tables: fix missing timer initialization in xt_LED
netfilter: add back stackpointer size checks
tc358743: fix register i2c_rd/wr function fix
Input: tca8418_keypad - remove double read of key event register
ARM: omap2: hide omap3_save_secure_ram on non-OMAP3 builds
netfilter: nfnetlink_queue: fix timestamp attribute
watchdog: hpwdt: fix unused variable warning
watchdog: hpwdt: Check source of NMI
watchdog: hpwdt: SMBIOS check
nospec: Include <asm/barrier.h> dependency
ALSA: hda: add dock and led support for HP ProBook 640 G2
ALSA: hda: add dock and led support for HP EliteBook 820 G3
ALSA: seq: More protection for concurrent write and ioctl races
ALSA: seq: Don't allow resizing pool in use
ALSA: hda/realtek - Fix dock line-out volume on Dell Precision 7520
x86/MCE: Serialize sysfs changes
bcache: don't attach backing with duplicate UUID
kbuild: Handle builtin dtb file names containing hyphens
loop: Fix lost writes caused by missing flag
Input: matrix_keypad - fix race when disabling interrupts
MIPS: OCTEON: irq: Check for null return on kzalloc allocation
MIPS: ath25: Check for kzalloc allocation failure
MIPS: BMIPS: Do not mask IPIs during suspend
drm/amdgpu: fix KV harvesting
drm/radeon: fix KV harvesting
drm/amdgpu: Notify sbios device ready before send request
drm/amdgpu: Fix deadlock on runtime suspend
drm/radeon: Fix deadlock on runtime suspend
drm/nouveau: Fix deadlock on runtime suspend
drm: Allow determining if current task is output poll worker
workqueue: Allow retrieval of current task's work struct
scsi: qla2xxx: Fix NULL pointer crash due to active timer for ABTS
RDMA/mlx5: Fix integer overflow while resizing CQ
RDMA/ucma: Check that user doesn't overflow QP state
RDMA/ucma: Limit possible option size
ANDROID: ranchu: 32 bit framebuffer support
ANDROID: Address checkpatch warnings in goldfishfb
ANDROID: Address checkpatch.pl warnings in goldfish_pipe
ANDROID: sdcardfs: fix lock issue on 32 bit/SMP architectures
ANDROID: goldfish: Fix typo in goldfish_cmd_locked() call
ANDROID: Address checkpatch.pl warnings in goldfish_pipe_v2
FROMLIST: f2fs: don't put dentry page in pagecache into highmem
Linux 4.4.121
btrfs: preserve i_mode if __btrfs_set_acl() fails
bpf, x64: implement retpoline for tail call
dm io: fix duplicate bio completion due to missing ref count
mpls, nospec: Sanitize array index in mpls_label_ok()
net: mpls: Pull common label check into helper
sctp: verify size of a new chunk in _sctp_make_chunk()
s390/qeth: fix IPA command submission race
s390/qeth: fix SETIP command handling
sctp: fix dst refcnt leak in sctp_v6_get_dst()
sctp: fix dst refcnt leak in sctp_v4_get_dst
udplite: fix partial checksum initialization
ppp: prevent unregistered channels from connecting to PPP units
netlink: ensure to loop over all netns in genlmsg_multicast_allns()
net: ipv4: don't allow setting net.ipv4.route.min_pmtu below 68
net: fix race on decreasing number of TX queues
ipv6 sit: work around bogus gcc-8 -Wrestrict warning
hdlc_ppp: carrier detect ok, don't turn off negotiation
fib_semantics: Don't match route with mismatching tclassid
bridge: check brport attr show in brport_show
Revert "led: core: Fix brightness setting when setting delay_off=0"
x86/spectre: Fix an error message
leds: do not overflow sysfs buffer in led_trigger_show
x86/apic/vector: Handle legacy irq data correctly
ARM: dts: LogicPD Torpedo: Fix I2C1 pinmux
btrfs: Don't clear SGID when inheriting ACLs
x86/syscall: Sanitize syscall table de-references under speculation fix
KVM: mmu: Fix overlap between public and private memslots
ARM: mvebu: Fix broken PL310_ERRATA_753970 selects
nospec: Allow index argument to have const-qualified type
media: m88ds3103: don't call a non-initalized function
cpufreq: s3c24xx: Fix broken s3c_cpufreq_init()
ALSA: hda: Add a power_save blacklist
ALSA: usb-audio: Add a quirck for B&W PX headphones
tpm_i2c_nuvoton: fix potential buffer overruns caused by bit glitches on the bus
tpm_i2c_infineon: fix potential buffer overruns caused by bit glitches on the bus
tpm: st33zp24: fix potential buffer overruns caused by bit glitches on the bus
ANDROID: Delete the goldfish_nand driver.
ANDROID: Add input support for Android Wear.
ANDROID: proc: fix config & includes for /proc/uid
FROMLIST: ARM: amba: Don't read past the end of sysfs "driver_override" buffer
UPSTREAM: ANDROID: binder: remove WARN() for redundant txn error
ANDROID: cpufreq: times: Add missing includes
ANDROID: cpufreq: Add time_in_state to /proc/uid directories
ANDROID: proc: Add /proc/uid directory
ANDROID: cpufreq: times: track per-uid time in state
ANDROID: cpufreq: track per-task time in state
Conflicts:
drivers/gpu/drm/msm/msm_gem.c
drivers/net/wireless/ath/regd.c
kernel/sched/core.c
Change-Id: I9bb7b5a062415da6925a5a56a34e6eb066a53320
Signed-off-by: Srinivasarao P <spathi@codeaurora.org>
commit 9b54d816e00425c3a517514e0d677bb3cec49258 upstream.
If blkg_create fails, new_blkg passed as an argument will
be freed by blkg_create, so there is no need to free it again.
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
We will want to have struct backing_dev_info allocated separately from
struct request_queue. As the first step add pointer to backing_dev_info
to request_queue and convert all users touching it. No functional
changes in this patch.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
Change-Id: I77fbb181de7e39c83fbfba8cfb128d6ace161f31
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git
Git-commit: 97419acd22a0bacc52dbc34d5bbc96d315e48acb
[riteshh@codeaurora.org: resolved merge conflicts]
Signed-off-by: Ritesh Harjani <riteshh@codeaurora.org>
commit 39a169b62b415390398291080dafe63aec751e0a upstream.
get_disk(),get_gendisk() calls have non explicit side effect: they
increase the reference on the disk owner module.
The following is the correct sequence how to get a disk reference and
to put it:
disk = get_gendisk(...);
/* use disk */
owner = disk->fops->owner;
put_disk(disk);
module_put(owner);
fs/block_dev.c is aware of this required module_put() call, but f.e.
blkg_conf_finish(), which is located in block/blk-cgroup.c, does not put
a module reference. To see a leakage in action cgroups throttle config
can be used. In the following script I'm removing throttle for /dev/ram0
(actually this is NOP, because throttle was never set for this device):
# lsmod | grep brd
brd 5175 0
# i=100; while [ $i -gt 0 ]; do echo "1:0 0" > \
/sys/fs/cgroup/blkio/blkio.throttle.read_bps_device; i=$(($i - 1)); \
done
# lsmod | grep brd
brd 5175 100
Now brd module has 100 references.
The issue is fixed by calling module_put() just right away put_disk().
Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Cc: Gi-Oh Kim <gi-oh.kim@profitbricks.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
get_disk(),get_gendisk() calls have non explicit side effect: they
increase the reference on the disk owner module.
The following is the correct sequence how to get a disk reference and
to put it:
disk = get_gendisk(...);
/* use disk */
owner = disk->fops->owner;
put_disk(disk);
module_put(owner);
fs/block_dev.c is aware of this required module_put() call, but f.e.
blkg_conf_finish(), which is located in block/blk-cgroup.c, does not put
a module reference. To see a leakage in action cgroups throttle config
can be used. In the following script I'm removing throttle for /dev/ram0
(actually this is NOP, because throttle was never set for this device):
# lsmod | grep brd
brd 5175 0
# i=100; while [ $i -gt 0 ]; do echo "1:0 0" > \
/sys/fs/cgroup/blkio/blkio.throttle.read_bps_device; i=$(($i - 1)); \
done
# lsmod | grep brd
brd 5175 100
Now brd module has 100 references.
The issue is fixed by calling module_put() just right away put_disk().
Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Cc: Gi-Oh Kim <gi-oh.kim@profitbricks.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit 39a169b62b415390398291080dafe63aec751e0a)
Signed-off-by: Alex Shi <alex.shi@linaro.org>
Consider the following v2 hierarchy.
P0 (+memory) --- P1 (-memory) --- A
\- B
P0 has memory enabled in its subtree_control while P1 doesn't. If
both A and B contain processes, they would belong to the memory css of
P1. Now if memory is enabled on P1's subtree_control, memory csses
should be created on both A and B and A's processes should be moved to
the former and B's processes the latter. IOW, enabling controllers
can cause atomic migrations into different csses.
The core cgroup migration logic has been updated accordingly but the
controller migration methods haven't and still assume that all tasks
migrate to a single target css; furthermore, the methods were fed the
css in which subtree_control was updated which is the parent of the
target csses. pids controller depends on the migration methods to
move charges and this made the controller attribute charges to the
wrong csses often triggering the following warning by driving a
counter negative.
WARNING: CPU: 1 PID: 1 at kernel/cgroup_pids.c:97 pids_cancel.constprop.6+0x31/0x40()
Modules linked in:
CPU: 1 PID: 1 Comm: systemd Not tainted 4.4.0-rc1+ #29
...
ffffffff81f65382 ffff88007c043b90 ffffffff81551ffc 0000000000000000
ffff88007c043bc8 ffffffff810de202 ffff88007a752000 ffff88007a29ab00
ffff88007c043c80 ffff88007a1d8400 0000000000000001 ffff88007c043bd8
Call Trace:
[<ffffffff81551ffc>] dump_stack+0x4e/0x82
[<ffffffff810de202>] warn_slowpath_common+0x82/0xc0
[<ffffffff810de2fa>] warn_slowpath_null+0x1a/0x20
[<ffffffff8118e031>] pids_cancel.constprop.6+0x31/0x40
[<ffffffff8118e0fd>] pids_can_attach+0x6d/0xf0
[<ffffffff81188a4c>] cgroup_taskset_migrate+0x6c/0x330
[<ffffffff81188e05>] cgroup_migrate+0xf5/0x190
[<ffffffff81189016>] cgroup_attach_task+0x176/0x200
[<ffffffff8118949d>] __cgroup_procs_write+0x2ad/0x460
[<ffffffff81189684>] cgroup_procs_write+0x14/0x20
[<ffffffff811854e5>] cgroup_file_write+0x35/0x1c0
[<ffffffff812e26f1>] kernfs_fop_write+0x141/0x190
[<ffffffff81265f88>] __vfs_write+0x28/0xe0
[<ffffffff812666fc>] vfs_write+0xac/0x1a0
[<ffffffff81267019>] SyS_write+0x49/0xb0
[<ffffffff81bcef32>] entry_SYSCALL_64_fastpath+0x12/0x76
This patch fixes the bug by removing @css parameter from the three
migration methods, ->can_attach, ->cancel_attach() and ->attach() and
updating cgroup_taskset iteration helpers also return the destination
css in addition to the task being migrated. All controllers are
updated accordingly.
* Controllers which don't care whether there are one or multiple
target csses can be converted trivially. cpu, io, freezer, perf,
netclassid and netprio fall in this category.
* cpuset's current implementation assumes that there's single source
and destination and thus doesn't support v2 hierarchy already. The
only change made by this patchset is how that single destination css
is obtained.
* memory migration path already doesn't do anything on v2. How the
single destination css is obtained is updated and the prep stage of
mem_cgroup_can_attach() is reordered to accomodate the change.
* pids is the only controller which was affected by this bug. It now
correctly handles multi-destination migrations and no longer causes
counter underflow from incorrect accounting.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Pull cgroup updates from Tejun Heo:
"The cgroup core saw several significant updates this cycle:
- percpu_rwsem for threadgroup locking is reinstated. This was
temporarily dropped due to down_write latency issues. Oleg's
rework of percpu_rwsem which is scheduled to be merged in this
merge window resolves the issue.
- On the v2 hierarchy, when controllers are enabled and disabled, all
operations are atomic and can fail and revert cleanly. This allows
->can_attach() failure which is necessary for cpu RT slices.
- Tasks now stay associated with the original cgroups after exit
until released. This allows tracking resources held by zombies
(e.g. pids) and makes it easy to find out where zombies came from
on the v2 hierarchy. The pids controller was broken before these
changes as zombies escaped the limits; unfortunately, updating this
behavior required too many invasive changes and I don't think it's
a good idea to backport them, so the pids controller on 4.3, the
first version which included the pids controller, will stay broken
at least until I'm sure about the cgroup core changes.
- Optimization of a couple common tests using static_key"
* 'for-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (38 commits)
cgroup: fix race condition around termination check in css_task_iter_next()
blkcg: don't create "io.stat" on the root cgroup
cgroup: drop cgroup__DEVEL__legacy_files_on_dfl
cgroup: replace error handling in cgroup_init() with WARN_ON()s
cgroup: add cgroup_subsys->free() method and use it to fix pids controller
cgroup: keep zombies associated with their original cgroups
cgroup: make css_set_rwsem a spinlock and rename it to css_set_lock
cgroup: don't hold css_set_rwsem across css task iteration
cgroup: reorganize css_task_iter functions
cgroup: factor out css_set_move_task()
cgroup: keep css_set and task lists in chronological order
cgroup: make cgroup_destroy_locked() test cgroup_is_populated()
cgroup: make css_sets pin the associated cgroups
cgroup: relocate cgroup_[try]get/put()
cgroup: move check_for_release() invocation
cgroup: replace cgroup_has_tasks() with cgroup_is_populated()
cgroup: make cgroup->nr_populated count the number of populated css_sets
cgroup: remove an unused parameter from cgroup_task_migrate()
cgroup: fix too early usage of static_branch_disable()
cgroup: make cgroup_update_dfl_csses() migrate all target processes atomically
...
The stat files on the root cgroup shows stats for the whole system and
usually don't contain any information which isn't available through
the usual system monitoring mechanisms. Some controllers skip
collecting these duplicate stats to optimize cases where cgroup isn't
used and later try to emulate the result on demand.
This leads to complexities and subtle differences in the information
shown through different channels. This is entirely unnecessary and
cgroup v2 is dropping stat files which are duplicate from all
controllers. This patch removes "io.stat" from the root hierarchy.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Jens Axboe <axboe@kernel.dk>
Cc: Vivek Goyal <vgoyal@redhat.com>
Pull block updates from Jens Axboe:
"This is a bit bigger than it should be, but I could (did) not want to
send it off last week due to both wanting extra testing, and expecting
a fix for the bounce regression as well. In any case, this contains:
- Fix for the blk-merge.c compilation warning on gcc 5.x from me.
- A set of back/front SG gap merge fixes, from me and from Sagi.
This ensures that we honor SG gapping for integrity payloads as
well.
- Two small fixes for null_blk from Matias, fixing a leak and a
capacity propagation issue.
- A blkcg fix from Tejun, fixing a NULL dereference.
- A fast clone optimization from Ming, fixing a performance
regression since the arbitrarily sized bio's were introduced.
- Also from Ming, a regression fix for bouncing IOs"
* 'for-linus' of git://git.kernel.dk/linux-block:
block: fix bounce_end_io
block: blk-merge: fast-clone bio when splitting rw bios
block: blkg_destroy_all() should clear q->root_blkg and ->root_rl.blkg
block: Copy a user iovec if it includes gaps
block: Refuse adding appending a gapped integrity page to a bio
block: Refuse request/bio merges with gaps in the integrity payload
block: Check for gaps on front and back merges
null_blk: fix wrong capacity when bs is not 512 bytes
null_blk: fix memory leak on cleanup
block: fix bogus compiler warnings in blk-merge.c
While making the root blkg unconditional, ec13b1d6f0 ("blkcg: always
create the blkcg_gq for the root blkcg") removed the part which clears
q->root_blkg and ->root_rl.blkg during q exit. This leaves the two
pointers dangling after blkg_destroy_all(). blk-throttle exit path
performs blkg traversals and dereferences ->root_blkg and can lead to
the following oops.
BUG: unable to handle kernel NULL pointer dereference at 0000000000000558
IP: [<ffffffff81389746>] __blkg_lookup+0x26/0x70
...
task: ffff88001b4e2580 ti: ffff88001ac0c000 task.ti: ffff88001ac0c000
RIP: 0010:[<ffffffff81389746>] [<ffffffff81389746>] __blkg_lookup+0x26/0x70
...
Call Trace:
[<ffffffff8138d14a>] blk_throtl_drain+0x5a/0x110
[<ffffffff8138a108>] blkcg_drain_queue+0x18/0x20
[<ffffffff81369a70>] __blk_drain_queue+0xc0/0x170
[<ffffffff8136a101>] blk_queue_bypass_start+0x61/0x80
[<ffffffff81388c59>] blkcg_deactivate_policy+0x39/0x100
[<ffffffff8138d328>] blk_throtl_exit+0x38/0x50
[<ffffffff8138a14e>] blkcg_exit_queue+0x3e/0x50
[<ffffffff8137016e>] blk_release_queue+0x1e/0xc0
...
While the bug is a straigh-forward use-after-free bug, it is tricky to
reproduce because blkg release is RCU protected and the rest of exit
path usually finishes before RCU grace period.
This patch fixes the bug by updating blkg_destro_all() to clear
q->root_blkg and ->root_rl.blkg.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: "Richard W.M. Jones" <rjones@redhat.com>
Reported-by: Josh Boyer <jwboyer@fedoraproject.org>
Link: http://lkml.kernel.org/g/CA+5PVA5rzQ0s4723n5rHBcxQa9t0cW8BPPBekr_9aMRoWt2aYg@mail.gmail.com
Fixes: ec13b1d6f0 ("blkcg: always create the blkcg_gq for the root blkcg")
Cc: stable@vger.kernel.org # v4.2+
Tested-by: Richard W.M. Jones <rjones@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
cgroup is trying to make interface consistent across different
controllers. For weight based resource control, the knob should have
the range [1, 10000] and default to 100. This patch updates
cfq-iosched so that the weight range conforms. The internal
calculations have enough range and the widening of the weight range
shouldn't cause any problem.
* blkcg_policy->cpd_bind_fn() is added. If present, this is invoked
when blkcg is attached to a hierarchy.
* cfq_cpd_init() is updated to use the new default value on the
unified hierarchy.
* cfq_cpd_bind() callback is implemented to clear per-blkg configs and
apply the default config matching the hierarchy type.
* cfqd->root_group->[leaf_]weight initialization in cfq_init_queue()
is moved into !CONFIG_CFQ_GROUP_IOSCHED block. cfq_cpd_bind() is
now responsible for initializing the initial weights when blkcg is
enabled.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
blkcg interface grew to be the biggest of all controllers and
unfortunately most inconsistent too. The interface files are
inconsistent with a number of cloes duplicates. Some files have
recursive variants while others don't. There's distinction between
normal and leaf weights which isn't intuitive and there are a lot of
stat knobs which don't make much sense outside of debugging and expose
too much implementation details to userland.
In the unified hierarchy, everything is always hierarchical and
internal nodes can't have tasks rendering the two structural issues
twisting the current interface. The interface has to be updated in a
significant anyway and this is a good chance to revamp it as a whole.
This patch implements blkcg interface for the unified hierarchy.
* (from a previous patch) blkcg is identified by "io" instead of
"blkio" on the unified hierarchy. Given that the whole interface is
updated anyway, the rename shouldn't carry noticeable conversion
overhead.
* The original interface consisted of 27 files is replaced with the
following three files.
blkio.stat : per-blkcg stats
blkio.weight : per-cgroup and per-cgroup-queue weight settings
blkio.max : per-cgroup-queue bps and iops max limits
Documentation/cgroups/unified-hierarchy.txt updated accordingly.
v2: blkcg_policy->dfl_cftypes wasn't removed on
blkcg_policy_unregister() corrupting the cftypes list. Fixed.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
Currently, blkg_conf_prep() expects input to be of the following form
MAJ:MIN NUM
and reads the NUM part into blkg_conf_ctx->v. This is quite
restrictive and gets in the way in implementing blkcg interface for
the unified hierarchy. This patch updates blkg_conf_prep() so that it
expects
MAJ:MIN BODY_STR
where BODY_STR is an arbitrary string. blkg_conf_ctx->v is replaced
with ->body which is a char pointer pointing to the start of BODY_STR.
Parsing of the body is moved to blkg_conf_prep()'s callers.
To allow using, for example, strsep() on blkg_conf_ctx->val, it is a
non-const pointer and to accommodate that const is dropped from @input
too.
This doesn't cause any behavior changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
blkio interface has become messy over time and is currently the
largest. In addition to the inconsistent naming scheme, it has
multiple stat files which report more or less the same thing, a number
of debug stat files which expose internal details which shouldn't have
been part of the public interface in the first place, recursive and
non-recursive stats and leaf and non-leaf knobs.
Both recursive vs. non-recursive and leaf vs. non-leaf distinctions
don't make any sense on the unified hierarchy as only leaf cgroups can
contain processes. cgroups is going through a major interface
revision with the unified hierarchy involving significant fundamental
usage changes and given that a significant portion of the interface
doesn't make sense anymore, it's a good time to reorganize the
interface.
As the first step, this patch renames the external visible subsystem
name from "blkio" to "io". This is more concise, matches the other
two major subsystem names, "cpu" and "memory", and better suited as
blkcg will be involved in anything writeback related too whether an
actual block device is involved or not.
As the subsystem legacy_name is set to "blkio", the only userland
visible change outside the unified hierarchy is that blkcg is reported
as "io" instead of "blkio" in the subsystem initialized message during
boot. On the unified hierarchy, blkcg now appears as "io".
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: cgroups@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
blkcg currently returns -EINVAL for most errors which can be pretty
confusing given that the failure modes are quite varied. Update the
error returns so that
* -EINVAL only for syntactic errors.
* -ERANGE if the value is out of range.
* -ENODEV if the target device can't be found.
* -EOPNOTSUPP if the policy is not enabled on the target device.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
The recent percpu conversion of blkg_rwstat triggered the following
warning in certain configurations.
block/blk-cgroup.c:654:1: warning: the frame size of 1360 bytes is larger than 1024 bytes
This is because blkg_rwstat now contains four percpu_counter which can
be pretty big depending on debug options although it shouldn't be a
problem in production configs. This patch removes one of the two
local blkg_rwstat variables used by blkg_rwstat_recursive_sum() to
reduce stack usage.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Link: http://article.gmane.org/gmane.linux.kernel.cgroups/13835
Signed-off-by: Jens Axboe <axboe@fb.com>
Currently, both cfq-iosched and blk-throttle keep track of
io_service_bytes and io_serviced stats. While keeping track of them
separately may be useful during development, it doesn't make much
sense otherwise. Also, blk-throttle was counting bio's as IOs while
cfq-iosched request's, which is more confusing than informative.
This patch adds ->stat_bytes and ->stat_ios to blkg (blkcg_gq),
removes the counterparts from cfq-iosched and blk-throttle and let
them print from the common blkg counters. The common counters are
incremented during bio issue in blkcg_bio_issue_check().
The outputs are still filtered by whether the policy has
blkg_policy_data on a given blkg, so cfq's output won't show up if it
has never been used for a given blkg. The only times when the outputs
would differ significantly are when policies are attached on the fly
or elevators are switched back and forth. Those are quite exceptional
operations and I don't think they warrant keeping separate counters.
v3: Update blkio-controller.txt accordingly.
v2: Account IOs during bio issues instead of request completions so
that bio-based drivers can be handled the same way.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Currently, blkg_[rw]stat_recursive_sum() assume that the target
counter is located in pd (blkg_policy_data); however, some counters
are planned to be moved to blkg (blkcg_gq).
This patch updates blkg_[rw]stat_recursive_sum() to take blkg and
blkg_policy pointers instead of pd. If policy is NULL, it indexes
into blkg. If non-NULL, into the blkg's pd of the policy.
The existing usages are updated to maintain the current behaviors.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
blkcg_[rw]stat are used as stat counters for blkcg policies. It isn't
per-cpu by itself and blk-throttle makes it per-cpu by wrapping around
it. This patch makes blkcg_[rw]stat per-cpu and drop the ad-hoc
per-cpu wrapping in blk-throttle.
* blkg_[rw]stat->cnt is replaced with cpu_cnt which is struct
percpu_counter. This makes syncp unnecessary as remote accesses are
handled by percpu_counter itself.
* blkg_[rw]stat_init() can now fail due to percpu allocation failure
and thus are updated to return int.
* percpu_counters need explicit freeing. blkg_[rw]stat_exit() added.
* As blkg_rwstat->cpu_cnt[] can't be read directly anymore, reading
and summing results are stored in ->aux_cnt[] instead.
* Custom per-cpu stat implementation in blk-throttle is removed.
This makes all blkcg stat counters per-cpu without complicating policy
implmentations.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
cgroup stats are local to each cgroup and doesn't propagate to
ancestors by default. When recursive stats are necessary, the sum is
calculated over all the descendants. This initially was for backward
compatibility to support both group-local and recursive stats but this
mode of operation makes general sense as stat update is much hotter
thafn reporting those stats.
This however ends up losing recursive stats when a child is removed.
To work around this, cfq-iosched adds its stats to its parent
cfq_group->dead_stats which is summed up together when calculating
recursive stats.
It's planned that the core stats will be moved to blkcg_gq, so we want
to move the mechanism for keeping track of the stats of dead children
from cfq to blkcg core. This patch adds blkg_[rw]stat->aux_cnt which
are atomic64_t's keeping track of auxiliary counts which are excluded
when reading local counts but included for recursive.
blkg_[rw]stat_merge() which were used by cfq to implement dead_stats
are replaced by blkg_[rw]stat_add_aux(), and cfq now forwards stats of
a dead cgroup to the aux counts of parent->stats instead of separate
->dead_stats.
This will also help making blkg_[rw]stats per-cpu.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
blkg (blkcg_gq) currently is created by blkcg policies invoking
blkg_lookup_create() which ends up repeating about the same code in
different policies. Theoretically, this can avoid the overhead of
looking and/or creating blkg's if blkcg is enabled but no policy is in
use; however, the cost of blkg lookup / creation is very low
especially if only the root blkcg is in use which is highly likely if
no blkcg policy is in active use - it boils down to a single very
predictable conditional and surrounding RCU protection.
This patch consolidates blkg creation to a new function
blkcg_bio_issue_check() which is called during bio issue from
generic_make_request_checks(). blkcg_bio_issue_check() is now the
only function which tries to create missing blkg's. The subsequent
policy and request_list operations just perform blkg_lookup() and if
missing falls back to the root.
* blk_get_rl() no longer tries to create blkg. It uses blkg_lookup()
instead of blkg_lookup_create().
* blk_throtl_bio() is now called from blkcg_bio_issue_check() with rcu
read locked and blkg already looked up. Both throtl_lookup_tg() and
throtl_lookup_create_tg() are dropped.
* cfq is similarly updated. cfq_lookup_create_cfqg() is replaced with
cfq_lookup_cfqg()which uses blkg_lookup().
This consolidates blkg handling and avoids unnecessary blkg creation
retries under memory pressure. In addition, this provides a common
bio entry point into blkcg where things like common accounting can be
performed.
v2: Build fixes for !CONFIG_CFQ_GROUP_IOSCHED and
!CONFIG_BLK_DEV_THROTTLING.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
blkg_lookup() checks whether the target queue is bypassing and, if
not, calls __blkg_lookup() which first checks the lookup hint and then
performs radix tree walk. The operations upto hint checking are
trivial and there are many users of this function. This patch inlines
blkg_lookup() and the fast path part of __blkg_lookup(). The radix
tree lookup and hint update are now in blkg_lookup_slowpath().
This will help consolidating blkg handling by easing moving root blkcg
short-circuit to inlined lookup fast path.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Each active policy has a cpd (blkcg_policy_data) on each blkcg. The
cpd's were allocated by blkcg core and each policy could request to
allocate extra space at the end by setting blkcg_policy->cpd_size
larger than the size of cpd.
This is a bit unusual but blkg (blkcg_gq) policy data used to be
handled this way too so it made sense to be consistent; however, blkg
policy data switched to alloc/free callbacks.
This patch makes similar changes to cpd handling.
blkcg_policy->cpd_alloc/free_fn() are added to replace ->cpd_size. As
cpd allocation is now done from policy side, it can simply allocate a
larger area which embeds cpd at the beginning.
As ->cpd_alloc_fn() may be able to perform all necessary
initializations, this patch makes ->cpd_init_fn() optional.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
* Rename blkcg->pd[] to blkcg->cpd[] so that cpd is consistently used
for blkcg_policy_data.
* Make blkcg_policy->cpd_init_fn() take blkcg_policy_data instead of
blkcg. This makes it consistent with blkg_policy_data methods and
to-be-added cpd alloc/free methods.
* blkcg_policy_data->blkcg and cpd_to_blkcg() added so that
cpd_init_fn() can determine the associated blkcg from
blkcg_policy_data.
v2: blkcg_policy_data->blkcg initializations were missing. Added.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The newly added ->pd_alloc_fn() and ->pd_free_fn() deal with pd
(blkg_policy_data) while the older ones use blkg (blkcg_gq). As using
blkg doesn't make sense for ->pd_alloc_fn() and after allocation pd
can always be mapped to blkg and given that these are policy-specific
methods, it makes sense to converge on pd.
This patch makes all methods deal with pd instead of blkg. Most
conversions are trivial. In blk-cgroup.c, a couple method invocation
sites now test whether pd exists instead of policy state for
consistency. This shouldn't cause any behavioral differences.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
With the recent addition of alloc and free methods, things became
messier. This patch reorganizes them according to the followings.
* ->pd_alloc_fn()
Responsible for allocation and static initializations - the ones
which can be done independent of where the pd might be attached.
* ->pd_init_fn()
Initializations which require the knowledge of where the pd is
attached.
* ->pd_free_fn()
The counter part of pd_alloc_fn(). Static de-init and freeing.
This leaves ->pd_exit_fn() without any users. Removed.
While at it, collapse an one liner function throtl_pd_exit(), which
has only one user, into its user.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
A blkg (blkcg_gq) represents the relationship between a cgroup and
request_queue. Each active policy has a pd (blkg_policy_data) on each
blkg. The pd's were allocated by blkcg core and each policy could
request to allocate extra space at the end by setting
blkcg_policy->pd_size larger than the size of pd.
This is a bit unusual but was done this way mostly to simplify error
handling and all the existing use cases could be handled this way;
however, this is becoming too restrictive now that percpu memory can
be allocated without blocking.
This introduces two new mandatory blkcg_policy methods - pd_alloc_fn()
and pd_free_fn() - which are used to allocate and release pd for a
given policy. As pd allocation is now done from policy side, it can
simply allocate a larger area which embeds pd at the beginning. This
change makes ->pd_size pointless. Removed.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
blkg_create() allows NULL ->pd_init_fn() but blkcg_activate_policy()
doesn't. As both in-kernel policies implement ->pd_init_fn, it
currently doesn't break anything. Update blkcg_activate_policy() so
that its behavior is consistent with blkg_create().
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
When a policy gets activated, it needs to allocate and install its
policy data on all existing blkg's (blkcg_gq's). Because blkg
iteration is protected by a spinlock, it currently counts the total
number of blkg's in the system, allocates the matching number of
policy data on a list and installs them during a single iteration.
This can be simplified by using speculative GFP_NOWAIT allocations
while iterating and falling back to a preallocated policy data on
failure. If the preallocated one has already been consumed, it
releases the lock, preallocate with GFP_KERNEL and then restarts the
iteration. This can be a bit more expensive than before but policy
activation is a very cold path and shouldn't matter.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
blkcg_css_alloc() bypasses policy data allocation and blkcg_css_free()
bypasses policy data and blkcg freeing for blkcg_root. There's no
reason to to treat policy data any differently for blkcg_root. If the
root css gets allocated after policies are registered, policy
registration path will add policy data; otherwise, the alloc path
will. The free path isn't never invoked for root csses.
This patch removes the unnecessary special handling of blkcg_root from
css_alloc/free paths.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
When blkcg_init_queue() fails midway after creating a new blkg, it
performs kfree() directly; however, this doesn't free the policy data
areas. Make it use blkg_free() instead. In turn, blkg_free() is
updated to handle root request_list special case.
While this fixes a possible memory leak, it's on an unlikely failure
path of an already cold path and the size leaked per occurrence is
miniscule too. I don't think it needs to be tagged for -stable.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
blkcg performs several allocations to track IOs per cgroup and enforce
resource control. Most of these allocations are performed lazily on
demand in the IO path and thus can't involve reclaim path. Currently,
these allocations use GFP_ATOMIC; however, blkcg can gracefully deal
with occassional failures of these allocations by punting IOs to the
root cgroup and there's no reason to reach into the emergency reserve.
This patch replaces GFP_ATOMIC with GFP_NOWAIT for the following
allocations.
* bdi_writeback_congested and blkcg_gq allocations in blkg_create().
* radix tree node allocations for blkcg->blkg_tree.
* cfq_queue allocation on ioprio changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-and-Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Suggested-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
When a blkcg configuration is targeted to a partition rather than a
whole device, blkg_conf_prep fails with -EINVAL; unfortunately, it
forgets to put the gendisk ref in that case. Fix it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
e48453c386 ("block, cgroup: implement policy-specific per-blkcg
data") updated per-blkcg policy data to be dynamically allocated.
When a policy is registered, its policy data aren't created. Instead,
when the policy is activated on a queue, the policy data are allocated
if there are blkg's (blkcg_gq's) which are attached to a given blkcg.
This is buggy. Consider the following scenario.
1. A blkcg is created. No blkg's attached yet.
2. The policy is registered. No policy data is allocated.
3. The policy is activated on a queue. As the above blkcg doesn't
have any blkg's, it won't allocate the matching blkcg_policy_data.
4. An IO is issued from the blkcg and blkg is created and the blkcg
still doesn't have the matching policy data allocated.
With cfq-iosched, this leads to an oops.
It also doesn't free policy data on policy unregistration assuming
that freeing of all policy data on blkcg destruction should take care
of it; however, this also is incorrect.
1. A blkcg has policy data.
2. The policy gets unregistered but the policy data remains.
3. Another policy gets registered on the same slot.
4. Later, the new policy tries to allocate policy data on the previous
blkcg but the slot is already occupied and gets skipped. The
policy ends up operating on the policy data of the previous policy.
There's no reason to manage blkcg_policy_data lazily. The reason we
do lazy allocation of blkg's is that the number of all possible blkg's
is the product of cgroups and block devices which can reach a
surprising level. blkcg_policy_data is contrained by the number of
cgroups and shouldn't be a problem.
This patch makes blkcg_policy_data to be allocated for all existing
blkcg's on policy registration and freed on unregistration and removes
blkcg_policy_data handling from policy [de]activation paths. This
makes that blkcg_policy_data are created and removed with the policy
they belong to and fixes the above described problems.
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: e48453c386 ("block, cgroup: implement policy-specific per-blkcg data")
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Add all_blkcgs list goes through blkcg->all_blkcgs_node and is
protected by blkcg_pol_mutex. This will be used to fix
blkcg_policy_data allocation bug.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
An entry in blkcg_policy[] is stable while there are non-bypassing
in-flight IOs on a request_queue which has the policy activated. This
is why most derefs of blkcg_policy[] don't need explicit locking;
however, blkcg_css_alloc() isn't invoked from IO path and thus doesn't
have this protection and may race policies being added and removed.
Fix it by adding explicit blkcg_pol_mutex protection around
blkcg_policy[] iteration in blkcg_css_alloc().
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: e48453c386 ("block, cgroup: implement policy-specific per-blkcg data")
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
blkcg_pol_mutex primarily protects the blkcg_policy array. It also
protects cgroup file type [un]registration during policy addition /
removal. This puts blkcg_pol_mutex outside cgroup internal
synchronization and in turn makes it impossible to grab from blkcg's
cgroup methods as that leads to cyclic dependency.
Another problematic dependency arising from this is through cgroup
interface file deactivation. Removing a cftype requires removing all
files of the type which in turn involves draining all on-going
invocations of the file methods. This means that an interface file
implementation can't grab blkcg_pol_mutex as draining can lead to AA
deadlock.
blkcg_reset_stats() is already in this situation. It currently
trylocks blkcg_pol_mutex and then unwinds and retries the whole
operation on failure, which is cumbersome at best. It has a lengthy
comment explaining how cgroup internal synchronization is involved and
expected to be updated but as explained above this doesn't need cgroup
internal locking to deadlock. It's a self-contained AA deadlock.
The described circular dependencies can be easily broken by moving
cftype [un]registration out of blkcg_pol_mutex and protect them with
an outer mutex. This patch introduces blkcg_pol_register_mutex which
wraps entire policy [un]registration including cftype operations and
shrinks blkcg_pol_mutex critical section. This also makes the trylock
dancing in blkcg_reset_stats() unnecessary. Removed.
This patch is necessary for the following blkcg_policy_data allocation
bug fixes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Currently, per-blkcg data is freed each time a policy is deactivated,
that is also upon scheduler switch. However, when switching from a
scheduler implementing a policy which requires per-blkcg data to
another one, that same policy might be active on other devices, and
therefore those same per-blkcg data could be still in use.
This commit lets per-blkcg data be freed when the blkcg is freed
instead of on policy deactivation.
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
Reported-and-tested-by: Michael Kaminsky <kaminsky@cs.cmu.edu>
Fixes: e48453c3 ("block, cgroup: implement policy-specific per-blkcg data")
Signed-off-by: Jens Axboe <axboe@fb.com>
Pull cgroup writeback support from Jens Axboe:
"This is the big pull request for adding cgroup writeback support.
This code has been in development for a long time, and it has been
simmering in for-next for a good chunk of this cycle too. This is one
of those problems that has been talked about for at least half a
decade, finally there's a solution and code to go with it.
Also see last weeks writeup on LWN:
http://lwn.net/Articles/648292/"
* 'for-4.2/writeback' of git://git.kernel.dk/linux-block: (85 commits)
writeback, blkio: add documentation for cgroup writeback support
vfs, writeback: replace FS_CGROUP_WRITEBACK with SB_I_CGROUPWB
writeback: do foreign inode detection iff cgroup writeback is enabled
v9fs: fix error handling in v9fs_session_init()
bdi: fix wrong error return value in cgwb_create()
buffer: remove unusued 'ret' variable
writeback: disassociate inodes from dying bdi_writebacks
writeback: implement foreign cgroup inode bdi_writeback switching
writeback: add lockdep annotation to inode_to_wb()
writeback: use unlocked_inode_to_wb transaction in inode_congested()
writeback: implement unlocked_inode_to_wb transaction and use it for stat updates
writeback: implement [locked_]inode_to_wb_and_lock_list()
writeback: implement foreign cgroup inode detection
writeback: make writeback_control track the inode being written back
writeback: relocate wb[_try]_get(), wb_put(), inode_{attach|detach}_wb()
mm: vmscan: disable memcg direct reclaim stalling if cgroup writeback support is in use
writeback: implement memcg writeback domain based throttling
writeback: reset wb_domain->dirty_limit[_tstmp] when memcg domain size changes
writeback: implement memcg wb_domain
writeback: update wb_over_bg_thresh() to use wb_domain aware operations
...
The block IO (blkio) controller enables the block layer to provide service
guarantees in a hierarchical fashion. Specifically, service guarantees
are provided by registered request-accounting policies. As of now, a
proportional-share and a throttling policy are available. They are
implemented, respectively, by the CFQ I/O scheduler and the blk-throttle
subsystem. Unfortunately, as for adding new policies, the current
implementation of the block IO controller is only halfway ready to allow
new policies to be plugged in. This commit provides a solution to make
the block IO controller fully ready to handle new policies.
In what follows, we first describe briefly the current state, and then
list the changes made by this commit.
The throttling policy does not need any per-cgroup information to perform
its task. In contrast, the proportional share policy uses, for each cgroup,
both the weight assigned by the user to the cgroup, and a set of dynamically-
computed weights, one for each device.
The first, user-defined weight is stored in the blkcg data structure: the
block IO controller allocates a private blkcg data structure for each
cgroup in the blkio cgroups hierarchy (regardless of which policy is active).
In other words, the block IO controller internally mirrors the blkio cgroups
with private blkcg data structures.
On the other hand, for each cgroup and device, the corresponding dynamically-
computed weight is maintained in the following, different way. For each device,
the block IO controller keeps a private blkcg_gq structure for each cgroup in
blkio. In other words, block IO also keeps one private mirror copy of the blkio
cgroups hierarchy for each device, made of blkcg_gq structures.
Each blkcg_gq structure keeps per-policy information in a generic array of
dynamically-allocated 'dedicated' data structures, one for each registered
policy (so currently the array contains two elements). To be inserted into the
generic array, each dedicated data structure embeds a generic blkg_policy_data
structure. Consider now the array contained in the blkcg_gq structure
corresponding to a given pair of cgroup and device: one of the elements
of the array contains the dedicated data structure for the proportional-share
policy, and this dedicated data structure contains the dynamically-computed
weight for that pair of cgroup and device.
The generic strategy adopted for storing per-policy data in blkcg_gq structures
is already capable of handling new policies, whereas the one adopted with blkcg
structures is not, because per-policy data are hard-coded in the blkcg
structures themselves (currently only data related to the proportional-
share policy).
This commit addresses the above issues through the following changes:
. It generalizes blkcg structures so that per-policy data are stored in the same
way as in blkcg_gq structures.
Specifically, it lets also the blkcg structure store per-policy data in a
generic array of dynamically-allocated dedicated data structures. We will
refer to these data structures as blkcg dedicated data structures, to
distinguish them from the dedicated data structures inserted in the generic
arrays kept by blkcg_gq structures.
To allow blkcg dedicated data structures to be inserted in the generic array
inside a blkcg structure, this commit also introduces a new blkcg_policy_data
structure, which is the equivalent of blkg_policy_data for blkcg dedicated
data structures.
. It adds to the blkcg_policy structure, i.e., to the descriptor of a policy, a
cpd_size field and a cpd_init field, to be initialized by the policy with,
respectively, the size of the blkcg dedicated data structures, and the
address of a constructor function for blkcg dedicated data structures.
. It moves the CFQ-specific fields embedded in the blkcg data structure (i.e.,
the fields related to the proportional-share policy), into a new blkcg
dedicated data structure called cfq_group_data.
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
A blkg (blkcg_gq) can be congested and decongested independently from
other blkgs on the same request_queue. Accordingly, for cgroup
writeback support, the congestion status at bdi (backing_dev_info)
should be split and updated separately from matching blkg's.
This patch prepares by adding blkg->wb_congested and associating a
blkg with its matching per-blkcg bdi_writeback_congested on creation.
v2: Updated to associate bdi_writeback_congested instead of
bdi_writeback.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
For the planned cgroup writeback support, on each bdi
(backing_dev_info), each memcg will be served by a separate wb
(bdi_writeback). This patch updates bdi so that a bdi can host
multiple wbs (bdi_writebacks).
On the default hierarchy, blkcg implicitly enables memcg. This allows
using memcg's page ownership for attributing writeback IOs, and every
memcg - blkcg combination can be served by its own wb by assigning a
dedicated wb to each memcg. This means that there may be multiple
wb's of a bdi mapped to the same blkcg. As congested state is per
blkcg - bdi combination, those wb's should share the same congested
state. This is achieved by tracking congested state via
bdi_writeback_congested structs which are keyed by blkcg.
bdi->wb remains unchanged and will keep serving the root cgroup.
cgwb's (cgroup wb's) for non-root cgroups are created on-demand or
looked up while dirtying an inode according to the memcg of the page
being dirtied or current task. Each cgwb is indexed on bdi->cgwb_tree
by its memcg id. Once an inode is associated with its wb, it can be
retrieved using inode_to_wb().
Currently, none of the filesystems has FS_CGROUP_WRITEBACK and all
pages will keep being associated with bdi->wb.
v3: inode_attach_wb() in account_page_dirtied() moved inside
mapping_cap_account_dirty() block where it's known to be !NULL.
Also, an unnecessary NULL check before kfree() removed. Both
detected by the kbuild bot.
v2: Updated so that wb association is per inode and wb is per memcg
rather than blkcg.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: kbuild test robot <fengguang.wu@intel.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
Add global constant blkcg_root_css which points to &blkcg_root.css.
This will be used by cgroup writeback support. If blkcg is disabled,
it's defined as ERR_PTR(-EINVAL).
v2: The declarations moved to include/linux/blk-cgroup.h as suggested
by Vivek.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@fb.com>
Currently, blkcg does a minor optimization where the root blkcg is
created when the first blkcg policy is activated on a queue and
destroyed on the deactivation of the last. On systems where blkcg is
configured but not used, this saves one blkcg_gq struct per queue. On
systems where blkcg is actually used, there's no difference. The only
case where this can lead to any meaninful, albeit still minute, save
in memory consumption is when all blkcg policies are deactivated after
being widely used in the system, which is a hihgly unlikely scenario.
The conditional existence of root blkcg_gq has already created several
bugs in blkcg and became an issue once again for the new per-cgroup
wb_congested mechanism for cgroup writeback support leading to a NULL
dereference when no blkcg policy is active. This is really not worth
bothering with. This patch makes blkcg always allocate and link the
root blkcg_gq and release it only on queue destruction.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
cgroup aware writeback support will require exposing some of blkcg
details. In preprataion, move block/blk-cgroup.h to
include/linux/blk-cgroup.h. This patch is pure file move.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
blkcg->id is a unique id given to each blkcg; however, the
cgroup_subsys_state which each blkcg embeds already has ->serial_nr
which can be used for the same purpose. Drop blkcg->id and replace
its uses with blkcg->css.serial_nr. Rename cfq_cgroup->blkcg_id to
->blkcg_serial_nr and @id in check_blkcg_changed() to @serial_nr for
consistency.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Pull cgroup changes from Tejun Heo:
"Mostly changes to get the v2 interface ready. The core features are
mostly ready now and I think it's reasonable to expect to drop the
devel mask in one or two devel cycles at least for a subset of
controllers.
- cgroup added a controller dependency mechanism so that block cgroup
can depend on memory cgroup. This will be used to finally support
IO provisioning on the writeback traffic, which is currently being
implemented.
- The v2 interface now uses a separate table so that the interface
files for the new interface are explicitly declared in one place.
Each controller will explicitly review and add the files for the
new interface.
- cpuset is getting ready for the hierarchical behavior which is in
the similar style with other controllers so that an ancestor's
configuration change doesn't change the descendants' configurations
irreversibly and processes aren't silently migrated when a CPU or
node goes down.
All the changes are to the new interface and no behavior changed for
the multiple hierarchies"
* 'for-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (29 commits)
cpuset: fix the WARN_ON() in update_nodemasks_hier()
cgroup: initialize cgrp_dfl_root_inhibit_ss_mask from !->dfl_files test
cgroup: make CFTYPE_ONLY_ON_DFL and CFTYPE_NO_ internal to cgroup core
cgroup: distinguish the default and legacy hierarchies when handling cftypes
cgroup: replace cgroup_add_cftypes() with cgroup_add_legacy_cftypes()
cgroup: rename cgroup_subsys->base_cftypes to ->legacy_cftypes
cgroup: split cgroup_base_files[] into cgroup_{dfl|legacy}_base_files[]
cpuset: export effective masks to userspace
cpuset: allow writing offlined masks to cpuset.cpus/mems
cpuset: enable onlined cpu/node in effective masks
cpuset: refactor cpuset_hotplug_update_tasks()
cpuset: make cs->{cpus, mems}_allowed as user-configured masks
cpuset: apply cs->effective_{cpus,mems}
cpuset: initialize top_cpuset's configured masks at mount
cpuset: use effective cpumask to build sched domains
cpuset: inherit ancestor's masks if effective_{cpus, mems} becomes empty
cpuset: update cs->effective_{cpus, mems} when config changes
cpuset: update cpuset->effective_{cpus,mems} at hotplug
cpuset: add cs->effective_cpus and cs->effective_mems
cgroup: clean up sane_behavior handling
...