evie/android_kernel_oneplus_msm8998 - Gay Catgirls Forgejo: gay catgirls having sex

evie/android_kernel_oneplus_msm8998

Author	SHA1	Message	Date
Trilok Soni	5ab1e18aa3	Revert "Merge remote-tracking branch 'msm-4.4/tmp-510d0a3f' into msm-4.4" This reverts commit `9d6fd2c3e9` ("Merge remote-tracking branch 'msm-4.4/tmp-510d0a3f' into msm-4.4"), because it breaks the dump parsing tools due to kernel can be loaded anywhere in the memory now and not fixed at linear mapping. Change-Id: Id416f0a249d803442847d09ac47781147b0d0ee6 Signed-off-by: Trilok Soni <tsoni@codeaurora.org>	2016-08-26 14:34:05 -07:00
Trilok Soni	9d6fd2c3e9	Merge remote-tracking branch 'msm-4.4/tmp-510d0a3f' into msm-4.4 * msm-4.4/tmp-510d0a3f: Linux 4.4.11 nf_conntrack: avoid kernel pointer value leak in slab name drm/radeon: fix DP link training issue with second 4K monitor drm/i915/bdw: Add missing delay during L3 SQC credit programming drm/i915: Bail out of pipe config compute loop on LPT drm/radeon: fix PLL sharing on DCE6.1 (v2) Revert "[media] videobuf2-v4l2: Verify planes array in buffer dequeueing" Input: max8997-haptic - fix NULL pointer dereference get_rock_ridge_filename(): handle malformed NM entries tools lib traceevent: Do not reassign parg after collapse_tree() qla1280: Don't allocate 512kb of host tags atomic_open(): fix the handling of create_error regulator: axp20x: Fix axp22x ldo_io voltage ranges regulator: s2mps11: Fix invalid selector mask and voltages for buck9 workqueue: fix rebind bound workers warning ARM: dts: at91: sam9x5: Fix the memory range assigned to the PMC vfs: rename: check backing inode being equal vfs: add vfs_select_inode() helper perf/core: Disable the event on a truncated AUX record regmap: spmi: Fix regmap_spmi_ext_read in multi-byte case pinctrl: at91-pio4: fix pull-up/down logic spi: spi-ti-qspi: Handle truncated frames properly spi: spi-ti-qspi: Fix FLEN and WLEN settings if bits_per_word is overridden spi: pxa2xx: Do not detect number of enabled chip selects on Intel SPT ALSA: hda - Fix broken reconfig ALSA: hda - Fix white noise on Asus UX501VW headset ALSA: hda - Fix subwoofer pin on ASUS N751 and N551 ALSA: usb-audio: Yet another Phoneix Audio device quirk ALSA: usb-audio: Quirk for yet another Phoenix Audio devices (v2) crypto: testmgr - Use kmalloc memory for RSA input crypto: hash - Fix page length clamping in hash walk crypto: qat - fix invalid pf2vf_resp_wq logic s390/mm: fix asce_bits handling with dynamic pagetable levels zsmalloc: fix zs_can_compact() integer overflow ocfs2: fix posix_acl_create deadlock ocfs2: revert using ocfs2_acl_chmod to avoid inode cluster lock hang net/route: enforce hoplimit max value tcp: refresh skb timestamp at retransmit time net: thunderx: avoid exposing kernel stack net: fix a kernel infoleak in x25 module uapi glibc compat: fix compile errors when glibc net/if.h included before linux/if.h MIME-Version: 1.0 bridge: fix igmp / mld query parsing net: bridge: fix old ioctl unlocked net device walk VSOCK: do not disconnect socket when peer has shutdown SEND only net/mlx4_en: Fix endianness bug in IPV6 csum calculation net: fix infoleak in rtnetlink net: fix infoleak in llc net: fec: only clear a queue's work bit if the queue was emptied netem: Segment GSO packets on enqueue sch_dsmark: update backlog as well sch_htb: update backlog as well net_sched: update hierarchical backlog too net_sched: introduce qdisc_replace() helper gre: do not pull header in ICMP error processing net: Implement net_dbg_ratelimited() for CONFIG_DYNAMIC_DEBUG case samples/bpf: fix trace_output example bpf: fix check_map_func_compatibility logic bpf: fix refcnt overflow bpf: fix double-fdput in replace_map_fd_with_map_ptr() net/mlx4_en: fix spurious timestamping callbacks ipv4/fib: don't warn when primary address is missing if in_dev is dead net/mlx5e: Fix minimum MTU net/mlx5e: Device's mtu field is u16 and not int openvswitch: use flow protocol when recalculating ipv6 checksums atl2: Disable unimplemented scatter/gather feature vlan: pull on __vlan_insert_tag error path and fix csum correction net: use skb_postpush_rcsum instead of own implementations cdc_mbim: apply "NDP to end" quirk to all Huawei devices bpf/verifier: reject invalid LD_ABS \| BPF_DW instruction net: sched: do not requeue a NULL skb packet: fix heap info leak in PACKET_DIAG_MCLIST sock_diag interface route: do not cache fib route info on local routes with oif decnet: Do not build routes to devices without decnet private data. parisc: Use generic extable search and sort routines arm64: kasan: Use actual memory node when populating the kernel image shadow arm64: mm: treat memstart_addr as a signed quantity arm64: lse: deal with clobbered IP registers after branch via PLT arm64: mm: check at build time that PAGE_OFFSET divides the VA space evenly arm64: kasan: Fix zero shadow mapping overriding kernel image shadow arm64: consistently use p?d_set_huge arm64: fix KASLR boot-time I-cache maintenance arm64: hugetlb: partial revert of 66b3923a1a0f arm64: make irq_stack_ptr more robust arm64: efi: invoke EFI_RNG_PROTOCOL to supply KASLR randomness efi: stub: use high allocation for converted command line efi: stub: add implementation of efi_random_alloc() efi: stub: implement efi_get_random_bytes() based on EFI_RNG_PROTOCOL arm64: kaslr: randomize the linear region arm64: add support for kernel ASLR arm64: add support for building vmlinux as a relocatable PIE binary arm64: switch to relative exception tables extable: add support for relative extables to search and sort routines scripts/sortextable: add support for ET_DYN binaries arm64: futex.h: Add missing PAN toggling arm64: make asm/elf.h available to asm files arm64: avoid dynamic relocations in early boot code arm64: avoid R_AARCH64_ABS64 relocations for Image header fields arm64: add support for module PLTs arm64: move brk immediate argument definitions to separate header arm64: mm: use bit ops rather than arithmetic in pa/va translations arm64: mm: only perform memstart_addr sanity check if DEBUG_VM arm64: User die() instead of panic() in do_page_fault() arm64: allow kernel Image to be loaded anywhere in physical memory arm64: defer __va translation of initrd_start and initrd_end arm64: move kernel image to base of vmalloc area arm64: kvm: deal with kernel symbols outside of linear mapping arm64: decouple early fixmap init from linear mapping arm64: pgtable: implement static [pte\|pmd\|pud]_offset variants arm64: introduce KIMAGE_VADDR as the virtual base of the kernel region arm64: add support for ioremap() block mappings arm64: prevent potential circular header dependencies in asm/bug.h of/fdt: factor out assignment of initrd_start/initrd_end of/fdt: make memblock minimum physical address arch configurable arm64: Remove the get_thread_info() function arm64: kernel: Don't toggle PAN on systems with UAO arm64: cpufeature: Test 'matches' pointer to find the end of the list arm64: kernel: Add support for User Access Override arm64: add ARMv8.2 id_aa64mmfr2 boiler plate arm64: cpufeature: Change read_cpuid() to use sysreg's mrs_s macro arm64: use local label prefixes for __reg_num symbols arm64: vdso: Mark vDSO code as read-only arm64: ubsan: select ARCH_HAS_UBSAN_SANITIZE_ALL arm64: ptdump: Indicate whether memory should be faulting arm64: Add support for ARCH_SUPPORTS_DEBUG_PAGEALLOC arm64: Drop alloc function from create_mapping arm64: prefetch: add missing #include for spin_lock_prefetch arm64: lib: patch in prfm for copy_page if requested arm64: lib: improve copy_page to deal with 128 bytes at a time arm64: prefetch: add alternative pattern for CPUs without a prefetcher arm64: prefetch: don't provide spin_lock_prefetch with LSE arm64: allow vmalloc regions to be set with set_memory_* arm64: kernel: implement ACPI parking protocol arm64: mm: create new fine-grained mappings at boot arm64: ensure _stext and _etext are page-aligned arm64: mm: allow passing a pgdir to alloc_init_* arm64: mm: allocate pagetables anywhere arm64: mm: use fixmap when creating page tables arm64: mm: add functions to walk tables in fixmap arm64: mm: add __{pud,pgd}_populate arm64: mm: avoid redundant __pa(__va(x)) arm64: mm: add functions to walk page tables by PA arm64: mm: move pte_* macros arm64: kasan: avoid TLB conflicts arm64: mm: add code to safely replace TTBR1_EL1 arm64: add function to install the idmap arm64: unmap idmap earlier arm64: unify idmap removal arm64: mm: place empty_zero_page in bss arm64: mm: specialise pagetable allocators asm-generic: Fix local variable shadow in __set_fixmap_offset Eliminate the .eh_frame sections from the aarch64 vmlinux and kernel modules arm64: Fix an enum typo in mm/dump.c arm64: kasan: ensure that the KASAN zero page is mapped read-only arch/arm64/include/asm/pgtable.h: add pmd_mkclean for THP arm64: hide __efistub_ aliases from kallsyms Linux 4.4.10 drm/i915/skl: Fix DMC load on Skylake J0 and K0 lib/test-string_helpers.c: fix and improve string_get_size() tests ACPI / processor: Request native thermal interrupt handling via _OSC drm/i915: Fake HDMI live status drm/i915: Make RPS EI/thresholds multiple of 25 on SNB-BDW drm/i915: Fix eDP low vswing for Broadwell drm/i915/ddi: Fix eDP VDD handling during booting and suspend/resume drm/radeon: make sure vertical front porch is at least 1 iio: ak8975: fix maybe-uninitialized warning iio: ak8975: Fix NULL pointer exception on early interrupt drm/amdgpu: set metadata pointer to NULL after freeing. drm/amdgpu: make sure vertical front porch is at least 1 gpu: ipu-v3: Fix imx-ipuv3-crtc module autoloading nvmem: mxs-ocotp: fix buffer overflow in read USB: serial: cp210x: add Straizona Focusers device ids USB: serial: cp210x: add ID for Link ECU ata: ahci-platform: Add ports-implemented DT bindings. libahci: save port map for forced port map powerpc: Fix bad inline asm constraint in create_zero_mask() ACPICA: Dispatcher: Update thread ID for recursive method calls x86/sysfb_efi: Fix valid BAR address range check ARC: Add missing io barriers to io{read,write}{16,32}be() ARM: cpuidle: Pass on arm_cpuidle_suspend()'s return value propogate_mnt: Handle the first propogated copy being a slave fs/pnode.c: treat zero mnt_group_id-s as unequal x86/tsc: Read all ratio bits from MSR_PLATFORM_INFO MAINTAINERS: Remove asterisk from EFI directory names writeback: Fix performance regression in wb_over_bg_thresh() batman-adv: Reduce refcnt of removed router when updating route batman-adv: Fix broadcast/ogm queue limit on a removed interface batman-adv: Check skb size before using encapsulated ETH+VLAN header batman-adv: fix DAT candidate selection (must use vid) mm: update min_free_kbytes from khugepaged after core initialization proc: prevent accessing /proc/<PID>/environ until it's ready Input: zforce_ts - fix dual touch recognition HID: Fix boot delay for Creative SB Omni Surround 5.1 with quirk HID: wacom: Add support for DTK-1651 xen/evtchn: fix ring resize when binding new events xen/balloon: Fix crash when ballooning on x86 32 bit PAE xen: Fix page <-> pfn conversion on 32 bit systems ARM: SoCFPGA: Fix secondary CPU startup in thumb2 kernel ARM: EXYNOS: Properly skip unitialized parent clock in power domain on mm/zswap: provide unique zpool name mm, cma: prevent nr_isolated_* counters from going negative Minimal fix-up of bad hashing behavior of hash_64() MD: make bio mergeable tracing: Don't display trigger file for events that can't be enabled mac80211: fix statistics leak if dev_alloc_name() fails ath9k: ar5008_hw_cmn_spur_mitigate: add missing mask_m & mask_p initialisation lpfc: fix misleading indentation clk: qcom: msm8960: Fix ce3_src register offset clk: versatile: sp810: support reentrance clk: qcom: msm8960: fix ce3_core clk enable register clk: meson: Fix meson_clk_register_clks() signature type mismatch clk: rockchip: free memory in error cases when registering clock branches soc: rockchip: power-domain: fix err handle while probing clk-divider: make sure read-only dividers do not write to their register CNS3xxx: Fix PCI cns3xxx_write_config() mwifiex: fix corner case association failure ata: ahci_xgene: dereferencing uninitialized pointer in probe nbd: ratelimit error msgs after socket close mfd: intel-lpss: Remove clock tree on error path ipvs: drop first packet to redirect conntrack ipvs: correct initial offset of Call-ID header search in SIP persistence engine ipvs: handle ip_vs_fill_iph_skb_off failure RDMA/iw_cxgb4: Fix bar2 virt addr calculation for T4 chips Revert: "powerpc/tm: Check for already reclaimed tasks" arm64: head.S: use memset to clear BSS efi: stub: define DISABLE_BRANCH_PROFILING for all architectures arm64: entry: remove pointless SPSR mode check arm64: mm: move pgd_cache initialisation to pgtable_cache_init arm64: module: avoid undefined shift behavior in reloc_data() arm64: module: fix relocation of movz instruction with negative immediate arm64: traps: address fallout from printk -> pr_* conversion arm64: ftrace: fix a stack tracer's output under function graph tracer arm64: pass a task parameter to unwind_frame() arm64: ftrace: modify a stack frame in a safe way arm64: remove irq_count and do_softirq_own_stack() arm64: hugetlb: add support for PTE contiguous bit arm64: Use PoU cache instr for I/D coherency arm64: Defer dcache flush in __cpu_copy_user_page arm64: reduce stack use in irq_handler arm64: Documentation: add list of software workarounds for errata arm64: mm: place __cpu_setup in .text arm64: cmpxchg: Don't incldue linux/mmdebug.h arm64: mm: fold alternatives into .init arm64: Remove redundant padding from linker script arm64: mm: remove pointless PAGE_MASKing arm64: don't call C code with el0's fp register arm64: when walking onto the task stack, check sp & fp are in current->stack arm64: Add this_cpu_ptr() assembler macro for use in entry.S arm64: irq: fix walking from irq stack to task stack arm64: Add do_softirq_own_stack() and enable irq_stacks arm64: Modify stack trace and dump for use with irq_stack arm64: Store struct thread_info in sp_el0 arm64: Add trace_hardirqs_off annotation in ret_to_user arm64: ftrace: fix the comments for ftrace_modify_code arm64: ftrace: stop using kstop_machine to enable/disable tracing arm64: spinlock: serialise spin_unlock_wait against concurrent lockers arm64: enable HAVE_IRQ_TIME_ACCOUNTING arm64: fix COMPAT_SHMLBA definition for large pages arm64: add __init/__initdata section marker to some functions/variables arm64: pgtable: implement pte_accessible() arm64: mm: allow sections for unaligned bases arm64: mm: detect bad __create_mapping uses Linux 4.4.9 extcon: max77843: Use correct size for reading the interrupt register stm class: Select CONFIG_SRCU megaraid_sas: add missing curly braces in ioctl handler sunrpc/cache: drop reference when sunrpc_cache_pipe_upcall() detects a race thermal: rockchip: fix a impossible condition caused by the warning unbreak allmodconfig KCONFIG_ALLCONFIG=... jme: Fix device PM wakeup API usage jme: Do not enable NIC WoL functions on S0 bus: imx-weim: Take the 'status' property value into account ARM: dts: pxa: fix dma engine node to pxa3xx-nand ARM: dts: armada-375: use armada-370-sata for SATA ARM: EXYNOS: select THERMAL_OF ARM: prima2: always enable reset controller ARM: OMAP3: Add cpuidle parameters table for omap3430 ext4: fix races of writeback with punch hole and zero range ext4: fix races between buffered IO and collapse / insert range ext4: move unlocked dio protection from ext4_alloc_file_blocks() ext4: fix races between page faults and hole punching perf stat: Document --detailed option perf tools: handle spaces in file names obtained from /proc/pid/maps perf hists browser: Only offer symbol scripting when a symbol is under the cursor mtd: nand: Drop mtd.owner requirement in nand_scan mtd: brcmnand: Fix v7.1 register offsets mtd: spi-nor: remove micron_quad_enable() serial: sh-sci: Remove cpufreq notifier to fix crash/deadlock ext4: fix NULL pointer dereference in ext4_mark_inode_dirty() x86/mm/kmmio: Fix mmiotrace for hugepages perf evlist: Reference count the cpu and thread maps at set_maps() drivers/misc/ad525x_dpot: AD5274 fix RDAC read back errors rtc: max77686: Properly handle regmap_irq_get_virq() error code rtc: rx8025: remove rv8803 id rtc: ds1685: passing bogus values to irq_restore rtc: vr41xx: Wire up alarm_irq_enable rtc: hym8563: fix invalid year calculation PM / Domains: Fix removal of a subdomain PM / OPP: Initialize u_volt_min/max to a valid value misc: mic/scif: fix wrap around tests misc/bmp085: Enable building as a module lib/mpi: Endianness fix fbdev: da8xx-fb: fix videomodes of lcd panels scsi_dh: force modular build if SCSI is a module paride: make 'verbose' parameter an 'int' again regulator: s5m8767: fix get_register() error handling irqchip/mxs: Fix error check of of_io_request_and_map() irqchip/sunxi-nmi: Fix error check of of_io_request_and_map() spi/rockchip: Make sure spi clk is on in rockchip_spi_set_cs locking/mcs: Fix mcs_spin_lock() ordering regulator: core: Fix nested locking of supplies regulator: core: Ensure we lock all regulators regulator: core: fix regulator_lock_supply regression Revert "regulator: core: Fix nested locking of supplies" videobuf2-v4l2: Verify planes array in buffer dequeueing videobuf2-core: Check user space planes array in dqbuf USB: usbip: fix potential out-of-bounds write cgroup: make sure a parent css isn't freed before its children mm/hwpoison: fix wrong num_poisoned_pages accounting mm: vmscan: reclaim highmem zone if buffer_heads is over limit numa: fix /proc/<pid>/numa_maps for THP mm/huge_memory: replace VM_NO_THP VM_BUG_ON with actual VMA check memcg: relocate charge moving from ->attach to ->post_attach cgroup, cpuset: replace cpuset_post_attach_flush() with cgroup_subsys->post_attach callback slub: clean up code for kmem cgroup support to kmem_cache_free_bulk workqueue: fix ghost PENDING flag while doing MQ IO x86/apic: Handle zero vector gracefully in clear_vector_irq() efi: Expose non-blocking set_variable() wrapper to efivars efi: Fix out-of-bounds read in variable_matches() IB/security: Restrict use of the write() interface IB/mlx5: Expose correct max_sge_rd limit cxl: Keep IRQ mappings on context teardown v4l2-dv-timings.h: fix polarity for 4k formats vb2-memops: Fix over allocation of frame vectors ASoC: rt5640: Correct the digital interface data select ASoC: dapm: Make sure we have a card when displaying component widgets ASoC: ssm4567: Reset device before regcache_sync() ASoC: s3c24xx: use const snd_soc_component_driver pointer EDAC: i7core, sb_edac: Don't return NOTIFY_BAD from mce_decoder callback toshiba_acpi: Fix regression caused by hotkey enabling value i2c: exynos5: Fix possible ABBA deadlock by keeping I2C clock prepared i2c: cpm: Fix build break due to incompatible pointer types perf intel-pt: Fix segfault tracing transactions drm/i915: Use fw_domains_put_with_fifo() on HSW drm/i915: Fixup the free space logic in ring_prepare drm/amdkfd: uninitialized variable in dbgdev_wave_control_set_registers() drm/i915: skl_update_scaler() wants a rotation bitmask instead of bit number drm/i915: Cleanup phys status page too pwm: brcmstb: Fix check of devm_ioremap_resource() return code drm/dp/mst: Get validated port ref in drm_dp_update_payload_part1() drm/dp/mst: Restore primary hub guid on resume drm/dp/mst: Validate port in drm_dp_payload_send_msg() drm/nouveau/gr/gf100: select a stream master to fixup tfb offset queries drm: Loongson-3 doesn't fully support wc memory drm/radeon: fix vertical bars appear on monitor (v2) drm/radeon: forbid mapping of userptr bo through radeon device file drm/radeon: fix initial connector audio value drm/radeon: add a quirk for a XFX R9 270X drm/amdgpu: fix regression on CIK (v2) amdgpu/uvd: add uvd fw version for amdgpu drm/amdgpu: bump the afmt limit for CZ, ST, Polaris drm/amdgpu: use defines for CRTCs and AMFT blocks drm/amdgpu: when suspending, if uvd/vce was running. need to cancel delay work. iommu/dma: Restore scatterlist offsets correctly iommu/amd: Fix checking of pci dma aliases pinctrl: single: Fix pcs_parse_bits_in_pinctrl_entry to use __ffs than ffs pinctrl: mediatek: correct debounce time unit in mtk_gpio_set_debounce xen kconfig: don't "select INPUT_XEN_KBDDEV_FRONTEND" Input: pmic8xxx-pwrkey - fix algorithm for converting trigger delay Input: gtco - fix crash on detecting device without endpoints netlink: don't send NETLINK_URELEASE for unbound sockets nl80211: check netlink protocol in socket release notification powerpc: Update TM user feature bits in scan_features() powerpc: Update cpu_user_features2 in scan_features() powerpc: scan_features() updates incorrect bits for REAL_LE crypto: talitos - fix AEAD tcrypt tests crypto: talitos - fix crash in talitos_cra_init() crypto: sha1-mb - use corrcet pointer while completing jobs crypto: ccp - Prevent information leakage on export iwlwifi: mvm: fix memory leak in paging iwlwifi: pcie: lower the debug level for RSA semaphore access s390/pci: add extra padding to function measurement block cpufreq: intel_pstate: Fix processing for turbo activation ratio Revert "drm/amdgpu: disable runtime pm on PX laptops without dGPU power control" Revert "drm/radeon: disable runtime pm on PX laptops without dGPU power control" drm/i915: Fix race condition in intel_dp_destroy_mst_connector() drm/qxl: fix cursor position with non-zero hotspot drm/nouveau/core: use vzalloc for allocating ramht futex: Acknowledge a new waiter in counter before plist futex: Handle unlock_pi race gracefully asm-generic/futex: Re-enable preemption in futex_atomic_cmpxchg_inatomic() ALSA: hda - Add dock support for ThinkPad X260 ALSA: pcxhr: Fix missing mutex unlock ALSA: hda - add PCI ID for Intel Broxton-T ALSA: hda - Keep powering up ADCs on Cirrus codecs ALSA: hda/realtek - Add ALC3234 headset mode for Optiplex 9020m ALSA: hda - Don't trust the reported actual power state x86 EDAC, sb_edac.c: Repair damage introduced when "fixing" channel address x86/mm/xen: Suppress hugetlbfs in PV guests arm64: Update PTE_RDONLY in set_pte_at() for PROT_NONE permission arm64: Honour !PTE_WRITE in set_pte_at() for kernel mappings sched/cgroup: Fix/cleanup cgroup teardown/init dmaengine: pxa_dma: fix the maximum requestor line dmaengine: hsu: correct use of channel status register dmaengine: dw: fix master selection debugfs: Make automount point inodes permanently empty lib: lz4: fixed zram with lz4 on big endian machines dm cache metadata: fix cmd_read_lock() acquiring write lock dm cache metadata: fix READ_LOCK macros and cleanup WRITE_LOCK macros usb: gadget: f_fs: Fix use-after-free usb: hcd: out of bounds access in for_each_companion xhci: fix 10 second timeout on removal of PCI hotpluggable xhci controllers usb: xhci: fix wild pointers in xhci_mem_cleanup xhci: resume USB 3 roothub first usb: xhci: applying XHCI_PME_STUCK_QUIRK to Intel BXT B0 host assoc_array: don't call compare_object() on a node ARM: OMAP2+: hwmod: Fix updating of sysconfig register ARM: OMAP2: Fix up interconnect barrier initialization for DRA7 ARM: mvebu: Correct unit address for linksys ARM: dts: AM43x-epos: Fix clk parent for synctimer KVM: arm/arm64: Handle forward time correction gracefully kvm: x86: do not leak guest xcr0 into host interrupt handlers x86/mce: Avoid using object after free in genpool block: loop: fix filesystem corruption in case of aio/dio block: partition: initialize percpuref before sending out KOBJ_ADD Conflicts: arch/arm64/Kconfig arch/arm64/include/asm/cputype.h arch/arm64/include/asm/hardirq.h arch/arm64/include/asm/irq.h arch/arm64/kernel/cpu_errata.c arch/arm64/kernel/cpuinfo.c arch/arm64/kernel/setup.c arch/arm64/kernel/smp.c arch/arm64/kernel/stacktrace.c arch/arm64/mm/init.c arch/arm64/mm/mmu.c arch/arm64/mm/pageattr.c mm/memcontrol.c CRs-Fixed: 1054234 Signed-off-by: Trilok Soni <tsoni@codeaurora.org> Change-Id: I2a7a34631ffee36ce18b9171f16d023be777392f	2016-08-18 14:50:45 -07:00
Trilok Soni	f145f41478	Merge remote-tracking branch 'msm-4.4/tmp-2bf7955' into msm-4.4 * msm-4.4/tmp-2bf7955: Linux 4.4.8 Revert "usb: hub: do not clear BOS field during reset device" usbvision: fix crash on detecting device with invalid configuration staging: android: ion: Set the length of the DMA sg entries in buffer Revert "PCI, x86: Implement pcibios_alloc_irq() and pcibios_free_irq()" Revert "PCI: Add helpers to manage pci_dev->irq and pci_dev->irq_managed" Revert "x86/PCI: Don't alloc pcibios-irq when MSI is enabled" HID: usbhid: fix inconsistent reset/resume/reset-resume behavior HID: wacom: fix Bamboo ONE oops ALSA: usb-audio: Skip volume controls triggers hangup on Dell USB Dock ALSA: usb-audio: Add a quirk for Plantronics BT300 ALSA: usb-audio: Add a sample rate quirk for Phoenix Audio TMX320 ALSA: hda/realtek - Enable the ALC292 dock fixup on the Thinkpad T460s ALSA: hda - fix front mic problem for a HP desktop ALSA: hda - Fix headset support and noise on HP EliteBook 755 G2 ALSA: hda - Fixup speaker pass-through control for nid 0x14 on ALC225 mmc: sdhci-pci: Add support and PCI IDs for more Broxton host controllers perf: Cure event->pending_disable race perf: Do not double free arm64: replace read_lock to rcu lock in call_step_hook Btrfs: fix file/data loss caused by fsync after rename and new inode iommu: Don't overwrite domain pointer when there is no default_domain ext4: ignore quota mount options if the quota feature is enabled ext4: add lockdep annotations for i_data_sem btrfs: fix crash/invalid memory access on fsync when using overlayfs nfs: use file_dentry() fs: add file_dentry() sd: Fix excessive capacity printing on devices with blocks bigger than 512 bytes iio: gyro: bmg160: fix endianness when reading axes iio: gyro: bmg160: fix buffer read values iio: accel: bmc150: fix endianness when reading axes iio: st_magn: always define ST_MAGN_TRIGGER_SET_STATE usb: renesas_usbhs: fix to avoid using a disabled ep in usbhsg_queue_done() usb: renesas_usbhs: disable TX IRQ before starting TX DMAC transfer usb: renesas_usbhs: avoid NULL pointer derefernce in usbhsf_pkt_handler() mac80211: fix txq queue related crashes mac80211: fix unnecessary frame drops in mesh fwding mac80211: fix ibss scan parameters mac80211: avoid excessive stack usage in sta_info mac80211: properly deal with station hashtable insert errors virtio: virtio 1.0 cs04 spec compliance for reset rbd: use GFP_NOIO consistently for request allocations pcmcia: db1xxx_ss: fix last irq_to_gpio user v4l: vsp1: Set the SRU CTRL0 register when starting the stream coda: fix error path in case of missing pdata on non-DT platform au0828: Fix dev_state handling au0828: fix au0828_v4l2_close() dev_state race condition pinctrl: freescale: imx: fix bogus check of of_iomap() return value pinctrl: nomadik: fix pull debug print inversion pinctrl: sunxi: Fix A33 external interrupts not working pinctrl: sh-pfc: only use dummy states for non-DT platforms pinctrl: pistachio: fix mfio84-89 function description and pinmux. MIPS: Fix MSA ld unaligned failure cases KVM: x86: reduce default value of halt_poll_ns parameter KVM: x86: Inject pending interrupt even if pending nmi exist cdc-acm: fix NULL pointer reference USB: uas: Add a new NO_REPORT_LUNS quirk USB: uas: Limit qdepth at the scsi-host level mpls: find_outdev: check for err ptr in addition to NULL check ipv6: Count in extension headers in skb->network_header ip6_tunnel: set rtnl_link_ops before calling register_netdevice ipv6: l2tp: fix a potential issue in l2tp_ip6_recv ipv4: l2tp: fix a potential issue in l2tp_ip_recv tuntap: restore default qdisc tun, bpf: fix suspicious RCU usage in tun_{attach, detach}_filter rtnl: fix msg size calculation in if_nlmsg_size() bridge: Allow set bridge ageing time when switchdev disabled ipv6: udp: fix UDP_MIB_IGNOREDMULTI updates qmi_wwan: add "D-Link DWM-221 B1" device id xfrm: Fix crash observed during device unregistration and decryption ppp: take reference on channels netns ipv4: initialize flowi4_flags before calling fib_lookup() ipv4: fix broadcast packets reception bonding: fix bond_get_stats() net: bcmgenet: fix dma api length mismatch qlge: Fix receive packets drop. tcp/dccp: remove obsolete WARN_ON() in icmp handlers ppp: ensure file->private_data can't be overridden ath9k: fix buffer overrun for ar9287 farsync: fix off-by-one bug in fst_add_one mlx4: add missing braces in verify_qp_parameters net: Fix use after free in the recvmmsg exit path ipv4: Don't do expensive useless work during inetdev destroy. bridge: allow zero ageing time rocker: set FDB cleanup timer according to lowest ageing time mlxsw: spectrum: Check requested ageing time is valid macvtap: always pass ethernet header in linear qlcnic: Fix mailbox completion handling during spurious interrupt qlcnic: Remove unnecessary usage of atomic_t sh_eth: advance 'rxdesc' later in sh_eth_ring_format() sh_eth: fix NULL pointer dereference in sh_eth_ring_format() bpf: avoid copying junk bytes in bpf_get_current_comm() packet: validate variable length ll headers ax25: add link layer header validation function net: validate variable length ll headers ppp: release rtnl mutex when interface creation fails tcp: fix tcpi_segs_in after connection establishment udp6: fix UDP/IPv6 encap resubmit path usbnet: cleanup after bind() in probe() cdc_ncm: toggle altsetting to force reset before setup vxlan: fix missing options_len update on RX with collect metadata ipv6: re-enable fragment header matching in ipv6_find_hdr qmi_wwan: add Sierra Wireless EM74xx device ID tipc: Revert "tipc: use existing sk_write_queue for outgoing packet chain" mld, igmp: Fix reserved tailroom calculation sctp: lack the check for ports in sctp_v6_cmp_addr net: fix bridge multicast packet checksum validation net: qca_spi: clear IFF_TX_SKB_SHARING net: qca_spi: Don't clear IFF_BROADCAST net: vrf: Remove direct access to skb->data net: jme: fix suspend/resume on JMC260 ipv4: only create late gso-skb if skb is already set up with CHECKSUM_PARTIAL tunnel: Clear IPCB(skb)->opt before dst_link_failure called tcp: convert cached rtt from usec to jiffies when feeding initial rto xen/events: Mask a moving irq drm/amdgpu/gmc: use proper register for vram type on Fiji drm/amdgpu/gmc: move vram type fetching into sw_init drm/radeon: add a dpm quirk for all R7 370 parts drm/radeon: add another R7 370 quirk drm/radeon: add a dpm quirk for sapphire Dual-X R7 370 2G D5 drm/udl: Use unlocked gem unreferencing drm/dp: move hw_mutex up the call stack arm64: opcodes.h: Add arm big-endian config options before including arm header compiler-gcc: disable -ftracer for __noclone functions libnvdimm, pfn: fix uuid validation libnvdimm: fix smart data retrieval powerpc/mm: Fixup preempt underflow with huge pages mm: fix invalid node in alloc_migrate_target() ALSA: hda - Apply fix for white noise on Asus N550JV, too ALSA: hda - Fix white noise on Asus N750JV headphone ALSA: hda - Asus N750JV external subwoofer fixup ALSA: timer: Use mod_timer() for rearming the system timer parisc: Unbreak handling exceptions from kernel modules parisc: Fix kernel crash with reversed copy_from_user() parisc: Avoid function pointers for kernel exception routines PKCS#7: pkcs7_validate_trust(): initialize the _trusted output argument hwmon: (max1111) Return -ENODEV from max1111_read_channel if not instantiated Linux 4.4.7 perf/x86/intel: Fix PEBS data source interpretation on Nehalem/Westmere perf/x86/intel: Use PAGE_SIZE for PEBS buffer size on Core2 perf/x86/intel: Fix PEBS warning by only restoring active PMU in pmi perf/x86/pebs: Add workaround for broken OVFL status on HSW+ sched/cputime: Fix steal time accounting vs. CPU hotplug scsi_common: do not clobber fixed sense information PM / sleep: Clear pm_suspend_global_flags upon hibernate intel_idle: prevent SKL-H boot failure when C8+C9+C10 enabled mtd: onenand: fix deadlock in onenand_block_markbad mm/page_alloc: prevent merging between isolated and other pageblocks ocfs2/dlm: fix BUG in dlm_move_lockres_to_recovery_list ocfs2/dlm: fix race between convert and recovery Input: ati_remote2 - fix crashes on detecting device with invalid descriptor Input: ims-pcu - sanity check against missing interfaces Input: synaptics - handle spurious release of trackstick buttons, again writeback, cgroup: fix use of the wrong bdi_writeback which mismatches the inode writeback, cgroup: fix premature wb_put() in locked_inode_to_wb_and_lock_list() ACPI / PM: Runtime resume devices when waking from hibernate ARM: dts: at91: sama5d4 Xplained: don't disable hsmci regulator ARM: dts: at91: sama5d3 Xplained: don't disable hsmci regulator nfsd: fix deadlock secinfo+readdir compound nfsd4: fix bad bounds checking iser-target: Rework connection termination iser-target: Separate flows for np listeners and connections cma events iser-target: Add new state ISER_CONN_BOUND to isert_conn iser-target: Fix identification of login rx descriptor type target: Fix target_release_cmd_kref shutdown comp leak clk: bcm2835: Fix setting of PLL divider clock rates clk: rockchip: add hclk_cpubus to the list of rk3188 critical clocks clk: rockchip: rk3368: fix hdmi_cec gate-register clk: rockchip: rk3368: fix parents of video encoder/decoder clk: rockchip: rk3368: fix cpuclk core dividers clk: rockchip: rk3368: fix cpuclk mux bit of big cpu-cluster mmc: sdhci: Fix override of timeout clk wrt max_busy_timeout mmc: sdhci: fix data timeout (part 2) mmc: sdhci: fix data timeout (part 1) mmc: mmc_spi: Add Card Detect comments and fix CD GPIO case mmc: block: fix ABI regression of mmc_blk_ioctl ideapad-laptop: Add ideapad Y700 (15) to the no_hw_rfkill DMI list MAINTAINERS: Update mailing list and web page for hwmon subsystem kbuild/mkspec: fix grub2 installkernel issue scripts/kconfig: allow building with make 3.80 again scripts/coccinelle: modernize & bitops: Do not default to __clear_bit() for __clear_bit_unlock() tracing: Fix trace_printk() to print when not using bprintk() tracing: Fix crash from reading trace_pipe with sendfile tracing: Have preempt(irqs)off trace preempt disabled functions IB/ipoib: fix for rare multicast join race condition drm/amdgpu: include the right version of gmc header files for iceland drm/amdgpu: disable runtime pm on PX laptops without dGPU power control drm/radeon: Don't drop DP 2.7 Ghz link setup on some cards. drm/radeon: disable runtime pm on PX laptops without dGPU power control iwlwifi: mvm: Fix paging memory leak ipr: Fix regression when loading firmware ipr: Fix out-of-bounds null overwrite rapidio/rionet: fix deadlock on SMP fs/coredump: prevent fsuid=0 dumps into user-controlled directories fuse: Add reference counting for fuse_io_priv fuse: do not use iocb after it may have been freed md: multipath: don't hardcopy bio in .make_request path md/raid5: preserve STRIPE_PREREAD_ACTIVE in break_stripe_batch_list raid10: include bio_end_io_list in nr_queued to prevent freeze_array hang RAID5: revert `e9e4c377e2` to fix a livelock RAID5: check_reshape() shouldn't call mddev_suspend md/raid5: Compare apples to apples (or sectors to sectors) raid1: include bio_end_io_list in nr_queued to prevent freeze_array hang xfs: fix two memory leaks in xfs_attr_list.c error paths quota: Fix possible GPF due to uninitialised pointers ARC: bitops: Remove non relevant comments ARC: [BE] readl()/writel() to work in Big Endian CPU configuration xtensa: clear all DBREAKC registers on start xtensa: fix preemption in {clear,copy}_user_highpage xtensa: ISS: don't hang if stdin EOF is reached splice: handle zero nr_pages in splice_to_pipe() vfs: show_vfsstat: do not ignore errors from show_devname method of: alloc anywhere from memblock if range not specified net: mvneta: enable change MAC address when interface is up cgroup: ignore css_sets associated with dead cgroups during migration Bluetooth: Fix potential buffer overflow with Add Advertising Bluetooth: Add new AR3012 ID 0489:e095 watchdog: rc32434_wdt: fix ioctl error handling watchdog: don't run proc_watchdog_update if new value is same as old ia64: define ioremap_uc() mm: memcontrol: reclaim and OOM kill when shrinking memory.max below usage mm: memcontrol: reclaim when shrinking memory.high below usage bcache: fix cache_set_flush() NULL pointer dereference on OOM bcache: fix race of writeback thread starting before complete initialization bcache: cleaned up error handling around register_cache() IB/srpt: Simplify srpt_handle_tsk_mgmt() brd: Fix discard request processing jbd2: fix FS corruption possibility in jbd2_journal_destroy() on umount path tools/hv: Use include/uapi with __EXPORTED_HEADERS__ ALSA: hda - Fix unconditional GPIO toggle via automute ALSA: hda - fix the mic mute button and led problem for a Lenovo AIO ALSA: hda - Don't handle ELD notify from invalid port ALSA: intel8x0: Add clock quirk entry for AD1981B on IBM ThinkPad X41. ALSA: pcm: Avoid "BUG:" string for warnings again ALSA: hda - Apply reboot D3 fix for CX20724 codec, too mtip32xx: Cleanup queued requests after surprise removal mtip32xx: Implement timeout handler mtip32xx: Handle FTL rebuild failure state during device initialization mtip32xx: Handle safe removal during IO mtip32xx: Fix for rmmod crash when drive is in FTL rebuild mtip32xx: Print exact time when an internal command is interrupted mtip32xx: Remove unwanted code from taskfile error handler mtip32xx: Fix broken service thread handling mtip32xx: Avoid issuing standby immediate cmd during FTL rebuild media: v4l2-compat-ioctl32: fix missing length copy in put_v4l2_buffer32 coda: fix first encoded frame payload bttv: Width must be a multiple of 16 when capturing planar formats adv7511: TX_EDID_PRESENT is still 1 after a disconnect saa7134: Fix bytesperline not being set correctly for planar formats 8250: use callbacks to access UART_DLL/UART_DLM net: irda: Fix use-after-free in irtty_open() tty: Fix GPF in flush_to_ldisc(), part 2 staging: comedi: ni_mio_common: fix the ni_write[blw]() functions staging: android: ion_test: fix check of platform_device_register_simple() error code staging: comedi: ni_tiocmd: change mistaken use of start_src for start_arg HID: fix hid_ignore_special_drivers module parameter HID: multitouch: force retrieving of Win8 signature blob HID: i2c-hid: fix OOB write in i2c_hid_set_or_send_report() HID: logitech: fix Dual Action gamepad support tpm: fix the cleanup of struct tpm_chip tpm_eventlog.c: fix binary_bios_measurements tpm_crb: tpm2_shutdown() must be called before tpm_chip_unregister() tpm: fix the rollback in tpm_chip_register() mei: bus: check if the device is enabled before data transfer X.509: Fix leap year handling again crypto: marvell/cesa - forward devm_ioremap_resource() error code crypto: ux500 - fix checks of error code returned by devm_ioremap_resource() crypto: atmel - fix checks of error code returned by devm_ioremap_resource() crypto: keywrap - memzero the correct memory crypto: ccp - memset request context to zero during import crypto: ccp - Don't assume export/import areas are aligned crypto: ccp - Limit the amount of information exported crypto: ccp - Add hash state import and export support Bluetooth: btusb: Add a new AR3012 ID 13d3:3472 Bluetooth: btusb: Add a new AR3012 ID 04ca:3014 Bluetooth: btusb: Add new AR3012 ID 13d3:3395 ALSA: usb-audio: Fix double-free in error paths after snd_usb_add_audio_stream() call ALSA: usb-audio: Minor code cleanup in create_fixed_stream_quirk() ALSA: usb-audio: add Microsoft HD-5001 to quirks ALSA: usb-audio: Add sanity checks for endpoint accesses ALSA: usb-audio: Fix NULL dereference in create_fixed_stream_quirk() Input: powermate - fix oops with malicious USB descriptors pwc: Add USB id for Philips Spc880nc webcam USB: option: add "D-Link DWM-221 B1" device id USB: serial: ftdi_sio: Add support for ICP DAS I-756xU devices USB: serial: cp210x: Adding GE Healthcare Device ID USB: cypress_m8: add endpoint sanity check USB: digi_acceleport: do sanity checking for the number of ports USB: mct_u232: add sanity checking in probe USB: usb_driver_claim_interface: add sanity checking USB: iowarrior: fix oops with malicious USB descriptors USB: cdc-acm: more sanity checking USB: uas: Reduce can_queue to MAX_CMNDS usb: hub: fix a typo in hub_port_init() leading to wrong logic usb: retry reset if a device times out dm: fix rq_end_stats() NULL pointer in dm_requeue_original_request() dm cache: make sure every metadata function checks fail_io dm thin metadata: don't issue prefetches if a transaction abort has failed dm: fix excessive dm-mq context switching dm snapshot: disallow the COW and origin devices from being identical libnvdimm: Fix security issue with DSM IOCTL. aic7xxx: Fix queue depth handling be2iscsi: set the boot_kset pointer to NULL in case of failure scsi: storvsc: fix SRB_STATUS_ABORTED handling sd: Fix discard granularity when LBPRZ=1 aacraid: Set correct msix count for EEH recovery aacraid: Fix memory leak in aac_fib_map_free aacraid: Fix RRQ overload sg: fix dxferp in from_to case x86/mm: TLB_REMOTE_SEND_IPI should count pages x86/iopl: Fix iopl capability check on Xen PV x86/iopl/64: Properly context-switch IOPL on Xen PV x86/apic: Fix suspicious RCU usage in smp_trace_call_function_interrupt() x86/irq: Cure live lock in fixup_irqs() PCI: ACPI: IA64: fix IO port generic range check PCI: Disable IO/MEM decoding for devices with non-compliant BARs pinctrl-bcm2835: Fix cut-and-paste error in "pull" parsing s390/pci: enforce fmb page boundary rule s390/cpumf: add missing lpp magic initialization s390: fix floating pointer register corruption (again) EDAC, amd64_edac: Shift wrapping issue in f1x_get_norm_dct_addr() EDAC/sb_edac: Fix computation of channel address sched/preempt, sh: kmap_coherent relies on disabled preemption sched/cputime: Fix steal_account_process_tick() to always return jiffies Thermal: Ignore invalid trip points perf tools: Fix python extension build perf tools: Fix checking asprintf return value perf tools: Dont stop PMU parsing on alias parse error perf/core: Fix perf_sched_count derailment KVM: VMX: fix nested vpid for old KVM guests KVM: VMX: avoid guest hang on invalid invvpid instruction KVM: VMX: avoid guest hang on invalid invept instruction KVM: fix spin_lock_init order on x86 KVM: i8254: change PIT discard tick policy KVM: x86: fix missed hardware breakpoints x86/PCI: Mark Broadwell-EP Home Agent & PCU as having non-compliant BARs perf/x86/intel: Add definition for PT PMI bit x86/entry/compat: Keep TS_COMPAT set during signal delivery x86/microcode: Untangle from BLK_DEV_INITRD x86/microcode/intel: Make early loader look for builtin microcode too mmc: sh_mmcif: Correct TX DMA channel allocation mmc: sh_mmcif: rework dma channel handling ASoC: samsung: pass DMA channels as pointers regulator: core: Fix nested locking of supplies regulator: core: avoid unused variable warning s390/cpumf: Fix lpp detection cpufreq: dt: No need to allocate resources anymore cpufreq: dt: No need to fetch voltage-tolerance cpufreq: dt: Use dev_pm_opp_set_rate() to switch frequency cpufreq: dt: Reuse dev_pm_opp_get_max_transition_latency() cpufreq: dt: Unsupported OPPs are already disabled cpufreq: dt: Pass regulator name to the OPP core cpufreq: dt: OPP layers handles clock-latency for V1 bindings as well cpufreq: dt: Rename 'need_update' to 'opp_v1' cpufreq: dt: Convert few pr_debug/err() calls to dev_dbg/err() cpufreq-dt: fix handling regulator_get_voltage() result cpufreq-dt: Supply power coefficient when registering cooling devices PM / OPP: Rename structures for clarity PM / OPP: Fix incorrect comments PM / OPP: Initialize regulator pointer to an error value PM / OPP: Initialize u_volt_min/max to a valid value PM / OPP: Fix NULL pointer dereference crash when disabling OPPs PM / OPP: Add dev_pm_opp_set_rate() PM / OPP: Manage device clk PM / OPP: Parse clock-latency and voltage-tolerance for v1 bindings PM / OPP: Introduce dev_pm_opp_get_max_transition_latency() PM / OPP: Introduce dev_pm_opp_get_max_volt_latency() PM / OPP: Disable OPPs that aren't supported by the regulator PM / OPP: get/put regulators from OPP core cpufreq: cpufreq-dt: avoid uninitialized variable warnings: PM / OPP: Use snprintf() instead of sprintf() PM / OPP: Set cpu_dev->id in cpumask first PM / OPP: Fix parsing of opp-microvolt and opp-microamp properties PM / OPP: Parse 'opp-<prop>-<name>' bindings PM / OPP: Parse 'opp-supported-hw' binding PM / OPP: Add missing doc comments PM / OPP: Rename OPP nodes as opp@<opp-hz> PM / OPP: Remove 'operating-points-names' binding PM / OPP: Add {opp-microvolt\|opp-microamp}-<name> binding PM / OPP: Add "opp-supported-hw" binding PM / OPP: Add debugfs support arm64: vdso: Mark vDSO code as read-only Conflicts: drivers/staging/android/ion/ion.c mm/page_alloc.c CRs-Fixed: 1010239 Change-Id: Id59539cad642885e1e41340cebae4159ba1f7eaf Signed-off-by: Trilok Soni <tsoni@codeaurora.org>	2016-07-22 16:45:32 -07:00
Tejun Heo	52526076a5	memcg: relocate charge moving from ->attach to ->post_attach commit 264a0ae164bc0e9144bebcd25ff030d067b1a878 upstream. Hello, So, this ended up a lot simpler than I originally expected. I tested it lightly and it seems to work fine. Petr, can you please test these two patches w/o the lru drain drop patch and see whether the problem is gone? Thanks. ------ 8< ------ If charge moving is used, memcg performs relabeling of the affected pages from its ->attach callback which is called under both cgroup_threadgroup_rwsem and thus can't create new kthreads. This is fragile as various operations may depend on workqueues making forward progress which relies on the ability to create new kthreads. There's no reason to perform charge moving from ->attach which is deep in the task migration path. Move it to ->post_attach which is called after the actual migration is finished and cgroup_threadgroup_rwsem is dropped. * move_charge_struct->mm is added and ->can_attach is now responsible for pinning and recording the target mm. mem_cgroup_clear_mc() is updated accordingly. This also simplifies mem_cgroup_move_task(). * mem_cgroup_move_task() is now called from ->post_attach instead of ->attach. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@kernel.org> Debugged-and-tested-by: Petr Mladek <pmladek@suse.com> Reported-by: Cyril Hrubis <chrubis@suse.cz> Reported-by: Johannes Weiner <hannes@cmpxchg.org> Fixes: `1ed1328792` ("sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem") Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2016-05-04 14:48:49 -07:00
Johannes Weiner	0ccab5b139	mm: memcontrol: reclaim and OOM kill when shrinking memory.max below usage commit b6e6edcfa40561e9c8abe5eecf1c96f8e5fd9c6f upstream. Setting the original memory.limit_in_bytes hardlimit is subject to a race condition when the desired value is below the current usage. The code tries a few times to first reclaim and then see if the usage has dropped to where we would like it to be, but there is no locking, and the workload is free to continue making new charges up to the old limit. Thus, attempting to shrink a workload relies on pure luck and hope that the workload happens to cooperate. To fix this in the cgroup2 memory.max knob, do it the other way round: set the limit first, then try enforcement. And if reclaim is not able to succeed, trigger OOM kills in the group. Keep going until the new limit is met, we run out of OOM victims and there's only unreclaimable memory left, or the task writing to memory.max is killed. This allows users to shrink groups reliably, and the behavior is consistent with what happens when new charges are attempted in excess of memory.max. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2016-04-12 09:08:54 -07:00
Johannes Weiner	8b42fc47e1	mm: memcontrol: reclaim when shrinking memory.high below usage commit 588083bb37a3cea8533c392370a554417c8f29cb upstream. When setting memory.high below usage, nothing happens until the next charge comes along, and then it will only reclaim its own charge and not the now potentially huge excess of the new memory.high. This can cause groups to stay in excess of their memory.high indefinitely. To fix that, when shrinking memory.high, kick off a reclaim cycle that goes after the delta. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2016-04-12 09:08:53 -07:00
Martijn Coenen	ececa3ebe2	memcg: only free spare array when readers are done commit 6611d8d76132f86faa501de9451a89bf23fb2371 upstream. A spare array holding mem cgroup threshold events is kept around to make sure we can always safely deregister an event and have an array to store the new set of events in. In the scenario where we're going from 1 to 0 registered events, the pointer to the primary array containing 1 event is copied to the spare slot, and then the spare slot is freed because no events are left. However, it is freed before calling synchronize_rcu(), which means readers may still be accessing threshold->primary after it is freed. Fixed by only freeing after synchronize_rcu(). Signed-off-by: Martijn Coenen <maco@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2016-02-25 12:01:22 -08:00
Martijn Coenen	e212fc789a	UPSTREAM: memcg: Only free spare array when readers are done A spare array holding mem cgroup threshold events is kept around to make sure we can always safely deregister an event and have an array to store the new set of events in. In the scenario where we're going from 1 to 0 registered events, the pointer to the primary array containing 1 event is copied to the spare slot, and then the spare slot is freed because no events are left. However, it is freed before calling synchronize_rcu(), which means readers may still be accessing threshold->primary after it is freed. Fixed by only freeing after synchronize_rcu(). Signed-off-by: Martijn Coenen <maco@google.com>	2016-02-16 13:53:47 -08:00
Amit Pundir	703920c14a	cgroup: refactor allow_attach handler for 4.4 Refactor *allow_attach() handler to align it with the changes from mainline commit `1f7dd3e5a6` "cgroup: fix handling of multi-destination migration from subtree_control enabling". Signed-off-by: Amit Pundir <amit.pundir@linaro.org>	2016-02-16 13:53:46 -08:00
Amit Pundir	dccfe9526b	cgroup: memcg: pass correct argument to subsys_cgroup_allow_attach Pass correct argument to subsys_cgroup_allow_attach(), which expects 'struct cgroup_subsys_state ' argument but we pass 'struct cgroup ' instead which doesn't seem right. This fixes following 'incompatible pointer type' compiler warning: ---------- CC mm/memcontrol.o mm/memcontrol.c: In function ‘mem_cgroup_allow_attach’: mm/memcontrol.c:5052:2: warning: passing argument 1 of ‘subsys_cgroup_allow_attach’ from incompatible pointer type [enabled by default] In file included from include/linux/memcontrol.h:22:0, from mm/memcontrol.c:29: include/linux/cgroup.h:953:5: note: expected ‘struct cgroup_subsys_state ’ but argument is of type ‘struct cgroup ’ ---------- Signed-off-by: Amit Pundir <amit.pundir@linaro.org>	2016-02-16 13:53:44 -08:00
Rom Lemarchand	e6f5c0c0ec	memcg: add permission check Use the 'allow_attach' handler for the 'mem' cgroup to allow non-root processes to add arbitrary processes to a 'mem' cgroup if it has the CAP_SYS_NICE capability set. Bug: 18260435 Change-Id: If7d37bf90c1544024c4db53351adba6a64966250 Signed-off-by: Rom Lemarchand <romlem@android.com>	2016-02-16 13:53:43 -08:00
Vladimir Davydov	6df38689e0	mm: memcontrol: fix possible memcg leak due to interrupted reclaim Memory cgroup reclaim can be interrupted with mem_cgroup_iter_break() once enough pages have been reclaimed, in which case, in contrast to a full round-trip over a cgroup sub-tree, the current position stored in mem_cgroup_reclaim_iter of the target cgroup does not get invalidated and so is left holding the reference to the last scanned cgroup. If the target cgroup does not get scanned again (we might have just reclaimed the last page or all processes might exit and free their memory voluntary), we will leak it, because there is nobody to put the reference held by the iterator. The problem is easy to reproduce by running the following command sequence in a loop: mkdir /sys/fs/cgroup/memory/test echo 100M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes echo $$ > /sys/fs/cgroup/memory/test/cgroup.procs memhog 150M echo $$ > /sys/fs/cgroup/memory/cgroup.procs rmdir test The cgroups generated by it will never get freed. This patch fixes this issue by making mem_cgroup_iter avoid taking reference to the current position. In order not to hit use-after-free bug while running reclaim in parallel with cgroup deletion, we make use of ->css_released cgroup callback to clear references to the dying cgroup in all reclaim iterators that might refer to it. This callback is called right before scheduling rcu work which will free css, so if we access iter->position from rcu read section, we might be sure it won't go away under us. [hannes@cmpxchg.org: clean up css ref handling] Fixes: `5ac8fb31ad` ("mm: memcontrol: convert reclaim iterator to simple css refcounting") Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: <stable@vger.kernel.org> [3.19+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-12-29 17:45:49 -08:00
Hugh Dickins	25be6a6595	mm: fix kerneldoc on mem_cgroup_replace_page Whoops, I missed removing the kerneldoc comment of the lrucare arg removed from mem_cgroup_replace_page; but it's a good comment, keep it. Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-12-12 10:15:34 -08:00
Vladimir Davydov	9516a18a9a	memcg: fix memory.high target When the memory.high threshold is exceeded, try_charge() schedules a task_work to reclaim the excess. The reclaim target is set to the number of pages requested by try_charge(). This is wrong, because try_charge() usually charges more pages than requested (batch > nr_pages) in order to refill per cpu stocks. As a result, a process in a cgroup can easily exceed memory.high significantly when doing a lot of charges w/o returning to userspace (e.g. reading a file in big chunks). Fix this issue by assuring that when exceeding memory.high a process reclaims as many pages as were actually charged (i.e. batch). Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-12-12 10:15:34 -08:00
Tejun Heo	1f7dd3e5a6	cgroup: fix handling of multi-destination migration from subtree_control enabling Consider the following v2 hierarchy. P0 (+memory) --- P1 (-memory) --- A \- B P0 has memory enabled in its subtree_control while P1 doesn't. If both A and B contain processes, they would belong to the memory css of P1. Now if memory is enabled on P1's subtree_control, memory csses should be created on both A and B and A's processes should be moved to the former and B's processes the latter. IOW, enabling controllers can cause atomic migrations into different csses. The core cgroup migration logic has been updated accordingly but the controller migration methods haven't and still assume that all tasks migrate to a single target css; furthermore, the methods were fed the css in which subtree_control was updated which is the parent of the target csses. pids controller depends on the migration methods to move charges and this made the controller attribute charges to the wrong csses often triggering the following warning by driving a counter negative. WARNING: CPU: 1 PID: 1 at kernel/cgroup_pids.c:97 pids_cancel.constprop.6+0x31/0x40() Modules linked in: CPU: 1 PID: 1 Comm: systemd Not tainted 4.4.0-rc1+ #29 ... ffffffff81f65382 ffff88007c043b90 ffffffff81551ffc 0000000000000000 ffff88007c043bc8 ffffffff810de202 ffff88007a752000 ffff88007a29ab00 ffff88007c043c80 ffff88007a1d8400 0000000000000001 ffff88007c043bd8 Call Trace: [<ffffffff81551ffc>] dump_stack+0x4e/0x82 [<ffffffff810de202>] warn_slowpath_common+0x82/0xc0 [<ffffffff810de2fa>] warn_slowpath_null+0x1a/0x20 [<ffffffff8118e031>] pids_cancel.constprop.6+0x31/0x40 [<ffffffff8118e0fd>] pids_can_attach+0x6d/0xf0 [<ffffffff81188a4c>] cgroup_taskset_migrate+0x6c/0x330 [<ffffffff81188e05>] cgroup_migrate+0xf5/0x190 [<ffffffff81189016>] cgroup_attach_task+0x176/0x200 [<ffffffff8118949d>] __cgroup_procs_write+0x2ad/0x460 [<ffffffff81189684>] cgroup_procs_write+0x14/0x20 [<ffffffff811854e5>] cgroup_file_write+0x35/0x1c0 [<ffffffff812e26f1>] kernfs_fop_write+0x141/0x190 [<ffffffff81265f88>] __vfs_write+0x28/0xe0 [<ffffffff812666fc>] vfs_write+0xac/0x1a0 [<ffffffff81267019>] SyS_write+0x49/0xb0 [<ffffffff81bcef32>] entry_SYSCALL_64_fastpath+0x12/0x76 This patch fixes the bug by removing @css parameter from the three migration methods, ->can_attach, ->cancel_attach() and ->attach() and updating cgroup_taskset iteration helpers also return the destination css in addition to the task being migrated. All controllers are updated accordingly. * Controllers which don't care whether there are one or multiple target csses can be converted trivially. cpu, io, freezer, perf, netclassid and netprio fall in this category. * cpuset's current implementation assumes that there's single source and destination and thus doesn't support v2 hierarchy already. The only change made by this patchset is how that single destination css is obtained. * memory migration path already doesn't do anything on v2. How the single destination css is obtained is updated and the prep stage of mem_cgroup_can_attach() is reordered to accomodate the change. * pids is the only controller which was affected by this bug. It now correctly handles multi-destination migrations and no longer causes counter underflow from incorrect accounting. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-and-tested-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Cc: Aleksa Sarai <cyphar@cyphar.com>	2015-12-03 10:18:21 -05:00
Andrew Morton	6f6461562e	mm/memcontrol.c: uninline mem_cgroup_usage gcc version 5.2.1 20151010 (Debian 5.2.1-22) $ size mm/memcontrol.o mm/memcontrol.o.before text data bss dec hex filename 35535 7908 64 43507 a9f3 mm/memcontrol.o 35762 7908 64 43734 aad6 mm/memcontrol.o.before Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-11-06 17:50:42 -08:00
Mel Gorman	71baba4b92	mm, page_alloc: rename __GFP_WAIT to __GFP_RECLAIM __GFP_WAIT was used to signal that the caller was in atomic context and could not sleep. Now it is possible to distinguish between true atomic context and callers that are not willing to sleep. The latter should clear __GFP_DIRECT_RECLAIM so kswapd will still wake. As clearing __GFP_WAIT behaves differently, there is a risk that people will clear the wrong flags. This patch renames __GFP_WAIT to __GFP_RECLAIM to clearly indicate what it does -- setting it allows all reclaim activity, clearing them prevents it. [akpm@linux-foundation.org: fix build] [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-11-06 17:50:42 -08:00
Mel Gorman	d0164adc89	mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd __GFP_WAIT has been used to identify atomic context in callers that hold spinlocks or are in interrupts. They are expected to be high priority and have access one of two watermarks lower than "min" which can be referred to as the "atomic reserve". __GFP_HIGH users get access to the first lower watermark and can be called the "high priority reserve". Over time, callers had a requirement to not block when fallback options were available. Some have abused __GFP_WAIT leading to a situation where an optimisitic allocation with a fallback option can access atomic reserves. This patch uses __GFP_ATOMIC to identify callers that are truely atomic, cannot sleep and have no alternative. High priority users continue to use __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM to identify callers that want to wake kswapd for background reclaim. __GFP_WAIT is redefined as a caller that is willing to enter direct reclaim and wake kswapd for background reclaim. This patch then converts a number of sites o __GFP_ATOMIC is used by callers that are high priority and have memory pools for those requests. GFP_ATOMIC uses this flag. o Callers that have a limited mempool to guarantee forward progress clear __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall into this category where kswapd will still be woken but atomic reserves are not used as there is a one-entry mempool to guarantee progress. o Callers that are checking if they are non-blocking should use the helper gfpflags_allow_blocking() where possible. This is because checking for __GFP_WAIT as was done historically now can trigger false positives. Some exceptions like dm-crypt.c exist where the code intent is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to flag manipulations. o Callers that built their own GFP flags instead of starting with GFP_KERNEL and friends now also need to specify __GFP_KSWAPD_RECLAIM. The first key hazard to watch out for is callers that removed __GFP_WAIT and was depending on access to atomic reserves for inconspicuous reasons. In some cases it may be appropriate for them to use __GFP_HIGH. The second key hazard is callers that assembled their own combination of GFP flags instead of starting with something like GFP_KERNEL. They may now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless if it's missed in most cases as other activity will wake kswapd. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-11-06 17:50:42 -08:00
Linus Torvalds	2e3078af2c	Merge branch 'akpm' (patches from Andrew) Merge patch-bomb from Andrew Morton: - inotify tweaks - some ocfs2 updates (many more are awaiting review) - various misc bits - kernel/watchdog.c updates - Some of mm. I have a huge number of MM patches this time and quite a lot of it is quite difficult and much will be held over to next time. * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (162 commits) selftests: vm: add tests for lock on fault mm: mlock: add mlock flags to enable VM_LOCKONFAULT usage mm: introduce VM_LOCKONFAULT mm: mlock: add new mlock system call mm: mlock: refactor mlock, munlock, and munlockall code kasan: always taint kernel on report mm, slub, kasan: enable user tracking by default with KASAN=y kasan: use IS_ALIGNED in memory_is_poisoned_8() kasan: Fix a type conversion error lib: test_kasan: add some testcases kasan: update reference to kasan prototype repo kasan: move KASAN_SANITIZE in arch/x86/boot/Makefile kasan: various fixes in documentation kasan: update log messages kasan: accurately determine the type of the bad access kasan: update reported bug types for kernel memory accesses kasan: update reported bug types for not user nor kernel memory accesses mm/kasan: prevent deadlock in kasan reporting mm/kasan: don't use kasan shadow pointer in generic functions mm/kasan: MODULE_VADDR is not available on all archs ...	2015-11-05 23:10:54 -08:00
Michal Hocko	c12176d336	memcg: fix thresholds for 32b architectures. Commit `424cdc1413` ("memcg: convert threshold to bytes") has fixed a regression introduced by `3e32cb2e0a` ("mm: memcontrol: lockless page counters") where thresholds were silently converted to use page units rather than bytes when interpreting the user input. The fix is not complete, though, as properly pointed out by Ben Hutchings during stable backport review. The page count is converted to bytes but unsigned long is used to hold the value which would be obviously not sufficient for 32b systems with more than 4G thresholds. The same applies to usage as taken from mem_cgroup_usage which might overflow. Let's remove this bytes vs. pages internal tracking differences and handle thresholds in page units internally. Chage mem_cgroup_usage() to return the value in page units and revert `424cdc1413` because this should be sufficient for the consistent handling. mem_cgroup_read_u64 as the only users of mem_cgroup_usage outside of the threshold handling code is converted to give the proper in bytes result. It is doing that already for page_counter output so this is more consistent as well. The value presented to the userspace is still in bytes units. Fixes: `424cdc1413` ("memcg: convert threshold to bytes") Fixes: `3e32cb2e0a` ("mm: memcontrol: lockless page counters") Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Ben Hutchings <ben@decadent.org.uk> Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: <stable@vger.kernel.org> From: Michal Hocko <mhocko@kernel.org> Subject: memcg-fix-thresholds-for-32b-architectures-fix Cc: Ben Hutchings <ben@decadent.org.uk> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Johannes Weiner <hannes@cmpxchg.org> From: Andrew Morton <akpm@linux-foundation.org> Subject: memcg-fix-thresholds-for-32b-architectures-fix-fix don't attempt to inline mem_cgroup_usage() The compiler ignores the inline anwyay. And __always_inlining it adds 600 bytes of goop to the .o file. Cc: Ben Hutchings <ben@decadent.org.uk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-11-05 19:34:48 -08:00
Johannes Weiner	6071ca5201	mm: page_counter: let page_counter_try_charge() return bool page_counter_try_charge() currently returns 0 on success and -ENOMEM on failure, which is surprising behavior given the function name. Make it follow the expected pattern of try_stuff() functions that return a boolean true to indicate success, or false for failure. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-11-05 19:34:48 -08:00
Johannes Weiner	f5fc3c5d81	mm: memcontrol: eliminate root memory.current memory.current on the root level doesn't add anything that wouldn't be more accurate and detailed using system statistics. It already doesn't include slabs, and it'll be a pain to keep in sync when further memory types are accounted in the memory controller. Remove it. Note that this applies to the new unified hierarchy interface only. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-11-05 19:34:48 -08:00
Hugh Dickins	45637bab30	mm: rename mem_cgroup_migrate to mem_cgroup_replace_page After v4.3's commit `0610c25daa` ("memcg: fix dirty page migration") mem_cgroup_migrate() doesn't have much to offer in page migration: convert migrate_misplaced_transhuge_page() to set_page_memcg() instead. Then rename mem_cgroup_migrate() to mem_cgroup_replace_page(), since its remaining callers are replace_page_cache_page() and shmem_replace_page(): both of whom passed lrucare true, so just eliminate that argument. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Rik van Riel <riel@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-11-05 19:34:48 -08:00
Vladimir Davydov	df4065516b	memcg: simplify and inline __mem_cgroup_from_kmem Before the previous patch ("memcg: unify slab and other kmem pages charging"), __mem_cgroup_from_kmem had to handle two types of kmem - slab pages and pages allocated with alloc_kmem_pages - memcg in the page struct. Now we can unify it. Since after it, this function becomes tiny we can fold it into mem_cgroup_from_kmem. [hughd@google.com: move mem_cgroup_from_kmem into list_lru.c] Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-11-05 19:34:48 -08:00
Vladimir Davydov	f3ccb2c422	memcg: unify slab and other kmem pages charging We have memcg_kmem_charge and memcg_kmem_uncharge methods for charging and uncharging kmem pages to memcg, but currently they are not used for charging slab pages (i.e. they are only used for charging pages allocated with alloc_kmem_pages). The only reason why the slab subsystem uses special helpers, memcg_charge_slab and memcg_uncharge_slab, is that it needs to charge to the memcg of kmem cache while memcg_charge_kmem charges to the memcg that the current task belongs to. To remove this diversity, this patch adds an extra argument to __memcg_kmem_charge that can be a pointer to a memcg or NULL. If it is not NULL, the function tries to charge to the memcg it points to, otherwise it charge to the current context. Next, it makes the slab subsystem use this function to charge slab pages. Since memcg_charge_kmem and memcg_uncharge_kmem helpers are now used only in __memcg_kmem_charge and __memcg_kmem_uncharge, they are inlined. Since __memcg_kmem_charge stores a pointer to the memcg in the page struct, we don't need memcg_uncharge_slab anymore and can use free_kmem_pages. Besides, one can now detect which memcg a slab page belongs to by reading /proc/kpagecgroup. Note, this patch switches slab to charge-after-alloc design. Since this design is already used for all other memcg charges, it should not make any difference. [hannes@cmpxchg.org: better to have an outer function than a magic parameter for the memcg lookup] Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-11-05 19:34:48 -08:00
Vladimir Davydov	d05e83a6f8	memcg: simplify charging kmem pages Charging kmem pages proceeds in two steps. First, we try to charge the allocation size to the memcg the current task belongs to, then we allocate a page and "commit" the charge storing the pointer to the memcg in the page struct. Such a design looks overcomplicated, because there is not much sense in trying charging the allocation before actually allocating a page: we won't be able to consume much memory over the limit even if we charge after doing the actual allocation, besides we already charge user pages post factum, so being pedantic with kmem pages just looks pointless. So this patch simplifies the design by merging the "charge" and the "commit" steps into the same function, which takes the allocated page. Also, rename the charge and uncharge methods to memcg_kmem_charge and memcg_kmem_uncharge and make the charge method return error code instead of bool to conform to mem_cgroup_try_charge. Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-11-05 19:34:48 -08:00
Jerome Marchand	3608de0787	mm/memcontrol.c: fix order calculation in try_charge() Since commit `6539cc0538` ("mm: memcontrol: fold mem_cgroup_do_charge()"), the order to pass to mem_cgroup_oom() is calculated by passing the number of pages to get_order() instead of the expected size in bytes. AFAICT, it only affects the value displayed in the oom warning message. This patch fix this. Michal said: : We haven't noticed that just because the OOM is enabled only for page : faults of order-0 (single page) and get_order work just fine. Thanks for : noticing this. If we ever start triggering OOM on different orders this : would be broken. Signed-off-by: Jerome Marchand <jmarchan@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-11-05 19:34:48 -08:00
Tejun Heo	10d53c748b	memcg: ratify and consolidate over-charge handling try_charge() is the main charging logic of memcg. When it hits the limit but either can't fail the allocation due to __GFP_NOFAIL or the task is likely to free memory very soon, being OOM killed, has SIGKILL pending or exiting, it "bypasses" the charge to the root memcg and returns -EINTR. While this is one approach which can be taken for these situations, it has several issues. * It unnecessarily lies about the reality. The number itself doesn't go over the limit but the actual usage does. memcg is either forced to or actively chooses to go over the limit because that is the right behavior under the circumstances, which is completely fine, but, if at all avoidable, it shouldn't be misrepresenting what's happening by sneaking the charges into the root memcg. * Despite trying, we already do over-charge. kmemcg can't deal with switching over to the root memcg by the point try_charge() returns -EINTR, so it open-codes over-charing. * It complicates the callers. Each try_charge() user has to handle the weird -EINTR exception. memcg_charge_kmem() does the manual over-charging. mem_cgroup_do_precharge() performs unnecessary uncharging of root memcg, which BTW is inconsistent with what memcg_charge_kmem() does but not broken as [un]charging are noops on root memcg. mem_cgroup_try_charge() needs to switch the returned cgroup to the root one. The reality is that in memcg there are cases where we are forced and/or willing to go over the limit. Each such case needs to be scrutinized and justified but there definitely are situations where that is the right thing to do. We alredy do this but with a superficial and inconsistent disguise which leads to unnecessary complications. This patch updates try_charge() so that it over-charges and returns 0 when deemed necessary. -EINTR return is removed along with all special case handling in the callers. While at it, remove the local variable @ret, which was initialized to zero and never changed, along with done: label which just returned the always zero @ret. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-11-05 19:34:48 -08:00
Tejun Heo	b23afb93d3	memcg: punt high overage reclaim to return-to-userland path Currently, try_charge() tries to reclaim memory synchronously when the high limit is breached; however, if the allocation doesn't have __GFP_WAIT, synchronous reclaim is skipped. If a process performs only speculative allocations, it can blow way past the high limit. This is actually easily reproducible by simply doing "find /". slab/slub allocator tries speculative allocations first, so as long as there's memory which can be consumed without blocking, it can keep allocating memory regardless of the high limit. This patch makes try_charge() always punt the over-high reclaim to the return-to-userland path. If try_charge() detects that high limit is breached, it adds the overage to current->memcg_nr_pages_over_high and schedules execution of mem_cgroup_handle_over_high() which performs synchronous reclaim from the return-to-userland path. As long as kernel doesn't have a run-away allocation spree, this should provide enough protection while making kmemcg behave more consistently. It also has the following benefits. - All over-high reclaims can use GFP_KERNEL regardless of the specific gfp mask in use, e.g. GFP_NOFS, when the limit was breached. - It copes with prio inversion. Previously, a low-prio task with small memory.high might perform over-high reclaim with a bunch of locks held. If a higher prio task needed any of these locks, it would have to wait until the low prio task finished reclaim and released the locks. By handing over-high reclaim to the task exit path this issue can be avoided. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Michal Hocko <mhocko@kernel.org> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-11-05 19:34:48 -08:00
Tejun Heo	626ebc4100	memcg: flatten task_struct->memcg_oom task_struct->memcg_oom is a sub-struct containing fields which are used for async memcg oom handling. Most task_struct fields aren't packaged this way and it can lead to unnecessary alignment paddings. This patch flattens it. * task.memcg_oom.memcg -> task.memcg_in_oom * task.memcg_oom.gfp_mask -> task.memcg_oom_gfp_mask * task.memcg_oom.order -> task.memcg_oom_order * task.memcg_oom.may_oom -> task.memcg_may_oom In addition, task.memcg_may_oom is relocated to where other bitfields are which reduces the size of task_struct. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-11-05 19:34:48 -08:00
Linus Torvalds	69234acee5	Merge branch 'for-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: "The cgroup core saw several significant updates this cycle: - percpu_rwsem for threadgroup locking is reinstated. This was temporarily dropped due to down_write latency issues. Oleg's rework of percpu_rwsem which is scheduled to be merged in this merge window resolves the issue. - On the v2 hierarchy, when controllers are enabled and disabled, all operations are atomic and can fail and revert cleanly. This allows ->can_attach() failure which is necessary for cpu RT slices. - Tasks now stay associated with the original cgroups after exit until released. This allows tracking resources held by zombies (e.g. pids) and makes it easy to find out where zombies came from on the v2 hierarchy. The pids controller was broken before these changes as zombies escaped the limits; unfortunately, updating this behavior required too many invasive changes and I don't think it's a good idea to backport them, so the pids controller on 4.3, the first version which included the pids controller, will stay broken at least until I'm sure about the cgroup core changes. - Optimization of a couple common tests using static_key" * 'for-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (38 commits) cgroup: fix race condition around termination check in css_task_iter_next() blkcg: don't create "io.stat" on the root cgroup cgroup: drop cgroup__DEVEL__legacy_files_on_dfl cgroup: replace error handling in cgroup_init() with WARN_ON()s cgroup: add cgroup_subsys->free() method and use it to fix pids controller cgroup: keep zombies associated with their original cgroups cgroup: make css_set_rwsem a spinlock and rename it to css_set_lock cgroup: don't hold css_set_rwsem across css task iteration cgroup: reorganize css_task_iter functions cgroup: factor out css_set_move_task() cgroup: keep css_set and task lists in chronological order cgroup: make cgroup_destroy_locked() test cgroup_is_populated() cgroup: make css_sets pin the associated cgroups cgroup: relocate cgroup_[try]get/put() cgroup: move check_for_release() invocation cgroup: replace cgroup_has_tasks() with cgroup_is_populated() cgroup: make cgroup->nr_populated count the number of populated css_sets cgroup: remove an unused parameter from cgroup_task_migrate() cgroup: fix too early usage of static_branch_disable() cgroup: make cgroup_update_dfl_csses() migrate all target processes atomically ...	2015-11-05 14:51:32 -08:00
Linus Torvalds	ea1ee5ff1b	Merge branch 'for-linus' of git://git.kernel.dk/linux-block Pull block layer fixes from Jens Axboe: "A final set of fixes for 4.3. It is (again) bigger than I would have liked, but it's all been through the testing mill and has been carefully reviewed by multiple parties. Each fix is either a regression fix for this cycle, or is marked stable. You can scold me at KS. The pull request contains: - Three simple fixes for NVMe, fixing regressions since 4.3. From Arnd, Christoph, and Keith. - A single xen-blkfront fix from Cathy, fixing a NULL dereference if an error is returned through the staste change callback. - Fixup for some bad/sloppy code in nbd that got introduced earlier in this cycle. From Markus Pargmann. - A blk-mq tagset use-after-free fix from Junichi. - A backing device lifetime fix from Tejun, fixing a crash. - And finally, a set of regression/stable fixes for cgroup writeback from Tejun" * 'for-linus' of git://git.kernel.dk/linux-block: writeback: remove broken rbtree_postorder_for_each_entry_safe() usage in cgwb_bdi_destroy() NVMe: Fix memory leak on retried commands block: don't release bdi while request_queue has live references nvme: use an integer value to Linux errno values blk-mq: fix use-after-free in blk_mq_free_tag_set() nvme: fix 32-bit build warning writeback: fix incorrect calculation of available memory for memcg domains writeback: memcg dirty_throttle_control should be initialized with wb->memcg_completions writeback: bdi_writeback iteration must not skip dying ones writeback: fix bdi_writeback iteration in wakeup_dirtytime_writeback() writeback: laptop_mode_timer_fn() needs rcu_read_lock() around bdi_writeback iteration nbd: Add locking for tasks xen-blkfront: check for null drvdata in blkback_changed (XenbusStateClosing)	2015-10-24 07:20:57 +09:00
Shaohua Li	424cdc1413	memcg: convert threshold to bytes page_counter_memparse() returns pages for the threshold, while mem_cgroup_usage() returns bytes for memory usage. Convert the threshold to bytes. Fixes: `3e32cb2e0a` ("memcg: rename cgroup_event to mem_cgroup_event"). Signed-off-by: Shaohua Li <shli@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-10-16 11:42:28 -07:00
Tejun Heo	27bd4dbb8d	cgroup: replace cgroup_has_tasks() with cgroup_is_populated() Currently, cgroup_has_tasks() tests whether the target cgroup has any css_set linked to it. This works because a css_set's refcnt converges with the number of tasks linked to it and thus there's no css_set linked to a cgroup if it doesn't have any live tasks. To help tracking resource usage of zombie tasks, putting the ref of css_set will be separated from disassociating the task from the css_set which means that a cgroup may have css_sets linked to it even when it doesn't have any live tasks. This patch replaces cgroup_has_tasks() with cgroup_is_populated() which tests cgroup->nr_populated instead which locally counts the number of populated css_sets. Unlike cgroup_has_tasks(), cgroup_is_populated() is recursive - if any of the descendants is populated, the cgroup is populated too. While this changes the meaning of the test, all the existing users are okay with the change. While at it, replace the open-coded ->populated_cnt test in cgroup_events_show() with cgroup_is_populated(). Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Li Zefan <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org>	2015-10-15 16:41:50 -04:00
Tejun Heo	c5edf9cdc4	writeback: fix incorrect calculation of available memory for memcg domains For memcg domains, the amount of available memory was calculated as min(the amount currently in use + headroom according to memcg, total clean memory) This isn't quite correct as what should be capped by the amount of clean memory is the headroom, not the sum of memory in use and headroom. For example, if a memcg domain has a significant amount of dirty memory, the above can lead to a value which is lower than the current amount in use which doesn't make much sense. In most circumstances, the above leads to a number which is somewhat but not drastically lower. As the amount of memory which can be readily allocated to the memcg domain is capped by the amount of system-wide clean memory which is not already assigned to the memcg itself, the number we want is the amount currently in use + min(headroom according to memcg, clean memory elsewhere in the system) This patch updates mem_cgroup_wb_stats() to return the number of filepages and headroom instead of the calculated available pages. mdtc_cap_avail() is renamed to mdtc_calc_avail() and performs the above calculation from file, headroom, dirty and globally clean pages. v2: Dummy mem_cgroup_wb_stats() implementation wasn't updated leading to build failure when !CGROUP_WRITEBACK. Fixed. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: `c2aa723a60` ("writeback: implement memcg writeback domain based throttling") Signed-off-by: Jens Axboe <axboe@fb.com>	2015-10-12 10:31:13 -06:00
Greg Thelen	ef510194ce	memcg: remove pcp_counter_lock Commit `733a572e66` ("memcg: make mem_cgroup_read_{stat\|event}() iterate possible cpus instead of online") removed the last use of the per memcg pcp_counter_lock but forgot to remove the variable. Kill the vestigial variable. Signed-off-by: Greg Thelen <gthelen@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-10-01 21:42:35 -04:00
Greg Thelen	484ebb3b8c	memcg: make mem_cgroup_read_stat() unsigned mem_cgroup_read_stat() returns a page count by summing per cpu page counters. The summing is racy wrt. updates, so a transient negative sum is possible. Callers don't want negative values: - mem_cgroup_wb_stats() doesn't want negative nr_dirty or nr_writeback. This could confuse dirty throttling. - oom reports and memory.stat shouldn't show confusing negative usage. - tree_usage() already avoids negatives. Avoid returning negative page counts from mem_cgroup_read_stat() and convert it to unsigned. [akpm@linux-foundation.org: fix old typo while we're in there] Signed-off-by: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: <stable@vger.kernel.org> [4.2+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-10-01 21:42:35 -04:00
Tejun Heo	4530eddb59	cgroup, memcg, cpuset: implement cgroup_taskset_for_each_leader() It wasn't explicitly documented but, when a process is being migrated, cpuset and memcg depend on cgroup_taskset_first() returning the threadgroup leader; however, this approach is somewhat ghetto and would no longer work for the planned multi-process migration. This patch introduces explicit cgroup_taskset_for_each_leader() which iterates over only the threadgroup leaders and replaces cgroup_taskset_first() usages for accessing the leader with it. This prepares both memcg and cpuset for multi-process migration. This patch also updates the documentation for cgroup_taskset_for_each() to clarify the iteration rules and removes comments mentioning task ordering in tasksets. v2: A previous patch which added threadgroup leader test was dropped. Patch updated accordingly. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Zefan Li <lizefan@huawei.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org>	2015-09-22 12:46:53 -04:00
Tejun Heo	472912a2b5	memcg: generate file modified notifications on "memory.events" cgroup core only recently grew generic notification support. Wire up "memory.events" so that it triggers a file modified event whenever its content changes. v2: Refreshed on top of mem_cgroup relocation. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Michal Hocko <mhocko@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Li Zefan <lizefan@huawei.com>	2015-09-21 15:14:47 -04:00
Tejun Heo	7dbdb199d3	cgroup: replace cftype->mode with CFTYPE_WORLD_WRITABLE cftype->mode allows controllers to give arbitrary permissions to interface knobs. Except for "cgroup.event_control", the existing uses are spurious. * Some explicitly specify S_IRUGO \| S_IWUSR even though that's the default. * "cpuset.memory_pressure" specifies S_IRUGO while also setting a write callback which returns -EACCES. All it needs to do is simply not setting a write callback. "cgroup.event_control" uses cftype->mode to make the file world-writable. It's a misdesigned interface and we don't want controllers to be tweaking interface file permissions in general. This patch removes cftype->mode and all its spurious uses and implements CFTYPE_WORLD_WRITABLE for "cgroup.event_control" which is marked as compatibility-only. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Li Zefan <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org>	2015-09-18 17:54:23 -04:00
Tejun Heo	9e10a130d9	cgroup: replace cgroup_on_dfl() tests in controllers with cgroup_subsys_on_dfl() cgroup_on_dfl() tests whether the cgroup's root is the default hierarchy; however, an individual controller is only interested in whether the controller is attached to the default hierarchy and never tests a cgroup which doesn't belong to the hierarchy that the controller is attached to. This patch replaces cgroup_on_dfl() tests in controllers with faster static_key based cgroup_subsys_on_dfl(). This leaves cgroup core as the only user of cgroup_on_dfl() and the function is moved from the header file to cgroup.c. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Zefan Li <lizefan@huawei.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org>	2015-09-18 11:56:28 -04:00
Vladimir Davydov	e993d905c8	memcg: zap try_get_mem_cgroup_from_page It is only used in mem_cgroup_try_charge, so fold it in and zap it. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Greg Thelen <gthelen@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-09-10 13:29:01 -07:00
Vladimir Davydov	2fc0452470	memcg: add page_cgroup_ino helper This patchset introduces a new user API for tracking user memory pages that have not been used for a given period of time. The purpose of this is to provide the userspace with the means of tracking a workload's working set, i.e. the set of pages that are actively used by the workload. Knowing the working set size can be useful for partitioning the system more efficiently, e.g. by tuning memory cgroup limits appropriately, or for job placement within a compute cluster. ==== USE CASES ==== The unified cgroup hierarchy has memory.low and memory.high knobs, which are defined as the low and high boundaries for the workload working set size. However, the working set size of a workload may be unknown or change in time. With this patch set, one can periodically estimate the amount of memory unused by each cgroup and tune their memory.low and memory.high parameters accordingly, therefore optimizing the overall memory utilization. Another use case is balancing workloads within a compute cluster. Knowing how much memory is not really used by a workload unit may help take a more optimal decision when considering migrating the unit to another node within the cluster. Also, as noted by Minchan, this would be useful for per-process reclaim (https://lwn.net/Articles/545668/). With idle tracking, we could reclaim idle pages only by smart user memory manager. ==== USER API ==== The user API consists of two new files: * /sys/kernel/mm/page_idle/bitmap. This file implements a bitmap where each bit corresponds to a page, indexed by PFN. When the bit is set, the corresponding page is idle. A page is considered idle if it has not been accessed since it was marked idle. To mark a page idle one should set the bit corresponding to the page by writing to the file. A value written to the file is OR-ed with the current bitmap value. Only user memory pages can be marked idle, for other page types input is silently ignored. Writing to this file beyond max PFN results in the ENXIO error. Only available when CONFIG_IDLE_PAGE_TRACKING is set. This file can be used to estimate the amount of pages that are not used by a particular workload as follows: 1. mark all pages of interest idle by setting corresponding bits in the /sys/kernel/mm/page_idle/bitmap 2. wait until the workload accesses its working set 3. read /sys/kernel/mm/page_idle/bitmap and count the number of bits set * /proc/kpagecgroup. This file contains a 64-bit inode number of the memory cgroup each page is charged to, indexed by PFN. Only available when CONFIG_MEMCG is set. This file can be used to find all pages (including unmapped file pages) accounted to a particular cgroup. Using /sys/kernel/mm/page_idle/bitmap, one can then estimate the cgroup working set size. For an example of using these files for estimating the amount of unused memory pages per each memory cgroup, please see the script attached below. ==== REASONING ==== The reason to introduce the new user API instead of using /proc/PID/{clear_refs,smaps} is that the latter has two serious drawbacks: - it does not count unmapped file pages - it affects the reclaimer logic The new API attempts to overcome them both. For more details on how it is achieved, please see the comment to patch 6. ==== PATCHSET STRUCTURE ==== The patch set is organized as follows: - patch 1 adds page_cgroup_ino() helper for the sake of /proc/kpagecgroup and patches 2-3 do related cleanup - patch 4 adds /proc/kpagecgroup, which reports cgroup ino each page is charged to - patch 5 introduces a new mmu notifier callback, clear_young, which is a lightweight version of clear_flush_young; it is used in patch 6 - patch 6 implements the idle page tracking feature, including the userspace API, /sys/kernel/mm/page_idle/bitmap - patch 7 exports idle flag via /proc/kpageflags ==== SIMILAR WORKS ==== Originally, the patch for tracking idle memory was proposed back in 2011 by Michel Lespinasse (see http://lwn.net/Articles/459269/). The main difference between Michel's patch and this one is that Michel implemented a kernel space daemon for estimating idle memory size per cgroup while this patch only provides the userspace with the minimal API for doing the job, leaving the rest up to the userspace. However, they both share the same idea of Idle/Young page flags to avoid affecting the reclaimer logic. ==== PERFORMANCE EVALUATION ==== SPECjvm2008 (https://www.spec.org/jvm2008/) was used to evaluate the performance impact introduced by this patch set. Three runs were carried out: - base: kernel without the patch - patched: patched kernel, the feature is not used - patched-active: patched kernel, 1 minute-period daemon is used for tracking idle memory For tracking idle memory, idlememstat utility was used: https://github.com/locker/idlememstat testcase base patched patched-active compiler 537.40 ( 0.00)% 532.26 (-0.96)% 538.31 ( 0.17)% compress 305.47 ( 0.00)% 301.08 (-1.44)% 300.71 (-1.56)% crypto 284.32 ( 0.00)% 282.21 (-0.74)% 284.87 ( 0.19)% derby 411.05 ( 0.00)% 413.44 ( 0.58)% 412.07 ( 0.25)% mpegaudio 189.96 ( 0.00)% 190.87 ( 0.48)% 189.42 (-0.28)% scimark.large 46.85 ( 0.00)% 46.41 (-0.94)% 47.83 ( 2.09)% scimark.small 412.91 ( 0.00)% 415.41 ( 0.61)% 421.17 ( 2.00)% serial 204.23 ( 0.00)% 213.46 ( 4.52)% 203.17 (-0.52)% startup 36.76 ( 0.00)% 35.49 (-3.45)% 35.64 (-3.05)% sunflow 115.34 ( 0.00)% 115.08 (-0.23)% 117.37 ( 1.76)% xml 620.55 ( 0.00)% 619.95 (-0.10)% 620.39 (-0.03)% composite 211.50 ( 0.00)% 211.15 (-0.17)% 211.67 ( 0.08)% time idlememstat: 17.20user 65.16system 2:15:23elapsed 1%CPU (0avgtext+0avgdata 8476maxresident)k 448inputs+40outputs (1major+36052minor)pagefaults 0swaps ==== SCRIPT FOR COUNTING IDLE PAGES PER CGROUP ==== #! /usr/bin/python # import os import stat import errno import struct CGROUP_MOUNT = "/sys/fs/cgroup/memory" BUFSIZE = 8 * 1024 # must be multiple of 8 def get_hugepage_size(): with open("/proc/meminfo", "r") as f: for s in f: k, v = s.split(":") if k == "Hugepagesize": return int(v.split()[0]) * 1024 PAGE_SIZE = os.sysconf("SC_PAGE_SIZE") HUGEPAGE_SIZE = get_hugepage_size() def set_idle(): f = open("/sys/kernel/mm/page_idle/bitmap", "wb", BUFSIZE) while True: try: f.write(struct.pack("Q", pow(2, 64) - 1)) except IOError as err: if err.errno == errno.ENXIO: break raise f.close() def count_idle(): f_flags = open("/proc/kpageflags", "rb", BUFSIZE) f_cgroup = open("/proc/kpagecgroup", "rb", BUFSIZE) with open("/sys/kernel/mm/page_idle/bitmap", "rb", BUFSIZE) as f: while f.read(BUFSIZE): pass # update idle flag idlememsz = {} while True: s1, s2 = f_flags.read(8), f_cgroup.read(8) if not s1 or not s2: break flags, = struct.unpack('Q', s1) cgino, = struct.unpack('Q', s2) unevictable = (flags >> 18) & 1 huge = (flags >> 22) & 1 idle = (flags >> 25) & 1 if idle and not unevictable: idlememsz[cgino] = idlememsz.get(cgino, 0) + \ (HUGEPAGE_SIZE if huge else PAGE_SIZE) f_flags.close() f_cgroup.close() return idlememsz if __name__ == "__main__": print "Setting the idle flag for each page..." set_idle() raw_input("Wait until the workload accesses its working set, " "then press Enter") print "Counting idle pages..." idlememsz = count_idle() for dir, subdirs, files in os.walk(CGROUP_MOUNT): ino = os.stat(dir)[stat.ST_INO] print dir + ": " + str(idlememsz.get(ino, 0) / 1024) + " kB" ==== END SCRIPT ==== This patch (of 8): Add page_cgroup_ino() helper to memcg. This function returns the inode number of the closest online ancestor of the memory cgroup a page is charged to. It is required for exporting information about which page is charged to which cgroup to userspace, which will be introduced by a following patch. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Greg Thelen <gthelen@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-09-10 13:29:01 -07:00
Michal Hocko	e752eb6881	memcg: move memcg_proto_active from sock.h The only user is sock_update_memcg which is living in memcontrol.c so it doesn't make much sense to pollute sock.h by this inline helper. Move it to memcontrol.c and open code it into its only caller. Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-09-08 15:35:28 -07:00
Michal Hocko	a03f1f0589	memcg, tcp_kmem: check for cg_proto in sock_update_memcg sk_prot->proto_cgroup is allowed to return NULL but sock_update_memcg doesn't check for NULL. The function relies on the mem_cgroup_is_root check because we shouldn't get NULL otherwise because mem_cgroup_from_task will always return !NULL. All other callers are checking for NULL and we can safely replace mem_cgroup_is_root() check by cg_proto != NULL which will be more straightforward (proto_cgroup returns NULL for the root memcg already). Signed-off-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-09-08 15:35:28 -07:00
Tejun Heo	9f2115f93b	memcg: restructure mem_cgroup_can_attach() Restructure it to lower nesting level and help the planned threadgroup leader iteration changes. This is pure reorganization. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-09-08 15:35:28 -07:00
Michal Hocko	33398cf2f3	memcg: export struct mem_cgroup mem_cgroup structure is defined in mm/memcontrol.c currently which means that the code outside of this file has to use external API even for trivial access stuff. This patch exports mm_struct with its dependencies and makes some of the exported functions inlines. This even helps to reduce the code size a bit (make defconfig + CONFIG_MEMCG=y) text data bss dec hex filename 12355346 1823792 1089536 15268674 e8fb42 vmlinux.before 12354970 1823792 1089536 15268298 e8f9ca vmlinux.after This is not much (370B) but better than nothing. We also save a function call in some hot paths like callers of mem_cgroup_count_vm_event which is used for accounting. The patch doesn't introduce any functional changes. [vdavykov@parallels.com: inline memcg_kmem_is_active] [vdavykov@parallels.com: do not expose type outside of CONFIG_MEMCG] [akpm@linux-foundation.org: memcontrol.h needs eventfd.h for eventfd_ctx] [akpm@linux-foundation.org: export mem_cgroup_from_task() to modules] Signed-off-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-09-08 15:35:28 -07:00
David Rientjes	54e9e29132	mm, oom: pass an oom order of -1 when triggered by sysrq The force_kill member of struct oom_control isn't needed if an order of -1 is used instead. This is the same as order == -1 in struct compact_control which requires full memory compaction. This patch introduces no functional change. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-09-08 15:35:28 -07:00
David Rientjes	6e0fc46dc2	mm, oom: organize oom context into struct There are essential elements to an oom context that are passed around to multiple functions. Organize these elements into a new struct, struct oom_control, that specifies the context for an oom condition. This patch introduces no functional change. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-09-08 15:35:28 -07:00
Sebastian Andrzej Siewior	ce9ce6659a	mm: memcontrol: bring back the VM_BUG_ON() in mem_cgroup_swapout() Clark stumbled over a VM_BUG_ON() in -RT which was then was removed by Johannes in commit `f371763a79` ("mm: memcontrol: fix false-positive VM_BUG_ON() on -rt"). The comment before that patch was a tiny bit better than it is now. While the patch claimed to fix a false-postive on -RT this was not the case. None of the -RT folks ACKed it and it was not a false positive report. That was a real problem. This patch updates the comment that is improper because it refers to "disabled preemption" as a consequence of that lock being taken. A spin_lock() disables preemption, true, but in this case the code relies on the fact that the lock _also_ disables interrupts once it is acquired. And this is the important detail (which was checked the VM_BUG_ON()) which needs to be pointed out. This is the hint one needs while looking at the code. It was explained by Johannes on the list that the per-CPU variables are protected by local_irq_save(). The BUG_ON() was helpful. This code has been workarounded in -RT in the meantime. I wouldn't mind running into more of those if the code in question uses special kind of locking since now there is no verification (in terms of lockdep or BUG_ON()) and therefore I bring the VM_BUG_ON() check back in. The two functions after the comment could also have a "local_irq_save()" dance around them in order to serialize access to the per-CPU variables. This has been avoided because the interrupts should be off. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Clark Williams <williams@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-09-04 16:54:41 -07:00

1 2 3 4 5 ...