Commit graph

902 commits

Author SHA1 Message Date
Trilok Soni
5ab1e18aa3 Revert "Merge remote-tracking branch 'msm-4.4/tmp-510d0a3f' into msm-4.4"
This reverts commit 9d6fd2c3e9 ("Merge remote-tracking branch
'msm-4.4/tmp-510d0a3f' into msm-4.4"), because it breaks the
dump parsing tools due to kernel can be loaded anywhere in the memory
now and not fixed at linear mapping.

Change-Id: Id416f0a249d803442847d09ac47781147b0d0ee6
Signed-off-by: Trilok Soni <tsoni@codeaurora.org>
2016-08-26 14:34:05 -07:00
Trilok Soni
9d6fd2c3e9 Merge remote-tracking branch 'msm-4.4/tmp-510d0a3f' into msm-4.4
* msm-4.4/tmp-510d0a3f:
  Linux 4.4.11
  nf_conntrack: avoid kernel pointer value leak in slab name
  drm/radeon: fix DP link training issue with second 4K monitor
  drm/i915/bdw: Add missing delay during L3 SQC credit programming
  drm/i915: Bail out of pipe config compute loop on LPT
  drm/radeon: fix PLL sharing on DCE6.1 (v2)
  Revert "[media] videobuf2-v4l2: Verify planes array in buffer dequeueing"
  Input: max8997-haptic - fix NULL pointer dereference
  get_rock_ridge_filename(): handle malformed NM entries
  tools lib traceevent: Do not reassign parg after collapse_tree()
  qla1280: Don't allocate 512kb of host tags
  atomic_open(): fix the handling of create_error
  regulator: axp20x: Fix axp22x ldo_io voltage ranges
  regulator: s2mps11: Fix invalid selector mask and voltages for buck9
  workqueue: fix rebind bound workers warning
  ARM: dts: at91: sam9x5: Fix the memory range assigned to the PMC
  vfs: rename: check backing inode being equal
  vfs: add vfs_select_inode() helper
  perf/core: Disable the event on a truncated AUX record
  regmap: spmi: Fix regmap_spmi_ext_read in multi-byte case
  pinctrl: at91-pio4: fix pull-up/down logic
  spi: spi-ti-qspi: Handle truncated frames properly
  spi: spi-ti-qspi: Fix FLEN and WLEN settings if bits_per_word is overridden
  spi: pxa2xx: Do not detect number of enabled chip selects on Intel SPT
  ALSA: hda - Fix broken reconfig
  ALSA: hda - Fix white noise on Asus UX501VW headset
  ALSA: hda - Fix subwoofer pin on ASUS N751 and N551
  ALSA: usb-audio: Yet another Phoneix Audio device quirk
  ALSA: usb-audio: Quirk for yet another Phoenix Audio devices (v2)
  crypto: testmgr - Use kmalloc memory for RSA input
  crypto: hash - Fix page length clamping in hash walk
  crypto: qat - fix invalid pf2vf_resp_wq logic
  s390/mm: fix asce_bits handling with dynamic pagetable levels
  zsmalloc: fix zs_can_compact() integer overflow
  ocfs2: fix posix_acl_create deadlock
  ocfs2: revert using ocfs2_acl_chmod to avoid inode cluster lock hang
  net/route: enforce hoplimit max value
  tcp: refresh skb timestamp at retransmit time
  net: thunderx: avoid exposing kernel stack
  net: fix a kernel infoleak in x25 module
  uapi glibc compat: fix compile errors when glibc net/if.h included before linux/if.h MIME-Version: 1.0
  bridge: fix igmp / mld query parsing
  net: bridge: fix old ioctl unlocked net device walk
  VSOCK: do not disconnect socket when peer has shutdown SEND only
  net/mlx4_en: Fix endianness bug in IPV6 csum calculation
  net: fix infoleak in rtnetlink
  net: fix infoleak in llc
  net: fec: only clear a queue's work bit if the queue was emptied
  netem: Segment GSO packets on enqueue
  sch_dsmark: update backlog as well
  sch_htb: update backlog as well
  net_sched: update hierarchical backlog too
  net_sched: introduce qdisc_replace() helper
  gre: do not pull header in ICMP error processing
  net: Implement net_dbg_ratelimited() for CONFIG_DYNAMIC_DEBUG case
  samples/bpf: fix trace_output example
  bpf: fix check_map_func_compatibility logic
  bpf: fix refcnt overflow
  bpf: fix double-fdput in replace_map_fd_with_map_ptr()
  net/mlx4_en: fix spurious timestamping callbacks
  ipv4/fib: don't warn when primary address is missing if in_dev is dead
  net/mlx5e: Fix minimum MTU
  net/mlx5e: Device's mtu field is u16 and not int
  openvswitch: use flow protocol when recalculating ipv6 checksums
  atl2: Disable unimplemented scatter/gather feature
  vlan: pull on __vlan_insert_tag error path and fix csum correction
  net: use skb_postpush_rcsum instead of own implementations
  cdc_mbim: apply "NDP to end" quirk to all Huawei devices
  bpf/verifier: reject invalid LD_ABS | BPF_DW instruction
  net: sched: do not requeue a NULL skb
  packet: fix heap info leak in PACKET_DIAG_MCLIST sock_diag interface
  route: do not cache fib route info on local routes with oif
  decnet: Do not build routes to devices without decnet private data.
  parisc: Use generic extable search and sort routines
  arm64: kasan: Use actual memory node when populating the kernel image shadow
  arm64: mm: treat memstart_addr as a signed quantity
  arm64: lse: deal with clobbered IP registers after branch via PLT
  arm64: mm: check at build time that PAGE_OFFSET divides the VA space evenly
  arm64: kasan: Fix zero shadow mapping overriding kernel image shadow
  arm64: consistently use p?d_set_huge
  arm64: fix KASLR boot-time I-cache maintenance
  arm64: hugetlb: partial revert of 66b3923a1a0f
  arm64: make irq_stack_ptr more robust
  arm64: efi: invoke EFI_RNG_PROTOCOL to supply KASLR randomness
  efi: stub: use high allocation for converted command line
  efi: stub: add implementation of efi_random_alloc()
  efi: stub: implement efi_get_random_bytes() based on EFI_RNG_PROTOCOL
  arm64: kaslr: randomize the linear region
  arm64: add support for kernel ASLR
  arm64: add support for building vmlinux as a relocatable PIE binary
  arm64: switch to relative exception tables
  extable: add support for relative extables to search and sort routines
  scripts/sortextable: add support for ET_DYN binaries
  arm64: futex.h: Add missing PAN toggling
  arm64: make asm/elf.h available to asm files
  arm64: avoid dynamic relocations in early boot code
  arm64: avoid R_AARCH64_ABS64 relocations for Image header fields
  arm64: add support for module PLTs
  arm64: move brk immediate argument definitions to separate header
  arm64: mm: use bit ops rather than arithmetic in pa/va translations
  arm64: mm: only perform memstart_addr sanity check if DEBUG_VM
  arm64: User die() instead of panic() in do_page_fault()
  arm64: allow kernel Image to be loaded anywhere in physical memory
  arm64: defer __va translation of initrd_start and initrd_end
  arm64: move kernel image to base of vmalloc area
  arm64: kvm: deal with kernel symbols outside of linear mapping
  arm64: decouple early fixmap init from linear mapping
  arm64: pgtable: implement static [pte|pmd|pud]_offset variants
  arm64: introduce KIMAGE_VADDR as the virtual base of the kernel region
  arm64: add support for ioremap() block mappings
  arm64: prevent potential circular header dependencies in asm/bug.h
  of/fdt: factor out assignment of initrd_start/initrd_end
  of/fdt: make memblock minimum physical address arch configurable
  arm64: Remove the get_thread_info() function
  arm64: kernel: Don't toggle PAN on systems with UAO
  arm64: cpufeature: Test 'matches' pointer to find the end of the list
  arm64: kernel: Add support for User Access Override
  arm64: add ARMv8.2 id_aa64mmfr2 boiler plate
  arm64: cpufeature: Change read_cpuid() to use sysreg's mrs_s macro
  arm64: use local label prefixes for __reg_num symbols
  arm64: vdso: Mark vDSO code as read-only
  arm64: ubsan: select ARCH_HAS_UBSAN_SANITIZE_ALL
  arm64: ptdump: Indicate whether memory should be faulting
  arm64: Add support for ARCH_SUPPORTS_DEBUG_PAGEALLOC
  arm64: Drop alloc function from create_mapping
  arm64: prefetch: add missing #include for spin_lock_prefetch
  arm64: lib: patch in prfm for copy_page if requested
  arm64: lib: improve copy_page to deal with 128 bytes at a time
  arm64: prefetch: add alternative pattern for CPUs without a prefetcher
  arm64: prefetch: don't provide spin_lock_prefetch with LSE
  arm64: allow vmalloc regions to be set with set_memory_*
  arm64: kernel: implement ACPI parking protocol
  arm64: mm: create new fine-grained mappings at boot
  arm64: ensure _stext and _etext are page-aligned
  arm64: mm: allow passing a pgdir to alloc_init_*
  arm64: mm: allocate pagetables anywhere
  arm64: mm: use fixmap when creating page tables
  arm64: mm: add functions to walk tables in fixmap
  arm64: mm: add __{pud,pgd}_populate
  arm64: mm: avoid redundant __pa(__va(x))
  arm64: mm: add functions to walk page tables by PA
  arm64: mm: move pte_* macros
  arm64: kasan: avoid TLB conflicts
  arm64: mm: add code to safely replace TTBR1_EL1
  arm64: add function to install the idmap
  arm64: unmap idmap earlier
  arm64: unify idmap removal
  arm64: mm: place empty_zero_page in bss
  arm64: mm: specialise pagetable allocators
  asm-generic: Fix local variable shadow in __set_fixmap_offset
  Eliminate the .eh_frame sections from the aarch64 vmlinux and kernel modules
  arm64: Fix an enum typo in mm/dump.c
  arm64: kasan: ensure that the KASAN zero page is mapped read-only
  arch/arm64/include/asm/pgtable.h: add pmd_mkclean for THP
  arm64: hide __efistub_ aliases from kallsyms
  Linux 4.4.10
  drm/i915/skl: Fix DMC load on Skylake J0 and K0
  lib/test-string_helpers.c: fix and improve string_get_size() tests
  ACPI / processor: Request native thermal interrupt handling via _OSC
  drm/i915: Fake HDMI live status
  drm/i915: Make RPS EI/thresholds multiple of 25 on SNB-BDW
  drm/i915: Fix eDP low vswing for Broadwell
  drm/i915/ddi: Fix eDP VDD handling during booting and suspend/resume
  drm/radeon: make sure vertical front porch is at least 1
  iio: ak8975: fix maybe-uninitialized warning
  iio: ak8975: Fix NULL pointer exception on early interrupt
  drm/amdgpu: set metadata pointer to NULL after freeing.
  drm/amdgpu: make sure vertical front porch is at least 1
  gpu: ipu-v3: Fix imx-ipuv3-crtc module autoloading
  nvmem: mxs-ocotp: fix buffer overflow in read
  USB: serial: cp210x: add Straizona Focusers device ids
  USB: serial: cp210x: add ID for Link ECU
  ata: ahci-platform: Add ports-implemented DT bindings.
  libahci: save port map for forced port map
  powerpc: Fix bad inline asm constraint in create_zero_mask()
  ACPICA: Dispatcher: Update thread ID for recursive method calls
  x86/sysfb_efi: Fix valid BAR address range check
  ARC: Add missing io barriers to io{read,write}{16,32}be()
  ARM: cpuidle: Pass on arm_cpuidle_suspend()'s return value
  propogate_mnt: Handle the first propogated copy being a slave
  fs/pnode.c: treat zero mnt_group_id-s as unequal
  x86/tsc: Read all ratio bits from MSR_PLATFORM_INFO
  MAINTAINERS: Remove asterisk from EFI directory names
  writeback: Fix performance regression in wb_over_bg_thresh()
  batman-adv: Reduce refcnt of removed router when updating route
  batman-adv: Fix broadcast/ogm queue limit on a removed interface
  batman-adv: Check skb size before using encapsulated ETH+VLAN header
  batman-adv: fix DAT candidate selection (must use vid)
  mm: update min_free_kbytes from khugepaged after core initialization
  proc: prevent accessing /proc/<PID>/environ until it's ready
  Input: zforce_ts - fix dual touch recognition
  HID: Fix boot delay for Creative SB Omni Surround 5.1 with quirk
  HID: wacom: Add support for DTK-1651
  xen/evtchn: fix ring resize when binding new events
  xen/balloon: Fix crash when ballooning on x86 32 bit PAE
  xen: Fix page <-> pfn conversion on 32 bit systems
  ARM: SoCFPGA: Fix secondary CPU startup in thumb2 kernel
  ARM: EXYNOS: Properly skip unitialized parent clock in power domain on
  mm/zswap: provide unique zpool name
  mm, cma: prevent nr_isolated_* counters from going negative
  Minimal fix-up of bad hashing behavior of hash_64()
  MD: make bio mergeable
  tracing: Don't display trigger file for events that can't be enabled
  mac80211: fix statistics leak if dev_alloc_name() fails
  ath9k: ar5008_hw_cmn_spur_mitigate: add missing mask_m & mask_p initialisation
  lpfc: fix misleading indentation
  clk: qcom: msm8960: Fix ce3_src register offset
  clk: versatile: sp810: support reentrance
  clk: qcom: msm8960: fix ce3_core clk enable register
  clk: meson: Fix meson_clk_register_clks() signature type mismatch
  clk: rockchip: free memory in error cases when registering clock branches
  soc: rockchip: power-domain: fix err handle while probing
  clk-divider: make sure read-only dividers do not write to their register
  CNS3xxx: Fix PCI cns3xxx_write_config()
  mwifiex: fix corner case association failure
  ata: ahci_xgene: dereferencing uninitialized pointer in probe
  nbd: ratelimit error msgs after socket close
  mfd: intel-lpss: Remove clock tree on error path
  ipvs: drop first packet to redirect conntrack
  ipvs: correct initial offset of Call-ID header search in SIP persistence engine
  ipvs: handle ip_vs_fill_iph_skb_off failure
  RDMA/iw_cxgb4: Fix bar2 virt addr calculation for T4 chips
  Revert: "powerpc/tm: Check for already reclaimed tasks"
  arm64: head.S: use memset to clear BSS
  efi: stub: define DISABLE_BRANCH_PROFILING for all architectures
  arm64: entry: remove pointless SPSR mode check
  arm64: mm: move pgd_cache initialisation to pgtable_cache_init
  arm64: module: avoid undefined shift behavior in reloc_data()
  arm64: module: fix relocation of movz instruction with negative immediate
  arm64: traps: address fallout from printk -> pr_* conversion
  arm64: ftrace: fix a stack tracer's output under function graph tracer
  arm64: pass a task parameter to unwind_frame()
  arm64: ftrace: modify a stack frame in a safe way
  arm64: remove irq_count and do_softirq_own_stack()
  arm64: hugetlb: add support for PTE contiguous bit
  arm64: Use PoU cache instr for I/D coherency
  arm64: Defer dcache flush in __cpu_copy_user_page
  arm64: reduce stack use in irq_handler
  arm64: Documentation: add list of software workarounds for errata
  arm64: mm: place __cpu_setup in .text
  arm64: cmpxchg: Don't incldue linux/mmdebug.h
  arm64: mm: fold alternatives into .init
  arm64: Remove redundant padding from linker script
  arm64: mm: remove pointless PAGE_MASKing
  arm64: don't call C code with el0's fp register
  arm64: when walking onto the task stack, check sp & fp are in current->stack
  arm64: Add this_cpu_ptr() assembler macro for use in entry.S
  arm64: irq: fix walking from irq stack to task stack
  arm64: Add do_softirq_own_stack() and enable irq_stacks
  arm64: Modify stack trace and dump for use with irq_stack
  arm64: Store struct thread_info in sp_el0
  arm64: Add trace_hardirqs_off annotation in ret_to_user
  arm64: ftrace: fix the comments for ftrace_modify_code
  arm64: ftrace: stop using kstop_machine to enable/disable tracing
  arm64: spinlock: serialise spin_unlock_wait against concurrent lockers
  arm64: enable HAVE_IRQ_TIME_ACCOUNTING
  arm64: fix COMPAT_SHMLBA definition for large pages
  arm64: add __init/__initdata section marker to some functions/variables
  arm64: pgtable: implement pte_accessible()
  arm64: mm: allow sections for unaligned bases
  arm64: mm: detect bad __create_mapping uses
  Linux 4.4.9
  extcon: max77843: Use correct size for reading the interrupt register
  stm class: Select CONFIG_SRCU
  megaraid_sas: add missing curly braces in ioctl handler
  sunrpc/cache: drop reference when sunrpc_cache_pipe_upcall() detects a race
  thermal: rockchip: fix a impossible condition caused by the warning
  unbreak allmodconfig KCONFIG_ALLCONFIG=...
  jme: Fix device PM wakeup API usage
  jme: Do not enable NIC WoL functions on S0
  bus: imx-weim: Take the 'status' property value into account
  ARM: dts: pxa: fix dma engine node to pxa3xx-nand
  ARM: dts: armada-375: use armada-370-sata for SATA
  ARM: EXYNOS: select THERMAL_OF
  ARM: prima2: always enable reset controller
  ARM: OMAP3: Add cpuidle parameters table for omap3430
  ext4: fix races of writeback with punch hole and zero range
  ext4: fix races between buffered IO and collapse / insert range
  ext4: move unlocked dio protection from ext4_alloc_file_blocks()
  ext4: fix races between page faults and hole punching
  perf stat: Document --detailed option
  perf tools: handle spaces in file names obtained from /proc/pid/maps
  perf hists browser: Only offer symbol scripting when a symbol is under the cursor
  mtd: nand: Drop mtd.owner requirement in nand_scan
  mtd: brcmnand: Fix v7.1 register offsets
  mtd: spi-nor: remove micron_quad_enable()
  serial: sh-sci: Remove cpufreq notifier to fix crash/deadlock
  ext4: fix NULL pointer dereference in ext4_mark_inode_dirty()
  x86/mm/kmmio: Fix mmiotrace for hugepages
  perf evlist: Reference count the cpu and thread maps at set_maps()
  drivers/misc/ad525x_dpot: AD5274 fix RDAC read back errors
  rtc: max77686: Properly handle regmap_irq_get_virq() error code
  rtc: rx8025: remove rv8803 id
  rtc: ds1685: passing bogus values to irq_restore
  rtc: vr41xx: Wire up alarm_irq_enable
  rtc: hym8563: fix invalid year calculation
  PM / Domains: Fix removal of a subdomain
  PM / OPP: Initialize u_volt_min/max to a valid value
  misc: mic/scif: fix wrap around tests
  misc/bmp085: Enable building as a module
  lib/mpi: Endianness fix
  fbdev: da8xx-fb: fix videomodes of lcd panels
  scsi_dh: force modular build if SCSI is a module
  paride: make 'verbose' parameter an 'int' again
  regulator: s5m8767: fix get_register() error handling
  irqchip/mxs: Fix error check of of_io_request_and_map()
  irqchip/sunxi-nmi: Fix error check of of_io_request_and_map()
  spi/rockchip: Make sure spi clk is on in rockchip_spi_set_cs
  locking/mcs: Fix mcs_spin_lock() ordering
  regulator: core: Fix nested locking of supplies
  regulator: core: Ensure we lock all regulators
  regulator: core: fix regulator_lock_supply regression
  Revert "regulator: core: Fix nested locking of supplies"
  videobuf2-v4l2: Verify planes array in buffer dequeueing
  videobuf2-core: Check user space planes array in dqbuf
  USB: usbip: fix potential out-of-bounds write
  cgroup: make sure a parent css isn't freed before its children
  mm/hwpoison: fix wrong num_poisoned_pages accounting
  mm: vmscan: reclaim highmem zone if buffer_heads is over limit
  numa: fix /proc/<pid>/numa_maps for THP
  mm/huge_memory: replace VM_NO_THP VM_BUG_ON with actual VMA check
  memcg: relocate charge moving from ->attach to ->post_attach
  cgroup, cpuset: replace cpuset_post_attach_flush() with cgroup_subsys->post_attach callback
  slub: clean up code for kmem cgroup support to kmem_cache_free_bulk
  workqueue: fix ghost PENDING flag while doing MQ IO
  x86/apic: Handle zero vector gracefully in clear_vector_irq()
  efi: Expose non-blocking set_variable() wrapper to efivars
  efi: Fix out-of-bounds read in variable_matches()
  IB/security: Restrict use of the write() interface
  IB/mlx5: Expose correct max_sge_rd limit
  cxl: Keep IRQ mappings on context teardown
  v4l2-dv-timings.h: fix polarity for 4k formats
  vb2-memops: Fix over allocation of frame vectors
  ASoC: rt5640: Correct the digital interface data select
  ASoC: dapm: Make sure we have a card when displaying component widgets
  ASoC: ssm4567: Reset device before regcache_sync()
  ASoC: s3c24xx: use const snd_soc_component_driver pointer
  EDAC: i7core, sb_edac: Don't return NOTIFY_BAD from mce_decoder callback
  toshiba_acpi: Fix regression caused by hotkey enabling value
  i2c: exynos5: Fix possible ABBA deadlock by keeping I2C clock prepared
  i2c: cpm: Fix build break due to incompatible pointer types
  perf intel-pt: Fix segfault tracing transactions
  drm/i915: Use fw_domains_put_with_fifo() on HSW
  drm/i915: Fixup the free space logic in ring_prepare
  drm/amdkfd: uninitialized variable in dbgdev_wave_control_set_registers()
  drm/i915: skl_update_scaler() wants a rotation bitmask instead of bit number
  drm/i915: Cleanup phys status page too
  pwm: brcmstb: Fix check of devm_ioremap_resource() return code
  drm/dp/mst: Get validated port ref in drm_dp_update_payload_part1()
  drm/dp/mst: Restore primary hub guid on resume
  drm/dp/mst: Validate port in drm_dp_payload_send_msg()
  drm/nouveau/gr/gf100: select a stream master to fixup tfb offset queries
  drm: Loongson-3 doesn't fully support wc memory
  drm/radeon: fix vertical bars appear on monitor (v2)
  drm/radeon: forbid mapping of userptr bo through radeon device file
  drm/radeon: fix initial connector audio value
  drm/radeon: add a quirk for a XFX R9 270X
  drm/amdgpu: fix regression on CIK (v2)
  amdgpu/uvd: add uvd fw version for amdgpu
  drm/amdgpu: bump the afmt limit for CZ, ST, Polaris
  drm/amdgpu: use defines for CRTCs and AMFT blocks
  drm/amdgpu: when suspending, if uvd/vce was running. need to cancel delay work.
  iommu/dma: Restore scatterlist offsets correctly
  iommu/amd: Fix checking of pci dma aliases
  pinctrl: single: Fix pcs_parse_bits_in_pinctrl_entry to use __ffs than ffs
  pinctrl: mediatek: correct debounce time unit in mtk_gpio_set_debounce
  xen kconfig: don't "select INPUT_XEN_KBDDEV_FRONTEND"
  Input: pmic8xxx-pwrkey - fix algorithm for converting trigger delay
  Input: gtco - fix crash on detecting device without endpoints
  netlink: don't send NETLINK_URELEASE for unbound sockets
  nl80211: check netlink protocol in socket release notification
  powerpc: Update TM user feature bits in scan_features()
  powerpc: Update cpu_user_features2 in scan_features()
  powerpc: scan_features() updates incorrect bits for REAL_LE
  crypto: talitos - fix AEAD tcrypt tests
  crypto: talitos - fix crash in talitos_cra_init()
  crypto: sha1-mb - use corrcet pointer while completing jobs
  crypto: ccp - Prevent information leakage on export
  iwlwifi: mvm: fix memory leak in paging
  iwlwifi: pcie: lower the debug level for RSA semaphore access
  s390/pci: add extra padding to function measurement block
  cpufreq: intel_pstate: Fix processing for turbo activation ratio
  Revert "drm/amdgpu: disable runtime pm on PX laptops without dGPU power control"
  Revert "drm/radeon: disable runtime pm on PX laptops without dGPU power control"
  drm/i915: Fix race condition in intel_dp_destroy_mst_connector()
  drm/qxl: fix cursor position with non-zero hotspot
  drm/nouveau/core: use vzalloc for allocating ramht
  futex: Acknowledge a new waiter in counter before plist
  futex: Handle unlock_pi race gracefully
  asm-generic/futex: Re-enable preemption in futex_atomic_cmpxchg_inatomic()
  ALSA: hda - Add dock support for ThinkPad X260
  ALSA: pcxhr: Fix missing mutex unlock
  ALSA: hda - add PCI ID for Intel Broxton-T
  ALSA: hda - Keep powering up ADCs on Cirrus codecs
  ALSA: hda/realtek - Add ALC3234 headset mode for Optiplex 9020m
  ALSA: hda - Don't trust the reported actual power state
  x86 EDAC, sb_edac.c: Repair damage introduced when "fixing" channel address
  x86/mm/xen: Suppress hugetlbfs in PV guests
  arm64: Update PTE_RDONLY in set_pte_at() for PROT_NONE permission
  arm64: Honour !PTE_WRITE in set_pte_at() for kernel mappings
  sched/cgroup: Fix/cleanup cgroup teardown/init
  dmaengine: pxa_dma: fix the maximum requestor line
  dmaengine: hsu: correct use of channel status register
  dmaengine: dw: fix master selection
  debugfs: Make automount point inodes permanently empty
  lib: lz4: fixed zram with lz4 on big endian machines
  dm cache metadata: fix cmd_read_lock() acquiring write lock
  dm cache metadata: fix READ_LOCK macros and cleanup WRITE_LOCK macros
  usb: gadget: f_fs: Fix use-after-free
  usb: hcd: out of bounds access in for_each_companion
  xhci: fix 10 second timeout on removal of PCI hotpluggable xhci controllers
  usb: xhci: fix wild pointers in xhci_mem_cleanup
  xhci: resume USB 3 roothub first
  usb: xhci: applying XHCI_PME_STUCK_QUIRK to Intel BXT B0 host
  assoc_array: don't call compare_object() on a node
  ARM: OMAP2+: hwmod: Fix updating of sysconfig register
  ARM: OMAP2: Fix up interconnect barrier initialization for DRA7
  ARM: mvebu: Correct unit address for linksys
  ARM: dts: AM43x-epos: Fix clk parent for synctimer
  KVM: arm/arm64: Handle forward time correction gracefully
  kvm: x86: do not leak guest xcr0 into host interrupt handlers
  x86/mce: Avoid using object after free in genpool
  block: loop: fix filesystem corruption in case of aio/dio
  block: partition: initialize percpuref before sending out KOBJ_ADD

Conflicts:
	arch/arm64/Kconfig
	arch/arm64/include/asm/cputype.h
	arch/arm64/include/asm/hardirq.h
	arch/arm64/include/asm/irq.h
	arch/arm64/kernel/cpu_errata.c
	arch/arm64/kernel/cpuinfo.c
	arch/arm64/kernel/setup.c
	arch/arm64/kernel/smp.c
	arch/arm64/kernel/stacktrace.c
	arch/arm64/mm/init.c
	arch/arm64/mm/mmu.c
	arch/arm64/mm/pageattr.c
	mm/memcontrol.c

CRs-Fixed: 1054234
Signed-off-by: Trilok Soni <tsoni@codeaurora.org>
Change-Id: I2a7a34631ffee36ce18b9171f16d023be777392f
2016-08-18 14:50:45 -07:00
Trilok Soni
f145f41478 Merge remote-tracking branch 'msm-4.4/tmp-2bf7955' into msm-4.4
* msm-4.4/tmp-2bf7955:
  Linux 4.4.8
  Revert "usb: hub: do not clear BOS field during reset device"
  usbvision: fix crash on detecting device with invalid configuration
  staging: android: ion: Set the length of the DMA sg entries in buffer
  Revert "PCI, x86: Implement pcibios_alloc_irq() and pcibios_free_irq()"
  Revert "PCI: Add helpers to manage pci_dev->irq and pci_dev->irq_managed"
  Revert "x86/PCI: Don't alloc pcibios-irq when MSI is enabled"
  HID: usbhid: fix inconsistent reset/resume/reset-resume behavior
  HID: wacom: fix Bamboo ONE oops
  ALSA: usb-audio: Skip volume controls triggers hangup on Dell USB Dock
  ALSA: usb-audio: Add a quirk for Plantronics BT300
  ALSA: usb-audio: Add a sample rate quirk for Phoenix Audio TMX320
  ALSA: hda/realtek - Enable the ALC292 dock fixup on the Thinkpad T460s
  ALSA: hda - fix front mic problem for a HP desktop
  ALSA: hda - Fix headset support and noise on HP EliteBook 755 G2
  ALSA: hda - Fixup speaker pass-through control for nid 0x14 on ALC225
  mmc: sdhci-pci: Add support and PCI IDs for more Broxton host controllers
  perf: Cure event->pending_disable race
  perf: Do not double free
  arm64: replace read_lock to rcu lock in call_step_hook
  Btrfs: fix file/data loss caused by fsync after rename and new inode
  iommu: Don't overwrite domain pointer when there is no default_domain
  ext4: ignore quota mount options if the quota feature is enabled
  ext4: add lockdep annotations for i_data_sem
  btrfs: fix crash/invalid memory access on fsync when using overlayfs
  nfs: use file_dentry()
  fs: add file_dentry()
  sd: Fix excessive capacity printing on devices with blocks bigger than 512 bytes
  iio: gyro: bmg160: fix endianness when reading axes
  iio: gyro: bmg160: fix buffer read values
  iio: accel: bmc150: fix endianness when reading axes
  iio: st_magn: always define ST_MAGN_TRIGGER_SET_STATE
  usb: renesas_usbhs: fix to avoid using a disabled ep in usbhsg_queue_done()
  usb: renesas_usbhs: disable TX IRQ before starting TX DMAC transfer
  usb: renesas_usbhs: avoid NULL pointer derefernce in usbhsf_pkt_handler()
  mac80211: fix txq queue related crashes
  mac80211: fix unnecessary frame drops in mesh fwding
  mac80211: fix ibss scan parameters
  mac80211: avoid excessive stack usage in sta_info
  mac80211: properly deal with station hashtable insert errors
  virtio: virtio 1.0 cs04 spec compliance for reset
  rbd: use GFP_NOIO consistently for request allocations
  pcmcia: db1xxx_ss: fix last irq_to_gpio user
  v4l: vsp1: Set the SRU CTRL0 register when starting the stream
  coda: fix error path in case of missing pdata on non-DT platform
  au0828: Fix dev_state handling
  au0828: fix au0828_v4l2_close() dev_state race condition
  pinctrl: freescale: imx: fix bogus check of of_iomap() return value
  pinctrl: nomadik: fix pull debug print inversion
  pinctrl: sunxi: Fix A33 external interrupts not working
  pinctrl: sh-pfc: only use dummy states for non-DT platforms
  pinctrl: pistachio: fix mfio84-89 function description and pinmux.
  MIPS: Fix MSA ld unaligned failure cases
  KVM: x86: reduce default value of halt_poll_ns parameter
  KVM: x86: Inject pending interrupt even if pending nmi exist
  cdc-acm: fix NULL pointer reference
  USB: uas: Add a new NO_REPORT_LUNS quirk
  USB: uas: Limit qdepth at the scsi-host level
  mpls: find_outdev: check for err ptr in addition to NULL check
  ipv6: Count in extension headers in skb->network_header
  ip6_tunnel: set rtnl_link_ops before calling register_netdevice
  ipv6: l2tp: fix a potential issue in l2tp_ip6_recv
  ipv4: l2tp: fix a potential issue in l2tp_ip_recv
  tuntap: restore default qdisc
  tun, bpf: fix suspicious RCU usage in tun_{attach, detach}_filter
  rtnl: fix msg size calculation in if_nlmsg_size()
  bridge: Allow set bridge ageing time when switchdev disabled
  ipv6: udp: fix UDP_MIB_IGNOREDMULTI updates
  qmi_wwan: add "D-Link DWM-221 B1" device id
  xfrm: Fix crash observed during device unregistration and decryption
  ppp: take reference on channels netns
  ipv4: initialize flowi4_flags before calling fib_lookup()
  ipv4: fix broadcast packets reception
  bonding: fix bond_get_stats()
  net: bcmgenet: fix dma api length mismatch
  qlge: Fix receive packets drop.
  tcp/dccp: remove obsolete WARN_ON() in icmp handlers
  ppp: ensure file->private_data can't be overridden
  ath9k: fix buffer overrun for ar9287
  farsync: fix off-by-one bug in fst_add_one
  mlx4: add missing braces in verify_qp_parameters
  net: Fix use after free in the recvmmsg exit path
  ipv4: Don't do expensive useless work during inetdev destroy.
  bridge: allow zero ageing time
  rocker: set FDB cleanup timer according to lowest ageing time
  mlxsw: spectrum: Check requested ageing time is valid
  macvtap: always pass ethernet header in linear
  qlcnic: Fix mailbox completion handling during spurious interrupt
  qlcnic: Remove unnecessary usage of atomic_t
  sh_eth: advance 'rxdesc' later in sh_eth_ring_format()
  sh_eth: fix NULL pointer dereference in sh_eth_ring_format()
  bpf: avoid copying junk bytes in bpf_get_current_comm()
  packet: validate variable length ll headers
  ax25: add link layer header validation function
  net: validate variable length ll headers
  ppp: release rtnl mutex when interface creation fails
  tcp: fix tcpi_segs_in after connection establishment
  udp6: fix UDP/IPv6 encap resubmit path
  usbnet: cleanup after bind() in probe()
  cdc_ncm: toggle altsetting to force reset before setup
  vxlan: fix missing options_len update on RX with collect metadata
  ipv6: re-enable fragment header matching in ipv6_find_hdr
  qmi_wwan: add Sierra Wireless EM74xx device ID
  tipc: Revert "tipc: use existing sk_write_queue for outgoing packet chain"
  mld, igmp: Fix reserved tailroom calculation
  sctp: lack the check for ports in sctp_v6_cmp_addr
  net: fix bridge multicast packet checksum validation
  net: qca_spi: clear IFF_TX_SKB_SHARING
  net: qca_spi: Don't clear IFF_BROADCAST
  net: vrf: Remove direct access to skb->data
  net: jme: fix suspend/resume on JMC260
  ipv4: only create late gso-skb if skb is already set up with CHECKSUM_PARTIAL
  tunnel: Clear IPCB(skb)->opt before dst_link_failure called
  tcp: convert cached rtt from usec to jiffies when feeding initial rto
  xen/events: Mask a moving irq
  drm/amdgpu/gmc: use proper register for vram type on Fiji
  drm/amdgpu/gmc: move vram type fetching into sw_init
  drm/radeon: add a dpm quirk for all R7 370 parts
  drm/radeon: add another R7 370 quirk
  drm/radeon: add a dpm quirk for sapphire Dual-X R7 370 2G D5
  drm/udl: Use unlocked gem unreferencing
  drm/dp: move hw_mutex up the call stack
  arm64: opcodes.h: Add arm big-endian config options before including arm header
  compiler-gcc: disable -ftracer for __noclone functions
  libnvdimm, pfn: fix uuid validation
  libnvdimm: fix smart data retrieval
  powerpc/mm: Fixup preempt underflow with huge pages
  mm: fix invalid node in alloc_migrate_target()
  ALSA: hda - Apply fix for white noise on Asus N550JV, too
  ALSA: hda - Fix white noise on Asus N750JV headphone
  ALSA: hda - Asus N750JV external subwoofer fixup
  ALSA: timer: Use mod_timer() for rearming the system timer
  parisc: Unbreak handling exceptions from kernel modules
  parisc: Fix kernel crash with reversed copy_from_user()
  parisc: Avoid function pointers for kernel exception routines
  PKCS#7: pkcs7_validate_trust(): initialize the _trusted output argument
  hwmon: (max1111) Return -ENODEV from max1111_read_channel if not instantiated
  Linux 4.4.7
  perf/x86/intel: Fix PEBS data source interpretation on Nehalem/Westmere
  perf/x86/intel: Use PAGE_SIZE for PEBS buffer size on Core2
  perf/x86/intel: Fix PEBS warning by only restoring active PMU in pmi
  perf/x86/pebs: Add workaround for broken OVFL status on HSW+
  sched/cputime: Fix steal time accounting vs. CPU hotplug
  scsi_common: do not clobber fixed sense information
  PM / sleep: Clear pm_suspend_global_flags upon hibernate
  intel_idle: prevent SKL-H boot failure when C8+C9+C10 enabled
  mtd: onenand: fix deadlock in onenand_block_markbad
  mm/page_alloc: prevent merging between isolated and other pageblocks
  ocfs2/dlm: fix BUG in dlm_move_lockres_to_recovery_list
  ocfs2/dlm: fix race between convert and recovery
  Input: ati_remote2 - fix crashes on detecting device with invalid descriptor
  Input: ims-pcu - sanity check against missing interfaces
  Input: synaptics - handle spurious release of trackstick buttons, again
  writeback, cgroup: fix use of the wrong bdi_writeback which mismatches the inode
  writeback, cgroup: fix premature wb_put() in locked_inode_to_wb_and_lock_list()
  ACPI / PM: Runtime resume devices when waking from hibernate
  ARM: dts: at91: sama5d4 Xplained: don't disable hsmci regulator
  ARM: dts: at91: sama5d3 Xplained: don't disable hsmci regulator
  nfsd: fix deadlock secinfo+readdir compound
  nfsd4: fix bad bounds checking
  iser-target: Rework connection termination
  iser-target: Separate flows for np listeners and connections cma events
  iser-target: Add new state ISER_CONN_BOUND to isert_conn
  iser-target: Fix identification of login rx descriptor type
  target: Fix target_release_cmd_kref shutdown comp leak
  clk: bcm2835: Fix setting of PLL divider clock rates
  clk: rockchip: add hclk_cpubus to the list of rk3188 critical clocks
  clk: rockchip: rk3368: fix hdmi_cec gate-register
  clk: rockchip: rk3368: fix parents of video encoder/decoder
  clk: rockchip: rk3368: fix cpuclk core dividers
  clk: rockchip: rk3368: fix cpuclk mux bit of big cpu-cluster
  mmc: sdhci: Fix override of timeout clk wrt max_busy_timeout
  mmc: sdhci: fix data timeout (part 2)
  mmc: sdhci: fix data timeout (part 1)
  mmc: mmc_spi: Add Card Detect comments and fix CD GPIO case
  mmc: block: fix ABI regression of mmc_blk_ioctl
  ideapad-laptop: Add ideapad Y700 (15) to the no_hw_rfkill DMI list
  MAINTAINERS: Update mailing list and web page for hwmon subsystem
  kbuild/mkspec: fix grub2 installkernel issue
  scripts/kconfig: allow building with make 3.80 again
  scripts/coccinelle: modernize &
  bitops: Do not default to __clear_bit() for __clear_bit_unlock()
  tracing: Fix trace_printk() to print when not using bprintk()
  tracing: Fix crash from reading trace_pipe with sendfile
  tracing: Have preempt(irqs)off trace preempt disabled functions
  IB/ipoib: fix for rare multicast join race condition
  drm/amdgpu: include the right version of gmc header files for iceland
  drm/amdgpu: disable runtime pm on PX laptops without dGPU power control
  drm/radeon: Don't drop DP 2.7 Ghz link setup on some cards.
  drm/radeon: disable runtime pm on PX laptops without dGPU power control
  iwlwifi: mvm: Fix paging memory leak
  ipr: Fix regression when loading firmware
  ipr: Fix out-of-bounds null overwrite
  rapidio/rionet: fix deadlock on SMP
  fs/coredump: prevent fsuid=0 dumps into user-controlled directories
  fuse: Add reference counting for fuse_io_priv
  fuse: do not use iocb after it may have been freed
  md: multipath: don't hardcopy bio in .make_request path
  md/raid5: preserve STRIPE_PREREAD_ACTIVE in break_stripe_batch_list
  raid10: include bio_end_io_list in nr_queued to prevent freeze_array hang
  RAID5: revert e9e4c377e2 to fix a livelock
  RAID5: check_reshape() shouldn't call mddev_suspend
  md/raid5: Compare apples to apples (or sectors to sectors)
  raid1: include bio_end_io_list in nr_queued to prevent freeze_array hang
  xfs: fix two memory leaks in xfs_attr_list.c error paths
  quota: Fix possible GPF due to uninitialised pointers
  ARC: bitops: Remove non relevant comments
  ARC: [BE] readl()/writel() to work in Big Endian CPU configuration
  xtensa: clear all DBREAKC registers on start
  xtensa: fix preemption in {clear,copy}_user_highpage
  xtensa: ISS: don't hang if stdin EOF is reached
  splice: handle zero nr_pages in splice_to_pipe()
  vfs: show_vfsstat: do not ignore errors from show_devname method
  of: alloc anywhere from memblock if range not specified
  net: mvneta: enable change MAC address when interface is up
  cgroup: ignore css_sets associated with dead cgroups during migration
  Bluetooth: Fix potential buffer overflow with Add Advertising
  Bluetooth: Add new AR3012 ID 0489:e095
  watchdog: rc32434_wdt: fix ioctl error handling
  watchdog: don't run proc_watchdog_update if new value is same as old
  ia64: define ioremap_uc()
  mm: memcontrol: reclaim and OOM kill when shrinking memory.max below usage
  mm: memcontrol: reclaim when shrinking memory.high below usage
  bcache: fix cache_set_flush() NULL pointer dereference on OOM
  bcache: fix race of writeback thread starting before complete initialization
  bcache: cleaned up error handling around register_cache()
  IB/srpt: Simplify srpt_handle_tsk_mgmt()
  brd: Fix discard request processing
  jbd2: fix FS corruption possibility in jbd2_journal_destroy() on umount path
  tools/hv: Use include/uapi with __EXPORTED_HEADERS__
  ALSA: hda - Fix unconditional GPIO toggle via automute
  ALSA: hda - fix the mic mute button and led problem for a Lenovo AIO
  ALSA: hda - Don't handle ELD notify from invalid port
  ALSA: intel8x0: Add clock quirk entry for AD1981B on IBM ThinkPad X41.
  ALSA: pcm: Avoid "BUG:" string for warnings again
  ALSA: hda - Apply reboot D3 fix for CX20724 codec, too
  mtip32xx: Cleanup queued requests after surprise removal
  mtip32xx: Implement timeout handler
  mtip32xx: Handle FTL rebuild failure state during device initialization
  mtip32xx: Handle safe removal during IO
  mtip32xx: Fix for rmmod crash when drive is in FTL rebuild
  mtip32xx: Print exact time when an internal command is interrupted
  mtip32xx: Remove unwanted code from taskfile error handler
  mtip32xx: Fix broken service thread handling
  mtip32xx: Avoid issuing standby immediate cmd during FTL rebuild
  media: v4l2-compat-ioctl32: fix missing length copy in put_v4l2_buffer32
  coda: fix first encoded frame payload
  bttv: Width must be a multiple of 16 when capturing planar formats
  adv7511: TX_EDID_PRESENT is still 1 after a disconnect
  saa7134: Fix bytesperline not being set correctly for planar formats
  8250: use callbacks to access UART_DLL/UART_DLM
  net: irda: Fix use-after-free in irtty_open()
  tty: Fix GPF in flush_to_ldisc(), part 2
  staging: comedi: ni_mio_common: fix the ni_write[blw]() functions
  staging: android: ion_test: fix check of platform_device_register_simple() error code
  staging: comedi: ni_tiocmd: change mistaken use of start_src for start_arg
  HID: fix hid_ignore_special_drivers module parameter
  HID: multitouch: force retrieving of Win8 signature blob
  HID: i2c-hid: fix OOB write in i2c_hid_set_or_send_report()
  HID: logitech: fix Dual Action gamepad support
  tpm: fix the cleanup of struct tpm_chip
  tpm_eventlog.c: fix binary_bios_measurements
  tpm_crb: tpm2_shutdown() must be called before tpm_chip_unregister()
  tpm: fix the rollback in tpm_chip_register()
  mei: bus: check if the device is enabled before data transfer
  X.509: Fix leap year handling again
  crypto: marvell/cesa - forward devm_ioremap_resource() error code
  crypto: ux500 - fix checks of error code returned by devm_ioremap_resource()
  crypto: atmel - fix checks of error code returned by devm_ioremap_resource()
  crypto: keywrap - memzero the correct memory
  crypto: ccp - memset request context to zero during import
  crypto: ccp - Don't assume export/import areas are aligned
  crypto: ccp - Limit the amount of information exported
  crypto: ccp - Add hash state import and export support
  Bluetooth: btusb: Add a new AR3012 ID 13d3:3472
  Bluetooth: btusb: Add a new AR3012 ID 04ca:3014
  Bluetooth: btusb: Add new AR3012 ID 13d3:3395
  ALSA: usb-audio: Fix double-free in error paths after snd_usb_add_audio_stream() call
  ALSA: usb-audio: Minor code cleanup in create_fixed_stream_quirk()
  ALSA: usb-audio: add Microsoft HD-5001 to quirks
  ALSA: usb-audio: Add sanity checks for endpoint accesses
  ALSA: usb-audio: Fix NULL dereference in create_fixed_stream_quirk()
  Input: powermate - fix oops with malicious USB descriptors
  pwc: Add USB id for Philips Spc880nc webcam
  USB: option: add "D-Link DWM-221 B1" device id
  USB: serial: ftdi_sio: Add support for ICP DAS I-756xU devices
  USB: serial: cp210x: Adding GE Healthcare Device ID
  USB: cypress_m8: add endpoint sanity check
  USB: digi_acceleport: do sanity checking for the number of ports
  USB: mct_u232: add sanity checking in probe
  USB: usb_driver_claim_interface: add sanity checking
  USB: iowarrior: fix oops with malicious USB descriptors
  USB: cdc-acm: more sanity checking
  USB: uas: Reduce can_queue to MAX_CMNDS
  usb: hub: fix a typo in hub_port_init() leading to wrong logic
  usb: retry reset if a device times out
  dm: fix rq_end_stats() NULL pointer in dm_requeue_original_request()
  dm cache: make sure every metadata function checks fail_io
  dm thin metadata: don't issue prefetches if a transaction abort has failed
  dm: fix excessive dm-mq context switching
  dm snapshot: disallow the COW and origin devices from being identical
  libnvdimm: Fix security issue with DSM IOCTL.
  aic7xxx: Fix queue depth handling
  be2iscsi: set the boot_kset pointer to NULL in case of failure
  scsi: storvsc: fix SRB_STATUS_ABORTED handling
  sd: Fix discard granularity when LBPRZ=1
  aacraid: Set correct msix count for EEH recovery
  aacraid: Fix memory leak in aac_fib_map_free
  aacraid: Fix RRQ overload
  sg: fix dxferp in from_to case
  x86/mm: TLB_REMOTE_SEND_IPI should count pages
  x86/iopl: Fix iopl capability check on Xen PV
  x86/iopl/64: Properly context-switch IOPL on Xen PV
  x86/apic: Fix suspicious RCU usage in smp_trace_call_function_interrupt()
  x86/irq: Cure live lock in fixup_irqs()
  PCI: ACPI: IA64: fix IO port generic range check
  PCI: Disable IO/MEM decoding for devices with non-compliant BARs
  pinctrl-bcm2835: Fix cut-and-paste error in "pull" parsing
  s390/pci: enforce fmb page boundary rule
  s390/cpumf: add missing lpp magic initialization
  s390: fix floating pointer register corruption (again)
  EDAC, amd64_edac: Shift wrapping issue in f1x_get_norm_dct_addr()
  EDAC/sb_edac: Fix computation of channel address
  sched/preempt, sh: kmap_coherent relies on disabled preemption
  sched/cputime: Fix steal_account_process_tick() to always return jiffies
  Thermal: Ignore invalid trip points
  perf tools: Fix python extension build
  perf tools: Fix checking asprintf return value
  perf tools: Dont stop PMU parsing on alias parse error
  perf/core: Fix perf_sched_count derailment
  KVM: VMX: fix nested vpid for old KVM guests
  KVM: VMX: avoid guest hang on invalid invvpid instruction
  KVM: VMX: avoid guest hang on invalid invept instruction
  KVM: fix spin_lock_init order on x86
  KVM: i8254: change PIT discard tick policy
  KVM: x86: fix missed hardware breakpoints
  x86/PCI: Mark Broadwell-EP Home Agent & PCU as having non-compliant BARs
  perf/x86/intel: Add definition for PT PMI bit
  x86/entry/compat: Keep TS_COMPAT set during signal delivery
  x86/microcode: Untangle from BLK_DEV_INITRD
  x86/microcode/intel: Make early loader look for builtin microcode too
  mmc: sh_mmcif: Correct TX DMA channel allocation
  mmc: sh_mmcif: rework dma channel handling
  ASoC: samsung: pass DMA channels as pointers
  regulator: core: Fix nested locking of supplies
  regulator: core: avoid unused variable warning
  s390/cpumf: Fix lpp detection
  cpufreq: dt: No need to allocate resources anymore
  cpufreq: dt: No need to fetch voltage-tolerance
  cpufreq: dt: Use dev_pm_opp_set_rate() to switch frequency
  cpufreq: dt: Reuse dev_pm_opp_get_max_transition_latency()
  cpufreq: dt: Unsupported OPPs are already disabled
  cpufreq: dt: Pass regulator name to the OPP core
  cpufreq: dt: OPP layers handles clock-latency for V1 bindings as well
  cpufreq: dt: Rename 'need_update' to 'opp_v1'
  cpufreq: dt: Convert few pr_debug/err() calls to dev_dbg/err()
  cpufreq-dt: fix handling regulator_get_voltage() result
  cpufreq-dt: Supply power coefficient when registering cooling devices
  PM / OPP: Rename structures for clarity
  PM / OPP: Fix incorrect comments
  PM / OPP: Initialize regulator pointer to an error value
  PM / OPP: Initialize u_volt_min/max to a valid value
  PM / OPP: Fix NULL pointer dereference crash when disabling OPPs
  PM / OPP: Add dev_pm_opp_set_rate()
  PM / OPP: Manage device clk
  PM / OPP: Parse clock-latency and voltage-tolerance for v1 bindings
  PM / OPP: Introduce dev_pm_opp_get_max_transition_latency()
  PM / OPP: Introduce dev_pm_opp_get_max_volt_latency()
  PM / OPP: Disable OPPs that aren't supported by the regulator
  PM / OPP: get/put regulators from OPP core
  cpufreq: cpufreq-dt: avoid uninitialized variable warnings:
  PM / OPP: Use snprintf() instead of sprintf()
  PM / OPP: Set cpu_dev->id in cpumask first
  PM / OPP: Fix parsing of opp-microvolt and opp-microamp properties
  PM / OPP: Parse 'opp-<prop>-<name>' bindings
  PM / OPP: Parse 'opp-supported-hw' binding
  PM / OPP: Add missing doc comments
  PM / OPP: Rename OPP nodes as opp@<opp-hz>
  PM / OPP: Remove 'operating-points-names' binding
  PM / OPP: Add {opp-microvolt|opp-microamp}-<name> binding
  PM / OPP: Add "opp-supported-hw" binding
  PM / OPP: Add debugfs support
  arm64: vdso: Mark vDSO code as read-only

Conflicts:
	drivers/staging/android/ion/ion.c
	mm/page_alloc.c

CRs-Fixed: 1010239
Change-Id: Id59539cad642885e1e41340cebae4159ba1f7eaf
Signed-off-by: Trilok Soni <tsoni@codeaurora.org>
2016-07-22 16:45:32 -07:00
Tejun Heo
52526076a5 memcg: relocate charge moving from ->attach to ->post_attach
commit 264a0ae164bc0e9144bebcd25ff030d067b1a878 upstream.

Hello,

So, this ended up a lot simpler than I originally expected.  I tested
it lightly and it seems to work fine.  Petr, can you please test these
two patches w/o the lru drain drop patch and see whether the problem
is gone?

Thanks.
------ 8< ------
If charge moving is used, memcg performs relabeling of the affected
pages from its ->attach callback which is called under both
cgroup_threadgroup_rwsem and thus can't create new kthreads.  This is
fragile as various operations may depend on workqueues making forward
progress which relies on the ability to create new kthreads.

There's no reason to perform charge moving from ->attach which is deep
in the task migration path.  Move it to ->post_attach which is called
after the actual migration is finished and cgroup_threadgroup_rwsem is
dropped.

* move_charge_struct->mm is added and ->can_attach is now responsible
  for pinning and recording the target mm.  mem_cgroup_clear_mc() is
  updated accordingly.  This also simplifies mem_cgroup_move_task().

* mem_cgroup_move_task() is now called from ->post_attach instead of
  ->attach.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@kernel.org>
Debugged-and-tested-by: Petr Mladek <pmladek@suse.com>
Reported-by: Cyril Hrubis <chrubis@suse.cz>
Reported-by: Johannes Weiner <hannes@cmpxchg.org>
Fixes: 1ed1328792 ("sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem")
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-05-04 14:48:49 -07:00
Johannes Weiner
0ccab5b139 mm: memcontrol: reclaim and OOM kill when shrinking memory.max below usage
commit b6e6edcfa40561e9c8abe5eecf1c96f8e5fd9c6f upstream.

Setting the original memory.limit_in_bytes hardlimit is subject to a
race condition when the desired value is below the current usage.  The
code tries a few times to first reclaim and then see if the usage has
dropped to where we would like it to be, but there is no locking, and
the workload is free to continue making new charges up to the old limit.
Thus, attempting to shrink a workload relies on pure luck and hope that
the workload happens to cooperate.

To fix this in the cgroup2 memory.max knob, do it the other way round:
set the limit first, then try enforcement.  And if reclaim is not able
to succeed, trigger OOM kills in the group.  Keep going until the new
limit is met, we run out of OOM victims and there's only unreclaimable
memory left, or the task writing to memory.max is killed.  This allows
users to shrink groups reliably, and the behavior is consistent with
what happens when new charges are attempted in excess of memory.max.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-12 09:08:54 -07:00
Johannes Weiner
8b42fc47e1 mm: memcontrol: reclaim when shrinking memory.high below usage
commit 588083bb37a3cea8533c392370a554417c8f29cb upstream.

When setting memory.high below usage, nothing happens until the next
charge comes along, and then it will only reclaim its own charge and not
the now potentially huge excess of the new memory.high.  This can cause
groups to stay in excess of their memory.high indefinitely.

To fix that, when shrinking memory.high, kick off a reclaim cycle that
goes after the delta.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-12 09:08:53 -07:00
Martijn Coenen
ececa3ebe2 memcg: only free spare array when readers are done
commit 6611d8d76132f86faa501de9451a89bf23fb2371 upstream.

A spare array holding mem cgroup threshold events is kept around to make
sure we can always safely deregister an event and have an array to store
the new set of events in.

In the scenario where we're going from 1 to 0 registered events, the
pointer to the primary array containing 1 event is copied to the spare
slot, and then the spare slot is freed because no events are left.
However, it is freed before calling synchronize_rcu(), which means
readers may still be accessing threshold->primary after it is freed.

Fixed by only freeing after synchronize_rcu().

Signed-off-by: Martijn Coenen <maco@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-02-25 12:01:22 -08:00
Martijn Coenen
e212fc789a UPSTREAM: memcg: Only free spare array when readers are done
A spare array holding mem cgroup threshold events is kept around
to make sure we can always safely deregister an event and have an
array to store the new set of events in.

In the scenario where we're going from 1 to 0 registered events, the
pointer to the primary array containing 1 event is copied to the spare
slot, and then the spare slot is freed because no events are left.
However, it is freed before calling synchronize_rcu(), which means
readers may still be accessing threshold->primary after it is freed.

Fixed by only freeing after synchronize_rcu().

Signed-off-by: Martijn Coenen <maco@google.com>
2016-02-16 13:53:47 -08:00
Amit Pundir
703920c14a cgroup: refactor allow_attach handler for 4.4
Refactor *allow_attach() handler to align it with the changes
from mainline commit 1f7dd3e5a6 "cgroup: fix handling of
multi-destination migration from subtree_control enabling".

Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
2016-02-16 13:53:46 -08:00
Amit Pundir
dccfe9526b cgroup: memcg: pass correct argument to subsys_cgroup_allow_attach
Pass correct argument to subsys_cgroup_allow_attach(), which
expects 'struct cgroup_subsys_state *' argument but we pass
'struct cgroup *' instead which doesn't seem right.

This fixes following 'incompatible pointer type' compiler warning:
----------
  CC      mm/memcontrol.o
mm/memcontrol.c: In function ‘mem_cgroup_allow_attach’:
mm/memcontrol.c:5052:2: warning: passing argument 1 of ‘subsys_cgroup_allow_attach’ from incompatible pointer type [enabled by default]
In file included from include/linux/memcontrol.h:22:0,
                 from mm/memcontrol.c:29:
include/linux/cgroup.h:953:5: note: expected ‘struct cgroup_subsys_state *’ but argument is of type ‘struct cgroup *’
----------

Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
2016-02-16 13:53:44 -08:00
Rom Lemarchand
e6f5c0c0ec memcg: add permission check
Use the 'allow_attach' handler for the 'mem' cgroup to allow
non-root processes to add arbitrary processes to a 'mem' cgroup
if it has the CAP_SYS_NICE capability set.

Bug: 18260435
Change-Id: If7d37bf90c1544024c4db53351adba6a64966250
Signed-off-by: Rom Lemarchand <romlem@android.com>
2016-02-16 13:53:43 -08:00
Vladimir Davydov
6df38689e0 mm: memcontrol: fix possible memcg leak due to interrupted reclaim
Memory cgroup reclaim can be interrupted with mem_cgroup_iter_break()
once enough pages have been reclaimed, in which case, in contrast to a
full round-trip over a cgroup sub-tree, the current position stored in
mem_cgroup_reclaim_iter of the target cgroup does not get invalidated
and so is left holding the reference to the last scanned cgroup.  If the
target cgroup does not get scanned again (we might have just reclaimed
the last page or all processes might exit and free their memory
voluntary), we will leak it, because there is nobody to put the
reference held by the iterator.

The problem is easy to reproduce by running the following command
sequence in a loop:

    mkdir /sys/fs/cgroup/memory/test
    echo 100M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
    echo $$ > /sys/fs/cgroup/memory/test/cgroup.procs
    memhog 150M
    echo $$ > /sys/fs/cgroup/memory/cgroup.procs
    rmdir test

The cgroups generated by it will never get freed.

This patch fixes this issue by making mem_cgroup_iter avoid taking
reference to the current position.  In order not to hit use-after-free
bug while running reclaim in parallel with cgroup deletion, we make use
of ->css_released cgroup callback to clear references to the dying
cgroup in all reclaim iterators that might refer to it.  This callback
is called right before scheduling rcu work which will free css, so if we
access iter->position from rcu read section, we might be sure it won't
go away under us.

[hannes@cmpxchg.org: clean up css ref handling]
Fixes: 5ac8fb31ad ("mm: memcontrol: convert reclaim iterator to simple css refcounting")
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@vger.kernel.org>	[3.19+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-12-29 17:45:49 -08:00
Hugh Dickins
25be6a6595 mm: fix kerneldoc on mem_cgroup_replace_page
Whoops, I missed removing the kerneldoc comment of the lrucare arg
removed from mem_cgroup_replace_page; but it's a good comment, keep it.

Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-12-12 10:15:34 -08:00
Vladimir Davydov
9516a18a9a memcg: fix memory.high target
When the memory.high threshold is exceeded, try_charge() schedules a
task_work to reclaim the excess.  The reclaim target is set to the
number of pages requested by try_charge().

This is wrong, because try_charge() usually charges more pages than
requested (batch > nr_pages) in order to refill per cpu stocks.  As a
result, a process in a cgroup can easily exceed memory.high
significantly when doing a lot of charges w/o returning to userspace
(e.g.  reading a file in big chunks).

Fix this issue by assuring that when exceeding memory.high a process
reclaims as many pages as were actually charged (i.e.  batch).

Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-12-12 10:15:34 -08:00
Tejun Heo
1f7dd3e5a6 cgroup: fix handling of multi-destination migration from subtree_control enabling
Consider the following v2 hierarchy.

  P0 (+memory) --- P1 (-memory) --- A
                                 \- B
       
P0 has memory enabled in its subtree_control while P1 doesn't.  If
both A and B contain processes, they would belong to the memory css of
P1.  Now if memory is enabled on P1's subtree_control, memory csses
should be created on both A and B and A's processes should be moved to
the former and B's processes the latter.  IOW, enabling controllers
can cause atomic migrations into different csses.

The core cgroup migration logic has been updated accordingly but the
controller migration methods haven't and still assume that all tasks
migrate to a single target css; furthermore, the methods were fed the
css in which subtree_control was updated which is the parent of the
target csses.  pids controller depends on the migration methods to
move charges and this made the controller attribute charges to the
wrong csses often triggering the following warning by driving a
counter negative.

 WARNING: CPU: 1 PID: 1 at kernel/cgroup_pids.c:97 pids_cancel.constprop.6+0x31/0x40()
 Modules linked in:
 CPU: 1 PID: 1 Comm: systemd Not tainted 4.4.0-rc1+ #29
 ...
  ffffffff81f65382 ffff88007c043b90 ffffffff81551ffc 0000000000000000
  ffff88007c043bc8 ffffffff810de202 ffff88007a752000 ffff88007a29ab00
  ffff88007c043c80 ffff88007a1d8400 0000000000000001 ffff88007c043bd8
 Call Trace:
  [<ffffffff81551ffc>] dump_stack+0x4e/0x82
  [<ffffffff810de202>] warn_slowpath_common+0x82/0xc0
  [<ffffffff810de2fa>] warn_slowpath_null+0x1a/0x20
  [<ffffffff8118e031>] pids_cancel.constprop.6+0x31/0x40
  [<ffffffff8118e0fd>] pids_can_attach+0x6d/0xf0
  [<ffffffff81188a4c>] cgroup_taskset_migrate+0x6c/0x330
  [<ffffffff81188e05>] cgroup_migrate+0xf5/0x190
  [<ffffffff81189016>] cgroup_attach_task+0x176/0x200
  [<ffffffff8118949d>] __cgroup_procs_write+0x2ad/0x460
  [<ffffffff81189684>] cgroup_procs_write+0x14/0x20
  [<ffffffff811854e5>] cgroup_file_write+0x35/0x1c0
  [<ffffffff812e26f1>] kernfs_fop_write+0x141/0x190
  [<ffffffff81265f88>] __vfs_write+0x28/0xe0
  [<ffffffff812666fc>] vfs_write+0xac/0x1a0
  [<ffffffff81267019>] SyS_write+0x49/0xb0
  [<ffffffff81bcef32>] entry_SYSCALL_64_fastpath+0x12/0x76

This patch fixes the bug by removing @css parameter from the three
migration methods, ->can_attach, ->cancel_attach() and ->attach() and
updating cgroup_taskset iteration helpers also return the destination
css in addition to the task being migrated.  All controllers are
updated accordingly.

* Controllers which don't care whether there are one or multiple
  target csses can be converted trivially.  cpu, io, freezer, perf,
  netclassid and netprio fall in this category.

* cpuset's current implementation assumes that there's single source
  and destination and thus doesn't support v2 hierarchy already.  The
  only change made by this patchset is how that single destination css
  is obtained.

* memory migration path already doesn't do anything on v2.  How the
  single destination css is obtained is updated and the prep stage of
  mem_cgroup_can_attach() is reordered to accomodate the change.

* pids is the only controller which was affected by this bug.  It now
  correctly handles multi-destination migrations and no longer causes
  counter underflow from incorrect accounting.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Cc: Aleksa Sarai <cyphar@cyphar.com>
2015-12-03 10:18:21 -05:00
Andrew Morton
6f6461562e mm/memcontrol.c: uninline mem_cgroup_usage
gcc version 5.2.1 20151010 (Debian 5.2.1-22)
$ size mm/memcontrol.o mm/memcontrol.o.before
   text    data     bss     dec     hex filename
  35535    7908      64   43507    a9f3 mm/memcontrol.o
  35762    7908      64   43734    aad6 mm/memcontrol.o.before

Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06 17:50:42 -08:00
Mel Gorman
71baba4b92 mm, page_alloc: rename __GFP_WAIT to __GFP_RECLAIM
__GFP_WAIT was used to signal that the caller was in atomic context and
could not sleep.  Now it is possible to distinguish between true atomic
context and callers that are not willing to sleep.  The latter should
clear __GFP_DIRECT_RECLAIM so kswapd will still wake.  As clearing
__GFP_WAIT behaves differently, there is a risk that people will clear the
wrong flags.  This patch renames __GFP_WAIT to __GFP_RECLAIM to clearly
indicate what it does -- setting it allows all reclaim activity, clearing
them prevents it.

[akpm@linux-foundation.org: fix build]
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06 17:50:42 -08:00
Mel Gorman
d0164adc89 mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd
__GFP_WAIT has been used to identify atomic context in callers that hold
spinlocks or are in interrupts.  They are expected to be high priority and
have access one of two watermarks lower than "min" which can be referred
to as the "atomic reserve".  __GFP_HIGH users get access to the first
lower watermark and can be called the "high priority reserve".

Over time, callers had a requirement to not block when fallback options
were available.  Some have abused __GFP_WAIT leading to a situation where
an optimisitic allocation with a fallback option can access atomic
reserves.

This patch uses __GFP_ATOMIC to identify callers that are truely atomic,
cannot sleep and have no alternative.  High priority users continue to use
__GFP_HIGH.  __GFP_DIRECT_RECLAIM identifies callers that can sleep and
are willing to enter direct reclaim.  __GFP_KSWAPD_RECLAIM to identify
callers that want to wake kswapd for background reclaim.  __GFP_WAIT is
redefined as a caller that is willing to enter direct reclaim and wake
kswapd for background reclaim.

This patch then converts a number of sites

o __GFP_ATOMIC is used by callers that are high priority and have memory
  pools for those requests. GFP_ATOMIC uses this flag.

o Callers that have a limited mempool to guarantee forward progress clear
  __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
  into this category where kswapd will still be woken but atomic reserves
  are not used as there is a one-entry mempool to guarantee progress.

o Callers that are checking if they are non-blocking should use the
  helper gfpflags_allow_blocking() where possible. This is because
  checking for __GFP_WAIT as was done historically now can trigger false
  positives. Some exceptions like dm-crypt.c exist where the code intent
  is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
  flag manipulations.

o Callers that built their own GFP flags instead of starting with GFP_KERNEL
  and friends now also need to specify __GFP_KSWAPD_RECLAIM.

The first key hazard to watch out for is callers that removed __GFP_WAIT
and was depending on access to atomic reserves for inconspicuous reasons.
In some cases it may be appropriate for them to use __GFP_HIGH.

The second key hazard is callers that assembled their own combination of
GFP flags instead of starting with something like GFP_KERNEL.  They may
now wish to specify __GFP_KSWAPD_RECLAIM.  It's almost certainly harmless
if it's missed in most cases as other activity will wake kswapd.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06 17:50:42 -08:00
Linus Torvalds
2e3078af2c Merge branch 'akpm' (patches from Andrew)
Merge patch-bomb from Andrew Morton:

 - inotify tweaks

 - some ocfs2 updates (many more are awaiting review)

 - various misc bits

 - kernel/watchdog.c updates

 - Some of mm.  I have a huge number of MM patches this time and quite a
   lot of it is quite difficult and much will be held over to next time.

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (162 commits)
  selftests: vm: add tests for lock on fault
  mm: mlock: add mlock flags to enable VM_LOCKONFAULT usage
  mm: introduce VM_LOCKONFAULT
  mm: mlock: add new mlock system call
  mm: mlock: refactor mlock, munlock, and munlockall code
  kasan: always taint kernel on report
  mm, slub, kasan: enable user tracking by default with KASAN=y
  kasan: use IS_ALIGNED in memory_is_poisoned_8()
  kasan: Fix a type conversion error
  lib: test_kasan: add some testcases
  kasan: update reference to kasan prototype repo
  kasan: move KASAN_SANITIZE in arch/x86/boot/Makefile
  kasan: various fixes in documentation
  kasan: update log messages
  kasan: accurately determine the type of the bad access
  kasan: update reported bug types for kernel memory accesses
  kasan: update reported bug types for not user nor kernel memory accesses
  mm/kasan: prevent deadlock in kasan reporting
  mm/kasan: don't use kasan shadow pointer in generic functions
  mm/kasan: MODULE_VADDR is not available on all archs
  ...
2015-11-05 23:10:54 -08:00
Michal Hocko
c12176d336 memcg: fix thresholds for 32b architectures.
Commit 424cdc1413 ("memcg: convert threshold to bytes") has fixed a
regression introduced by 3e32cb2e0a ("mm: memcontrol: lockless page
counters") where thresholds were silently converted to use page units
rather than bytes when interpreting the user input.

The fix is not complete, though, as properly pointed out by Ben Hutchings
during stable backport review.  The page count is converted to bytes but
unsigned long is used to hold the value which would be obviously not
sufficient for 32b systems with more than 4G thresholds.  The same applies
to usage as taken from mem_cgroup_usage which might overflow.

Let's remove this bytes vs.  pages internal tracking differences and
handle thresholds in page units internally.  Chage mem_cgroup_usage() to
return the value in page units and revert 424cdc1413 because this should
be sufficient for the consistent handling.  mem_cgroup_read_u64 as the
only users of mem_cgroup_usage outside of the threshold handling code is
converted to give the proper in bytes result.  It is doing that already
for page_counter output so this is more consistent as well.

The value presented to the userspace is still in bytes units.

Fixes: 424cdc1413 ("memcg: convert threshold to bytes")
Fixes: 3e32cb2e0a ("mm: memcontrol: lockless page counters")
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Ben Hutchings <ben@decadent.org.uk>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@vger.kernel.org>
From: Michal Hocko <mhocko@kernel.org>
Subject: memcg-fix-thresholds-for-32b-architectures-fix

Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
From: Andrew Morton <akpm@linux-foundation.org>
Subject: memcg-fix-thresholds-for-32b-architectures-fix-fix

don't attempt to inline mem_cgroup_usage()

The compiler ignores the inline anwyay.  And __always_inlining it adds 600
bytes of goop to the .o file.

Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-05 19:34:48 -08:00
Johannes Weiner
6071ca5201 mm: page_counter: let page_counter_try_charge() return bool
page_counter_try_charge() currently returns 0 on success and -ENOMEM on
failure, which is surprising behavior given the function name.

Make it follow the expected pattern of try_stuff() functions that return a
boolean true to indicate success, or false for failure.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-05 19:34:48 -08:00
Johannes Weiner
f5fc3c5d81 mm: memcontrol: eliminate root memory.current
memory.current on the root level doesn't add anything that wouldn't be
more accurate and detailed using system statistics.  It already doesn't
include slabs, and it'll be a pain to keep in sync when further memory
types are accounted in the memory controller.  Remove it.

Note that this applies to the new unified hierarchy interface only.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-05 19:34:48 -08:00
Hugh Dickins
45637bab30 mm: rename mem_cgroup_migrate to mem_cgroup_replace_page
After v4.3's commit 0610c25daa ("memcg: fix dirty page migration")
mem_cgroup_migrate() doesn't have much to offer in page migration: convert
migrate_misplaced_transhuge_page() to set_page_memcg() instead.

Then rename mem_cgroup_migrate() to mem_cgroup_replace_page(), since its
remaining callers are replace_page_cache_page() and shmem_replace_page():
both of whom passed lrucare true, so just eliminate that argument.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-05 19:34:48 -08:00
Vladimir Davydov
df4065516b memcg: simplify and inline __mem_cgroup_from_kmem
Before the previous patch ("memcg: unify slab and other kmem pages
charging"), __mem_cgroup_from_kmem had to handle two types of kmem - slab
pages and pages allocated with alloc_kmem_pages - memcg in the page
struct.  Now we can unify it.  Since after it, this function becomes tiny
we can fold it into mem_cgroup_from_kmem.

[hughd@google.com: move mem_cgroup_from_kmem into list_lru.c]
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-05 19:34:48 -08:00
Vladimir Davydov
f3ccb2c422 memcg: unify slab and other kmem pages charging
We have memcg_kmem_charge and memcg_kmem_uncharge methods for charging and
uncharging kmem pages to memcg, but currently they are not used for
charging slab pages (i.e.  they are only used for charging pages allocated
with alloc_kmem_pages).  The only reason why the slab subsystem uses
special helpers, memcg_charge_slab and memcg_uncharge_slab, is that it
needs to charge to the memcg of kmem cache while memcg_charge_kmem charges
to the memcg that the current task belongs to.

To remove this diversity, this patch adds an extra argument to
__memcg_kmem_charge that can be a pointer to a memcg or NULL.  If it is
not NULL, the function tries to charge to the memcg it points to,
otherwise it charge to the current context.  Next, it makes the slab
subsystem use this function to charge slab pages.

Since memcg_charge_kmem and memcg_uncharge_kmem helpers are now used only
in __memcg_kmem_charge and __memcg_kmem_uncharge, they are inlined.  Since
__memcg_kmem_charge stores a pointer to the memcg in the page struct, we
don't need memcg_uncharge_slab anymore and can use free_kmem_pages.
Besides, one can now detect which memcg a slab page belongs to by reading
/proc/kpagecgroup.

Note, this patch switches slab to charge-after-alloc design.  Since this
design is already used for all other memcg charges, it should not make any
difference.

[hannes@cmpxchg.org: better to have an outer function than a magic parameter for the memcg lookup]
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-05 19:34:48 -08:00
Vladimir Davydov
d05e83a6f8 memcg: simplify charging kmem pages
Charging kmem pages proceeds in two steps.  First, we try to charge the
allocation size to the memcg the current task belongs to, then we allocate
a page and "commit" the charge storing the pointer to the memcg in the
page struct.

Such a design looks overcomplicated, because there is not much sense in
trying charging the allocation before actually allocating a page: we won't
be able to consume much memory over the limit even if we charge after
doing the actual allocation, besides we already charge user pages post
factum, so being pedantic with kmem pages just looks pointless.

So this patch simplifies the design by merging the "charge" and the
"commit" steps into the same function, which takes the allocated page.

Also, rename the charge and uncharge methods to memcg_kmem_charge and
memcg_kmem_uncharge and make the charge method return error code instead
of bool to conform to mem_cgroup_try_charge.

Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-05 19:34:48 -08:00
Jerome Marchand
3608de0787 mm/memcontrol.c: fix order calculation in try_charge()
Since commit 6539cc0538 ("mm: memcontrol: fold mem_cgroup_do_charge()"),
the order to pass to mem_cgroup_oom() is calculated by passing the
number of pages to get_order() instead of the expected size in bytes.
AFAICT, it only affects the value displayed in the oom warning message.
This patch fix this.

Michal said:

: We haven't noticed that just because the OOM is enabled only for page
: faults of order-0 (single page) and get_order work just fine.  Thanks for
: noticing this.  If we ever start triggering OOM on different orders this
: would be broken.

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-05 19:34:48 -08:00
Tejun Heo
10d53c748b memcg: ratify and consolidate over-charge handling
try_charge() is the main charging logic of memcg.  When it hits the limit
but either can't fail the allocation due to __GFP_NOFAIL or the task is
likely to free memory very soon, being OOM killed, has SIGKILL pending or
exiting, it "bypasses" the charge to the root memcg and returns -EINTR.
While this is one approach which can be taken for these situations, it has
several issues.

* It unnecessarily lies about the reality.  The number itself doesn't
  go over the limit but the actual usage does.  memcg is either forced
  to or actively chooses to go over the limit because that is the
  right behavior under the circumstances, which is completely fine,
  but, if at all avoidable, it shouldn't be misrepresenting what's
  happening by sneaking the charges into the root memcg.

* Despite trying, we already do over-charge.  kmemcg can't deal with
  switching over to the root memcg by the point try_charge() returns
  -EINTR, so it open-codes over-charing.

* It complicates the callers.  Each try_charge() user has to handle
  the weird -EINTR exception.  memcg_charge_kmem() does the manual
  over-charging.  mem_cgroup_do_precharge() performs unnecessary
  uncharging of root memcg, which BTW is inconsistent with what
  memcg_charge_kmem() does but not broken as [un]charging are noops on
  root memcg.  mem_cgroup_try_charge() needs to switch the returned
  cgroup to the root one.

The reality is that in memcg there are cases where we are forced and/or
willing to go over the limit.  Each such case needs to be scrutinized and
justified but there definitely are situations where that is the right
thing to do.  We alredy do this but with a superficial and inconsistent
disguise which leads to unnecessary complications.

This patch updates try_charge() so that it over-charges and returns 0 when
deemed necessary.  -EINTR return is removed along with all special case
handling in the callers.

While at it, remove the local variable @ret, which was initialized to zero
and never changed, along with done: label which just returned the always
zero @ret.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-05 19:34:48 -08:00
Tejun Heo
b23afb93d3 memcg: punt high overage reclaim to return-to-userland path
Currently, try_charge() tries to reclaim memory synchronously when the
high limit is breached; however, if the allocation doesn't have
__GFP_WAIT, synchronous reclaim is skipped.  If a process performs only
speculative allocations, it can blow way past the high limit.  This is
actually easily reproducible by simply doing "find /".  slab/slub
allocator tries speculative allocations first, so as long as there's
memory which can be consumed without blocking, it can keep allocating
memory regardless of the high limit.

This patch makes try_charge() always punt the over-high reclaim to the
return-to-userland path.  If try_charge() detects that high limit is
breached, it adds the overage to current->memcg_nr_pages_over_high and
schedules execution of mem_cgroup_handle_over_high() which performs
synchronous reclaim from the return-to-userland path.

As long as kernel doesn't have a run-away allocation spree, this should
provide enough protection while making kmemcg behave more consistently.
It also has the following benefits.

- All over-high reclaims can use GFP_KERNEL regardless of the specific
  gfp mask in use, e.g. GFP_NOFS, when the limit was breached.

- It copes with prio inversion.  Previously, a low-prio task with
  small memory.high might perform over-high reclaim with a bunch of
  locks held.  If a higher prio task needed any of these locks, it
  would have to wait until the low prio task finished reclaim and
  released the locks.  By handing over-high reclaim to the task exit
  path this issue can be avoided.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Michal Hocko <mhocko@kernel.org>
Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-05 19:34:48 -08:00
Tejun Heo
626ebc4100 memcg: flatten task_struct->memcg_oom
task_struct->memcg_oom is a sub-struct containing fields which are used
for async memcg oom handling.  Most task_struct fields aren't packaged
this way and it can lead to unnecessary alignment paddings.  This patch
flattens it.

* task.memcg_oom.memcg          -> task.memcg_in_oom
* task.memcg_oom.gfp_mask	-> task.memcg_oom_gfp_mask
* task.memcg_oom.order          -> task.memcg_oom_order
* task.memcg_oom.may_oom        -> task.memcg_may_oom

In addition, task.memcg_may_oom is relocated to where other bitfields are
which reduces the size of task_struct.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-05 19:34:48 -08:00
Linus Torvalds
69234acee5 Merge branch 'for-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
 "The cgroup core saw several significant updates this cycle:

   - percpu_rwsem for threadgroup locking is reinstated.  This was
     temporarily dropped due to down_write latency issues.  Oleg's
     rework of percpu_rwsem which is scheduled to be merged in this
     merge window resolves the issue.

   - On the v2 hierarchy, when controllers are enabled and disabled, all
     operations are atomic and can fail and revert cleanly.  This allows
     ->can_attach() failure which is necessary for cpu RT slices.

   - Tasks now stay associated with the original cgroups after exit
     until released.  This allows tracking resources held by zombies
     (e.g.  pids) and makes it easy to find out where zombies came from
     on the v2 hierarchy.  The pids controller was broken before these
     changes as zombies escaped the limits; unfortunately, updating this
     behavior required too many invasive changes and I don't think it's
     a good idea to backport them, so the pids controller on 4.3, the
     first version which included the pids controller, will stay broken
     at least until I'm sure about the cgroup core changes.

   - Optimization of a couple common tests using static_key"

* 'for-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (38 commits)
  cgroup: fix race condition around termination check in css_task_iter_next()
  blkcg: don't create "io.stat" on the root cgroup
  cgroup: drop cgroup__DEVEL__legacy_files_on_dfl
  cgroup: replace error handling in cgroup_init() with WARN_ON()s
  cgroup: add cgroup_subsys->free() method and use it to fix pids controller
  cgroup: keep zombies associated with their original cgroups
  cgroup: make css_set_rwsem a spinlock and rename it to css_set_lock
  cgroup: don't hold css_set_rwsem across css task iteration
  cgroup: reorganize css_task_iter functions
  cgroup: factor out css_set_move_task()
  cgroup: keep css_set and task lists in chronological order
  cgroup: make cgroup_destroy_locked() test cgroup_is_populated()
  cgroup: make css_sets pin the associated cgroups
  cgroup: relocate cgroup_[try]get/put()
  cgroup: move check_for_release() invocation
  cgroup: replace cgroup_has_tasks() with cgroup_is_populated()
  cgroup: make cgroup->nr_populated count the number of populated css_sets
  cgroup: remove an unused parameter from cgroup_task_migrate()
  cgroup: fix too early usage of static_branch_disable()
  cgroup: make cgroup_update_dfl_csses() migrate all target processes atomically
  ...
2015-11-05 14:51:32 -08:00
Linus Torvalds
ea1ee5ff1b Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block layer fixes from Jens Axboe:
 "A final set of fixes for 4.3.

  It is (again) bigger than I would have liked, but it's all been
  through the testing mill and has been carefully reviewed by multiple
  parties.  Each fix is either a regression fix for this cycle, or is
  marked stable.  You can scold me at KS.  The pull request contains:

   - Three simple fixes for NVMe, fixing regressions since 4.3.  From
     Arnd, Christoph, and Keith.

   - A single xen-blkfront fix from Cathy, fixing a NULL dereference if
     an error is returned through the staste change callback.

   - Fixup for some bad/sloppy code in nbd that got introduced earlier
     in this cycle.  From Markus Pargmann.

   - A blk-mq tagset use-after-free fix from Junichi.

   - A backing device lifetime fix from Tejun, fixing a crash.

   - And finally, a set of regression/stable fixes for cgroup writeback
     from Tejun"

* 'for-linus' of git://git.kernel.dk/linux-block:
  writeback: remove broken rbtree_postorder_for_each_entry_safe() usage in cgwb_bdi_destroy()
  NVMe: Fix memory leak on retried commands
  block: don't release bdi while request_queue has live references
  nvme: use an integer value to Linux errno values
  blk-mq: fix use-after-free in blk_mq_free_tag_set()
  nvme: fix 32-bit build warning
  writeback: fix incorrect calculation of available memory for memcg domains
  writeback: memcg dirty_throttle_control should be initialized with wb->memcg_completions
  writeback: bdi_writeback iteration must not skip dying ones
  writeback: fix bdi_writeback iteration in wakeup_dirtytime_writeback()
  writeback: laptop_mode_timer_fn() needs rcu_read_lock() around bdi_writeback iteration
  nbd: Add locking for tasks
  xen-blkfront: check for null drvdata in blkback_changed (XenbusStateClosing)
2015-10-24 07:20:57 +09:00
Shaohua Li
424cdc1413 memcg: convert threshold to bytes
page_counter_memparse() returns pages for the threshold, while
mem_cgroup_usage() returns bytes for memory usage.  Convert the
threshold to bytes.

Fixes: 3e32cb2e0a ("memcg: rename cgroup_event to mem_cgroup_event").
Signed-off-by: Shaohua Li <shli@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-10-16 11:42:28 -07:00
Tejun Heo
27bd4dbb8d cgroup: replace cgroup_has_tasks() with cgroup_is_populated()
Currently, cgroup_has_tasks() tests whether the target cgroup has any
css_set linked to it.  This works because a css_set's refcnt converges
with the number of tasks linked to it and thus there's no css_set
linked to a cgroup if it doesn't have any live tasks.

To help tracking resource usage of zombie tasks, putting the ref of
css_set will be separated from disassociating the task from the
css_set which means that a cgroup may have css_sets linked to it even
when it doesn't have any live tasks.

This patch replaces cgroup_has_tasks() with cgroup_is_populated()
which tests cgroup->nr_populated instead which locally counts the
number of populated css_sets.  Unlike cgroup_has_tasks(),
cgroup_is_populated() is recursive - if any of the descendants is
populated, the cgroup is populated too.  While this changes the
meaning of the test, all the existing users are okay with the change.

While at it, replace the open-coded ->populated_cnt test in
cgroup_events_show() with cgroup_is_populated().

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
2015-10-15 16:41:50 -04:00
Tejun Heo
c5edf9cdc4 writeback: fix incorrect calculation of available memory for memcg domains
For memcg domains, the amount of available memory was calculated as

 min(the amount currently in use + headroom according to memcg,
     total clean memory)

This isn't quite correct as what should be capped by the amount of
clean memory is the headroom, not the sum of memory in use and
headroom.  For example, if a memcg domain has a significant amount of
dirty memory, the above can lead to a value which is lower than the
current amount in use which doesn't make much sense.  In most
circumstances, the above leads to a number which is somewhat but not
drastically lower.

As the amount of memory which can be readily allocated to the memcg
domain is capped by the amount of system-wide clean memory which is
not already assigned to the memcg itself, the number we want is

 the amount currently in use +
 min(headroom according to memcg, clean memory elsewhere in the system)

This patch updates mem_cgroup_wb_stats() to return the number of
filepages and headroom instead of the calculated available pages.
mdtc_cap_avail() is renamed to mdtc_calc_avail() and performs the
above calculation from file, headroom, dirty and globally clean pages.

v2: Dummy mem_cgroup_wb_stats() implementation wasn't updated leading
    to build failure when !CGROUP_WRITEBACK.  Fixed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: c2aa723a60 ("writeback: implement memcg writeback domain based throttling")
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-10-12 10:31:13 -06:00
Greg Thelen
ef510194ce memcg: remove pcp_counter_lock
Commit 733a572e66 ("memcg: make mem_cgroup_read_{stat|event}() iterate
possible cpus instead of online") removed the last use of the per memcg
pcp_counter_lock but forgot to remove the variable.

Kill the vestigial variable.

Signed-off-by: Greg Thelen <gthelen@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-10-01 21:42:35 -04:00
Greg Thelen
484ebb3b8c memcg: make mem_cgroup_read_stat() unsigned
mem_cgroup_read_stat() returns a page count by summing per cpu page
counters.  The summing is racy wrt.  updates, so a transient negative
sum is possible.  Callers don't want negative values:

 - mem_cgroup_wb_stats() doesn't want negative nr_dirty or nr_writeback.
   This could confuse dirty throttling.

 - oom reports and memory.stat shouldn't show confusing negative usage.

 - tree_usage() already avoids negatives.

Avoid returning negative page counts from mem_cgroup_read_stat() and
convert it to unsigned.

[akpm@linux-foundation.org: fix old typo while we're in there]
Signed-off-by: Greg Thelen <gthelen@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: <stable@vger.kernel.org>	[4.2+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-10-01 21:42:35 -04:00
Tejun Heo
4530eddb59 cgroup, memcg, cpuset: implement cgroup_taskset_for_each_leader()
It wasn't explicitly documented but, when a process is being migrated,
cpuset and memcg depend on cgroup_taskset_first() returning the
threadgroup leader; however, this approach is somewhat ghetto and
would no longer work for the planned multi-process migration.

This patch introduces explicit cgroup_taskset_for_each_leader() which
iterates over only the threadgroup leaders and replaces
cgroup_taskset_first() usages for accessing the leader with it.

This prepares both memcg and cpuset for multi-process migration.  This
patch also updates the documentation for cgroup_taskset_for_each() to
clarify the iteration rules and removes comments mentioning task
ordering in tasksets.

v2: A previous patch which added threadgroup leader test was dropped.
    Patch updated accordingly.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
2015-09-22 12:46:53 -04:00
Tejun Heo
472912a2b5 memcg: generate file modified notifications on "memory.events"
cgroup core only recently grew generic notification support.  Wire up
"memory.events" so that it triggers a file modified event whenever its
content changes.

v2: Refreshed on top of mem_cgroup relocation.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Michal Hocko <mhocko@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Li Zefan <lizefan@huawei.com>
2015-09-21 15:14:47 -04:00
Tejun Heo
7dbdb199d3 cgroup: replace cftype->mode with CFTYPE_WORLD_WRITABLE
cftype->mode allows controllers to give arbitrary permissions to
interface knobs.  Except for "cgroup.event_control", the existing uses
are spurious.

* Some explicitly specify S_IRUGO | S_IWUSR even though that's the
  default.

* "cpuset.memory_pressure" specifies S_IRUGO while also setting a
  write callback which returns -EACCES.  All it needs to do is simply
  not setting a write callback.

"cgroup.event_control" uses cftype->mode to make the file
world-writable.  It's a misdesigned interface and we don't want
controllers to be tweaking interface file permissions in general.
This patch removes cftype->mode and all its spurious uses and
implements CFTYPE_WORLD_WRITABLE for "cgroup.event_control" which is
marked as compatibility-only.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
2015-09-18 17:54:23 -04:00
Tejun Heo
9e10a130d9 cgroup: replace cgroup_on_dfl() tests in controllers with cgroup_subsys_on_dfl()
cgroup_on_dfl() tests whether the cgroup's root is the default
hierarchy; however, an individual controller is only interested in
whether the controller is attached to the default hierarchy and never
tests a cgroup which doesn't belong to the hierarchy that the
controller is attached to.

This patch replaces cgroup_on_dfl() tests in controllers with faster
static_key based cgroup_subsys_on_dfl().  This leaves cgroup core as
the only user of cgroup_on_dfl() and the function is moved from the
header file to cgroup.c.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
2015-09-18 11:56:28 -04:00
Vladimir Davydov
e993d905c8 memcg: zap try_get_mem_cgroup_from_page
It is only used in mem_cgroup_try_charge, so fold it in and zap it.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Greg Thelen <gthelen@google.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-10 13:29:01 -07:00
Vladimir Davydov
2fc0452470 memcg: add page_cgroup_ino helper
This patchset introduces a new user API for tracking user memory pages
that have not been used for a given period of time.  The purpose of this
is to provide the userspace with the means of tracking a workload's
working set, i.e.  the set of pages that are actively used by the
workload.  Knowing the working set size can be useful for partitioning the
system more efficiently, e.g.  by tuning memory cgroup limits
appropriately, or for job placement within a compute cluster.

==== USE CASES ====

The unified cgroup hierarchy has memory.low and memory.high knobs, which
are defined as the low and high boundaries for the workload working set
size.  However, the working set size of a workload may be unknown or
change in time.  With this patch set, one can periodically estimate the
amount of memory unused by each cgroup and tune their memory.low and
memory.high parameters accordingly, therefore optimizing the overall
memory utilization.

Another use case is balancing workloads within a compute cluster.  Knowing
how much memory is not really used by a workload unit may help take a more
optimal decision when considering migrating the unit to another node
within the cluster.

Also, as noted by Minchan, this would be useful for per-process reclaim
(https://lwn.net/Articles/545668/). With idle tracking, we could reclaim idle
pages only by smart user memory manager.

==== USER API ====

The user API consists of two new files:

 * /sys/kernel/mm/page_idle/bitmap.  This file implements a bitmap where each
   bit corresponds to a page, indexed by PFN. When the bit is set, the
   corresponding page is idle. A page is considered idle if it has not been
   accessed since it was marked idle. To mark a page idle one should set the
   bit corresponding to the page by writing to the file. A value written to the
   file is OR-ed with the current bitmap value. Only user memory pages can be
   marked idle, for other page types input is silently ignored. Writing to this
   file beyond max PFN results in the ENXIO error. Only available when
   CONFIG_IDLE_PAGE_TRACKING is set.

   This file can be used to estimate the amount of pages that are not
   used by a particular workload as follows:

   1. mark all pages of interest idle by setting corresponding bits in the
      /sys/kernel/mm/page_idle/bitmap
   2. wait until the workload accesses its working set
   3. read /sys/kernel/mm/page_idle/bitmap and count the number of bits set

 * /proc/kpagecgroup.  This file contains a 64-bit inode number of the
   memory cgroup each page is charged to, indexed by PFN. Only available when
   CONFIG_MEMCG is set.

   This file can be used to find all pages (including unmapped file pages)
   accounted to a particular cgroup. Using /sys/kernel/mm/page_idle/bitmap, one
   can then estimate the cgroup working set size.

For an example of using these files for estimating the amount of unused
memory pages per each memory cgroup, please see the script attached
below.

==== REASONING ====

The reason to introduce the new user API instead of using
/proc/PID/{clear_refs,smaps} is that the latter has two serious
drawbacks:

 - it does not count unmapped file pages
 - it affects the reclaimer logic

The new API attempts to overcome them both. For more details on how it
is achieved, please see the comment to patch 6.

==== PATCHSET STRUCTURE ====

The patch set is organized as follows:

 - patch 1 adds page_cgroup_ino() helper for the sake of
   /proc/kpagecgroup and patches 2-3 do related cleanup
 - patch 4 adds /proc/kpagecgroup, which reports cgroup ino each page is
   charged to
 - patch 5 introduces a new mmu notifier callback, clear_young, which is
   a lightweight version of clear_flush_young; it is used in patch 6
 - patch 6 implements the idle page tracking feature, including the
   userspace API, /sys/kernel/mm/page_idle/bitmap
 - patch 7 exports idle flag via /proc/kpageflags

==== SIMILAR WORKS ====

Originally, the patch for tracking idle memory was proposed back in 2011
by Michel Lespinasse (see http://lwn.net/Articles/459269/).  The main
difference between Michel's patch and this one is that Michel implemented
a kernel space daemon for estimating idle memory size per cgroup while
this patch only provides the userspace with the minimal API for doing the
job, leaving the rest up to the userspace.  However, they both share the
same idea of Idle/Young page flags to avoid affecting the reclaimer logic.

==== PERFORMANCE EVALUATION ====

SPECjvm2008 (https://www.spec.org/jvm2008/) was used to evaluate the
performance impact introduced by this patch set.  Three runs were carried
out:

 - base: kernel without the patch
 - patched: patched kernel, the feature is not used
 - patched-active: patched kernel, 1 minute-period daemon is used for
   tracking idle memory

For tracking idle memory, idlememstat utility was used:
https://github.com/locker/idlememstat

testcase            base            patched        patched-active

compiler       537.40 ( 0.00)%   532.26 (-0.96)%   538.31 ( 0.17)%
compress       305.47 ( 0.00)%   301.08 (-1.44)%   300.71 (-1.56)%
crypto         284.32 ( 0.00)%   282.21 (-0.74)%   284.87 ( 0.19)%
derby          411.05 ( 0.00)%   413.44 ( 0.58)%   412.07 ( 0.25)%
mpegaudio      189.96 ( 0.00)%   190.87 ( 0.48)%   189.42 (-0.28)%
scimark.large   46.85 ( 0.00)%    46.41 (-0.94)%    47.83 ( 2.09)%
scimark.small  412.91 ( 0.00)%   415.41 ( 0.61)%   421.17 ( 2.00)%
serial         204.23 ( 0.00)%   213.46 ( 4.52)%   203.17 (-0.52)%
startup         36.76 ( 0.00)%    35.49 (-3.45)%    35.64 (-3.05)%
sunflow        115.34 ( 0.00)%   115.08 (-0.23)%   117.37 ( 1.76)%
xml            620.55 ( 0.00)%   619.95 (-0.10)%   620.39 (-0.03)%

composite      211.50 ( 0.00)%   211.15 (-0.17)%   211.67 ( 0.08)%

time idlememstat:

17.20user 65.16system 2:15:23elapsed 1%CPU (0avgtext+0avgdata 8476maxresident)k
448inputs+40outputs (1major+36052minor)pagefaults 0swaps

==== SCRIPT FOR COUNTING IDLE PAGES PER CGROUP ====
#! /usr/bin/python
#

import os
import stat
import errno
import struct

CGROUP_MOUNT = "/sys/fs/cgroup/memory"
BUFSIZE = 8 * 1024  # must be multiple of 8

def get_hugepage_size():
    with open("/proc/meminfo", "r") as f:
        for s in f:
            k, v = s.split(":")
            if k == "Hugepagesize":
                return int(v.split()[0]) * 1024

PAGE_SIZE = os.sysconf("SC_PAGE_SIZE")
HUGEPAGE_SIZE = get_hugepage_size()

def set_idle():
    f = open("/sys/kernel/mm/page_idle/bitmap", "wb", BUFSIZE)
    while True:
        try:
            f.write(struct.pack("Q", pow(2, 64) - 1))
        except IOError as err:
            if err.errno == errno.ENXIO:
                break
            raise
    f.close()

def count_idle():
    f_flags = open("/proc/kpageflags", "rb", BUFSIZE)
    f_cgroup = open("/proc/kpagecgroup", "rb", BUFSIZE)

    with open("/sys/kernel/mm/page_idle/bitmap", "rb", BUFSIZE) as f:
        while f.read(BUFSIZE): pass  # update idle flag

    idlememsz = {}
    while True:
        s1, s2 = f_flags.read(8), f_cgroup.read(8)
        if not s1 or not s2:
            break

        flags, = struct.unpack('Q', s1)
        cgino, = struct.unpack('Q', s2)

        unevictable = (flags >> 18) & 1
        huge = (flags >> 22) & 1
        idle = (flags >> 25) & 1

        if idle and not unevictable:
            idlememsz[cgino] = idlememsz.get(cgino, 0) + \
                (HUGEPAGE_SIZE if huge else PAGE_SIZE)

    f_flags.close()
    f_cgroup.close()
    return idlememsz

if __name__ == "__main__":
    print "Setting the idle flag for each page..."
    set_idle()

    raw_input("Wait until the workload accesses its working set, "
              "then press Enter")

    print "Counting idle pages..."
    idlememsz = count_idle()

    for dir, subdirs, files in os.walk(CGROUP_MOUNT):
        ino = os.stat(dir)[stat.ST_INO]
        print dir + ": " + str(idlememsz.get(ino, 0) / 1024) + " kB"
==== END SCRIPT ====

This patch (of 8):

Add page_cgroup_ino() helper to memcg.

This function returns the inode number of the closest online ancestor of
the memory cgroup a page is charged to.  It is required for exporting
information about which page is charged to which cgroup to userspace,
which will be introduced by a following patch.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Greg Thelen <gthelen@google.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-10 13:29:01 -07:00
Michal Hocko
e752eb6881 memcg: move memcg_proto_active from sock.h
The only user is sock_update_memcg which is living in memcontrol.c so it
doesn't make much sense to pollute sock.h by this inline helper.  Move it
to memcontrol.c and open code it into its only caller.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08 15:35:28 -07:00
Michal Hocko
a03f1f0589 memcg, tcp_kmem: check for cg_proto in sock_update_memcg
sk_prot->proto_cgroup is allowed to return NULL but sock_update_memcg
doesn't check for NULL.  The function relies on the mem_cgroup_is_root
check because we shouldn't get NULL otherwise because mem_cgroup_from_task
will always return !NULL.

All other callers are checking for NULL and we can safely replace
mem_cgroup_is_root() check by cg_proto != NULL which will be more
straightforward (proto_cgroup returns NULL for the root memcg already).

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08 15:35:28 -07:00
Tejun Heo
9f2115f93b memcg: restructure mem_cgroup_can_attach()
Restructure it to lower nesting level and help the planned threadgroup
leader iteration changes.

This is pure reorganization.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08 15:35:28 -07:00
Michal Hocko
33398cf2f3 memcg: export struct mem_cgroup
mem_cgroup structure is defined in mm/memcontrol.c currently which means
that the code outside of this file has to use external API even for
trivial access stuff.

This patch exports mm_struct with its dependencies and makes some of the
exported functions inlines.  This even helps to reduce the code size a bit
(make defconfig + CONFIG_MEMCG=y)

  text		data    bss     dec     	 hex 	filename
  12355346        1823792 1089536 15268674         e8fb42 vmlinux.before
  12354970        1823792 1089536 15268298         e8f9ca vmlinux.after

This is not much (370B) but better than nothing.

We also save a function call in some hot paths like callers of
mem_cgroup_count_vm_event which is used for accounting.

The patch doesn't introduce any functional changes.

[vdavykov@parallels.com: inline memcg_kmem_is_active]
[vdavykov@parallels.com: do not expose type outside of CONFIG_MEMCG]
[akpm@linux-foundation.org: memcontrol.h needs eventfd.h for eventfd_ctx]
[akpm@linux-foundation.org: export mem_cgroup_from_task() to modules]
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08 15:35:28 -07:00
David Rientjes
54e9e29132 mm, oom: pass an oom order of -1 when triggered by sysrq
The force_kill member of struct oom_control isn't needed if an order of -1
is used instead.  This is the same as order == -1 in struct
compact_control which requires full memory compaction.

This patch introduces no functional change.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08 15:35:28 -07:00
David Rientjes
6e0fc46dc2 mm, oom: organize oom context into struct
There are essential elements to an oom context that are passed around to
multiple functions.

Organize these elements into a new struct, struct oom_control, that
specifies the context for an oom condition.

This patch introduces no functional change.

Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08 15:35:28 -07:00
Sebastian Andrzej Siewior
ce9ce6659a mm: memcontrol: bring back the VM_BUG_ON() in mem_cgroup_swapout()
Clark stumbled over a VM_BUG_ON() in -RT which was then was removed by
Johannes in commit f371763a79 ("mm: memcontrol: fix false-positive
VM_BUG_ON() on -rt").  The comment before that patch was a tiny bit better
than it is now.  While the patch claimed to fix a false-postive on -RT
this was not the case.  None of the -RT folks ACKed it and it was not a
false positive report.  That was a *real* problem.

This patch updates the comment that is improper because it refers to
"disabled preemption" as a consequence of that lock being taken.  A
spin_lock() disables preemption, true, but in this case the code relies on
the fact that the lock _also_ disables interrupts once it is acquired.
And this is the important detail (which was checked the VM_BUG_ON()) which
needs to be pointed out.  This is the hint one needs while looking at the
code.  It was explained by Johannes on the list that the per-CPU variables
are protected by local_irq_save().  The BUG_ON() was helpful.  This code
has been workarounded in -RT in the meantime.  I wouldn't mind running
into more of those if the code in question uses *special* kind of locking
since now there is no verification (in terms of lockdep or BUG_ON()) and
therefore I bring the VM_BUG_ON() check back in.

The two functions after the comment could also have a "local_irq_save()"
dance around them in order to serialize access to the per-CPU variables.
This has been avoided because the interrupts should be off.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Clark Williams <williams@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04 16:54:41 -07:00