* refs/heads/tmp-3f51ea2
Linux 4.4.133
x86/kexec: Avoid double free_page() upon do_kexec_load() failure
hfsplus: stop workqueue when fill_super() failed
cfg80211: limit wiphy names to 128 bytes
gpio: rcar: Add Runtime PM handling for interrupts
time: Fix CLOCK_MONOTONIC_RAW sub-nanosecond accounting
dmaengine: ensure dmaengine helpers check valid callback
scsi: zfcp: fix infinite iteration on ERP ready list
scsi: sg: allocate with __GFP_ZERO in sg_build_indirect()
scsi: libsas: defer ata device eh commands to libata
s390: use expoline thunks in the BPF JIT
s390: extend expoline to BC instructions
s390: move spectre sysfs attribute code
s390/kernel: use expoline for indirect branches
s390/lib: use expoline for indirect branches
s390: move expoline assembler macros to a header
s390: add assembler macros for CPU alternatives
ext2: fix a block leak
tcp: purge write queue in tcp_connect_init()
sock_diag: fix use-after-free read in __sk_free
packet: in packet_snd start writing at link layer allocation
net: test tailroom before appending to linear skb
btrfs: fix reading stale metadata blocks after degraded raid1 mounts
btrfs: fix crash when trying to resume balance without the resume flag
Btrfs: fix xattr loss after power failure
ARM: 8772/1: kprobes: Prohibit kprobes on get_user functions
ARM: 8770/1: kprobes: Prohibit probing on optimized_callback
ARM: 8769/1: kprobes: Fix to use get_kprobe_ctlblk after irq-disabed
tick/broadcast: Use for_each_cpu() specially on UP kernels
ARM: 8771/1: kprobes: Prohibit kprobes on do_undefinstr
efi: Avoid potential crashes, fix the 'struct efi_pci_io_protocol_32' definition for mixed mode
s390: remove indirect branch from do_softirq_own_stack
s390/qdio: don't release memory in qdio_setup_irq()
s390/cpum_sf: ensure sample frequency of perf event attributes is non-zero
s390/qdio: fix access to uninitialized qdio_q fields
mm: don't allow deferred pages with NEED_PER_CPU_KM
powerpc/powernv: Fix NVRAM sleep in invalid context when crashing
procfs: fix pthread cross-thread naming if !PR_DUMPABLE
proc read mm's {arg,env}_{start,end} with mmap semaphore taken.
tracing/x86/xen: Remove zero data size trace events trace_xen_mmu_flush_tlb{_all}
cpufreq: intel_pstate: Enable HWP by default
signals: avoid unnecessary taking of sighand->siglock
mm: filemap: avoid unnecessary calls to lock_page when waiting for IO to complete during a read
mm: filemap: remove redundant code in do_read_cache_page
proc: meminfo: estimate available memory more conservatively
vmscan: do not force-scan file lru if its absolute size is small
powerpc: Don't preempt_disable() in show_cpuinfo()
cpuidle: coupled: remove unused define cpuidle_coupled_lock
powerpc/powernv: remove FW_FEATURE_OPALv3 and just use FW_FEATURE_OPAL
powerpc/powernv: Remove OPALv2 firmware define and references
powerpc/powernv: panic() on OPAL < V3
spi: pxa2xx: Allow 64-bit DMA
ALSA: control: fix a redundant-copy issue
ALSA: hda: Add Lenovo C50 All in one to the power_save blacklist
ALSA: usb: mixer: volume quirk for CM102-A+/102S+
usbip: usbip_host: fix bad unlock balance during stub_probe()
usbip: usbip_host: fix NULL-ptr deref and use-after-free errors
usbip: usbip_host: run rebind from exit when module is removed
usbip: usbip_host: delete device from busid_table after rebind
usbip: usbip_host: refine probe and disconnect debug msgs to be useful
kernel/exit.c: avoid undefined behaviour when calling wait4()
futex: futex_wake_op, fix sign_extend32 sign bits
pipe: cap initial pipe capacity according to pipe-max-size limit
l2tp: revert "l2tp: fix missing print session offset info"
Revert "ARM: dts: imx6qdl-wandboard: Fix audio channel swap"
lockd: lost rollback of set_grace_period() in lockd_down_net()
xfrm: fix xfrm_do_migrate() with AEAD e.g(AES-GCM)
futex: Remove duplicated code and fix undefined behaviour
futex: Remove unnecessary warning from get_futex_key
arm64: Add work around for Arm Cortex-A55 Erratum 1024718
arm64: introduce mov_q macro to move a constant into a 64-bit register
audit: move calcs after alloc and check when logging set loginuid
ALSA: timer: Call notifier in the same spinlock
sctp: delay the authentication for the duplicated cookie-echo chunk
sctp: fix the issue that the cookie-ack with auth can't get processed
tcp: ignore Fast Open on repair mode
bonding: do not allow rlb updates to invalid mac
tg3: Fix vunmap() BUG_ON() triggered from tg3_free_consistent().
sctp: use the old asoc when making the cookie-ack chunk in dupcook_d
sctp: handle two v4 addrs comparison in sctp_inet6_cmp_addr
r8169: fix powering up RTL8168h
qmi_wwan: do not steal interfaces from class drivers
openvswitch: Don't swap table in nlattr_set() after OVS_ATTR_NESTED is found
net: support compat 64-bit time in {s,g}etsockopt
net_sched: fq: take care of throttled flows before reuse
net/mlx4_en: Verify coalescing parameters are in range
net: ethernet: sun: niu set correct packet size in skb
llc: better deal with too small mtu
ipv4: fix memory leaks in udp_sendmsg, ping_v4_sendmsg
dccp: fix tasklet usage
bridge: check iface upper dev when setting master via ioctl
8139too: Use disable_irq_nosync() in rtl8139_poll_controller()
BACKPORT, FROMLIST: fscrypt: add Speck128/256 support
cgroup: Disable IRQs while holding css_set_lock
Revert "cgroup: Disable IRQs while holding css_set_lock"
cgroup: Disable IRQs while holding css_set_lock
ANDROID: proc: fix undefined behavior in proc_uid_base_readdir
x86: vdso: Fix leaky vdso linker with CC=clang.
ANDROID: build: cuttlefish: Upgrade clang to newer version.
ANDROID: build: cuttlefish: Upgrade clang to newer version.
ANDROID: build: cuttlefish: Fix path to clang.
UPSTREAM: dm bufio: avoid sleeping while holding the dm_bufio lock
ANDROID: sdcardfs: Don't d_drop in d_revalidate
Conflicts:
arch/arm64/include/asm/cputype.h
fs/ext4/crypto.c
fs/ext4/ext4.h
kernel/cgroup.c
mm/vmscan.c
Change-Id: Ic10c5722b6439af1cf423fd949c493f786764d7e
Signed-off-by: Srinivasarao P <spathi@codeaurora.org>
commit 84ad5802a33a4964a49b8f7d24d80a214a096b19 upstream.
The MemAvailable item in /proc/meminfo is to give users a hint of how
much memory is allocatable without causing swapping, so it excludes the
zones' low watermarks as unavailable to userspace.
However, for a userspace allocation, kswapd will actually reclaim until
the free pages hit a combination of the high watermark and the page
allocator's lowmem protection that keeps a certain amount of DMA and
DMA32 memory from userspace as well.
Subtract the full amount we know to be unavailable to userspace from the
number of free pages when calculating MemAvailable.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Adjust /proc/meminfo MemAvailable calculation by adding the amount of
indirectly reclaimable memory (rounded to the PAGE_SIZE).
This change is referenced from commit 034ebf65c3c2 ("mm: treat indirectly
reclaimable memory as available in MemAvailable") on upstream, as node
based vmstat global_node_page_state is not present, zone based vmstat
global_page_state is used instead.
Change-Id: I7303d0f8ccd5993c7234a5187430d418d49e5763
Signed-off-by: Vijayanand Jitta <vjitta@codeaurora.org>
It turns out that at least some versions of glibc end up reading
/proc/meminfo at every single startup, because glibc wants to know the
amount of memory the machine has. And while that's arguably insane,
it's just how things are.
And it turns out that it's not all that expensive most of the time, but
the vmalloc information statistics (amount of virtual memory used in the
vmalloc space, and the biggest remaining chunk) can be rather expensive
to compute.
The 'get_vmalloc_info()' function actually showed up on my profiles as
4% of the CPU usage of "make test" in the git source repository, because
the git tests are lots of very short-lived shell-scripts etc.
It turns out that apparently this same silly vmalloc info gathering
shows up on the facebook servers too, according to Dave Jones. So it's
not just "make test" for git.
We had two patches to just cache the information (one by me, one by
Ingo) to mitigate this issue, but the whole vmalloc information of of
rather dubious value to begin with, and people who *actually* want to
know what the situation is wrt the vmalloc area should just look at the
much more complete /proc/vmallocinfo instead.
In fact, according to my testing - and perhaps more importantly,
according to that big search engine in the sky: Google - there is
nothing out there that actually cares about those two expensive fields:
VmallocUsed and VmallocChunk.
So let's try to just remove them entirely. Actually, this just removes
the computation and reports the numbers as zero for now, just to try to
be minimally intrusive.
If this breaks anything, we'll obviously have to re-introduce the code
to compute this all and add the caching patches on top. But if given
the option, I'd really prefer to just remove this bad idea entirely
rather than add even more code to work around our historical mistake
that likely nobody really cares about.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch include CMA info (CMATotal, CMAFree) in /proc/meminfo.
Currently, in a CMA enabled system, if somebody wants to know the total
CMA size declared, there is no way to tell, other than the dmesg or
/var/log/messages logs.
With this patch we are showing the CMA info as part of meminfo, so that it
can be determined at any point of time. This will be populated only when
CMA is enabled.
Below is the sample output from a ARM based device with RAM:512MB and CMA:16MB.
MemTotal: 471172 kB
MemFree: 111712 kB
MemAvailable: 271172 kB
.
.
.
CmaTotal: 16384 kB
CmaFree: 6144 kB
This patch also fix below checkpatch errors that were found during these changes.
ERROR: space required after that ',' (ctx:ExV)
199: FILE: fs/proc/meminfo.c:199:
+ ,atomic_long_read(&num_poisoned_pages) << (PAGE_SHIFT - 10)
^
ERROR: space required after that ',' (ctx:ExV)
202: FILE: fs/proc/meminfo.c:202:
+ ,K(global_page_state(NR_ANON_TRANSPARENT_HUGEPAGES) *
^
ERROR: space required after that ',' (ctx:ExV)
206: FILE: fs/proc/meminfo.c:206:
+ ,K(totalcma_pages)
^
total: 3 errors, 0 warnings, 2 checks, 236 lines checked
Signed-off-by: Pintu Kumar <pintu.k@samsung.com>
Signed-off-by: Vishnu Pratap Singh <vishnu.ps@samsung.com>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Historically, we exported shared pages to userspace via sysinfo(2)
sharedram and /proc/meminfo's "MemShared" fields. With the advent of
tmpfs, from kernel v2.4 onward, that old way for accounting shared mem
was deemed inaccurate and we started to export a hard-coded 0 for
sysinfo.sharedram. Later on, during the 2.6 timeframe, "MemShared" got
re-introduced to /proc/meminfo re-branded as "Shmem", but we're still
reporting sysinfo.sharedmem as that old hard-coded zero, which makes the
"shared memory" report inconsistent across interfaces.
This patch leverages the addition of explicit accounting for pages used
by shmem/tmpfs -- "4b02108 mm: oom analysis: add shmem vmstat" -- in
order to make the users of sysinfo(2) and si_meminfo*() friends aware of
that vmstat entry and make them report it consistently across the
interfaces, as well to make sysinfo(2) returned data consistent with our
current API documentation states.
Signed-off-by: Rafael Aquini <aquini@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It should read "reclaimable slab" and not "reclaimable swap".
Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Rafael Aquini <aquini@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
PROC_FS is a bool, so this code is either present or absent. It will
never be modular, so using module_init as an alias for __initcall is
rather misleading.
Fix this up now, so that we can relocate module_init from init.h into
module.h in the future. If we don't do this, we'd have to add module.h to
obviously non-modular code, and that would be ugly at best.
Note that direct use of __initcall is discouraged, vs. one of the
priority categorized subgroups. As __initcall gets mapped onto
device_initcall, our use of fs_initcall (which makes sense for fs code)
will thus change these registrations from level 6-device to level 5-fs
(i.e. slightly earlier). However no observable impact of that small
difference has been observed during testing, or is expected.
Also note that this change uncovers a missing semicolon bug in the
registration of vmcore_init as an initcall.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Many load balancing and workload placing programs check /proc/meminfo to
estimate how much free memory is available. They generally do this by
adding up "free" and "cached", which was fine ten years ago, but is
pretty much guaranteed to be wrong today.
It is wrong because Cached includes memory that is not freeable as page
cache, for example shared memory segments, tmpfs, and ramfs, and it does
not include reclaimable slab memory, which can take up a large fraction
of system memory on mostly idle systems with lots of files.
Currently, the amount of memory that is available for a new workload,
without pushing the system into swap, can be estimated from MemFree,
Active(file), Inactive(file), and SReclaimable, as well as the "low"
watermarks from /proc/zoneinfo.
However, this may change in the future, and user space really should not
be expected to know kernel internals to come up with an estimate for the
amount of free memory.
It is more convenient to provide such an estimate in /proc/meminfo. If
things change in the future, we only have to change it in one place.
Signed-off-by: Rik van Riel <riel@redhat.com>
Reported-by: Erik Mouw <erik.mouw_2@nxp.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The same calculation is currently done in three differents places.
Factor that code so future changes has to be made at only one place.
[akpm@linux-foundation.org: uninline vm_commit_limit()]
Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We use NR_ANON_PAGES as base for reporting AnonPages to user. There's
not much sense in not accounting transparent huge pages there, but add
them on printing to user.
Let's account transparent huge pages in NR_ANON_PAGES in the first place.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Hugh Dickins <hughd@google.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Ning Qu <quning@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now get_vmalloc_info() is in fs/proc/mmu.c. There is no reason that this
code must be here and it's implementation needs vmlist_lock and it iterate
a vmlist which may be internal data structure for vmalloc.
It is preferable that vmlist_lock and vmlist is only used in vmalloc.c
for maintainability. So move the code to vmalloc.c
Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Dave Anderson <anderson@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When I use several fast SSD to do swap, swapper_space.tree_lock is
heavily contended. This makes each swap partition have one
address_space to reduce the lock contention. There is an array of
address_space for swap. The swap entry type is the index to the array.
In my test with 3 SSD, this increases the swapout throughput 20%.
[akpm@linux-foundation.org: revert unneeded change to __add_to_swap_cache]
Signed-off-by: Shaohua Li <shli@fusionio.com>
Cc: Hugh Dickins <hughd@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since MCE is an x86 concept, and this code is in mm/, it would be better
to use the name num_poisoned_pages instead of mce_bad_pages.
[akpm@linux-foundation.org: fix mm/sparse.c]
Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Suggested-by: Borislav Petkov <bp@alien8.de>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix the error message "directives may not be used inside a macro argument"
which appears when the kernel is compiled for the cris architecture.
Signed-off-by: Claudio Scordino <claudio@evidence.eu.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This allows us to move duplicated code in <asm/atomic.h>
(atomic_inc_not_zero() for now) to <linux/atomic.h>
Signed-off-by: Arun Sharma <asharma@fb.com>
Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add hugepage stat information to /proc/vmstat and /proc/meminfo.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Given such a long name, the kB count in /proc/meminfo's HardwareCorrupted
line is being shown too far right (it does align with x86_64's VmallocChunk
above, but I hope nobody will ever have that much corrupted!). Align it.
Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (21 commits)
HWPOISON: Enable error_remove_page on btrfs
HWPOISON: Add simple debugfs interface to inject hwpoison on arbitary PFNs
HWPOISON: Add madvise() based injector for hardware poisoned pages v4
HWPOISON: Enable error_remove_page for NFS
HWPOISON: Enable .remove_error_page for migration aware file systems
HWPOISON: The high level memory error handler in the VM v7
HWPOISON: Add PR_MCE_KILL prctl to control early kill behaviour per process
HWPOISON: shmem: call set_page_dirty() with locked page
HWPOISON: Define a new error_remove_page address space op for async truncation
HWPOISON: Add invalidate_inode_page
HWPOISON: Refactor truncate to allow direct truncating of page v2
HWPOISON: check and isolate corrupted free pages v2
HWPOISON: Handle hardware poisoned pages in try_to_unmap
HWPOISON: Use bitmask/action code for try_to_unmap behaviour
HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler v2
HWPOISON: Add poison check to page fault handling
HWPOISON: Add basic support for poisoned pages in fault handler v3
HWPOISON: Add new SIGBUS error codes for hardware poison signals
HWPOISON: Add support for poison swap entries v2
HWPOISON: Export some rmap vma locking to outside world
...
Recently we encountered OOM problems due to memory use of the GEM cache.
Generally a large amuont of Shmem/Tmpfs pages tend to create a memory
shortage problem.
We often use the following calculation to determine the amount of shmem
pages:
shmem = NR_ACTIVE_ANON + NR_INACTIVE_ANON - NR_ANON_PAGES
however the expression does not consider isolated and mlocked pages.
This patch adds explicit accounting for pages used by shmem and tmpfs.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Acked-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The amount of memory allocated to kernel stacks can become significant and
cause OOM conditions. However, we do not display the amount of memory
consumed by stacks.
Add code to display the amount of memory used for stacks in /proc/meminfo.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add the high level memory handler that poisons pages
that got corrupted by hardware (typically by a two bit flip in a DIMM
or a cache) on the Linux level. The goal is to prevent everyone
from accessing these pages in the future.
This done at the VM level by marking a page hwpoisoned
and doing the appropriate action based on the type of page
it is.
The code that does this is portable and lives in mm/memory-failure.c
To quote the overview comment:
High level machine check handler. Handles pages reported by the
hardware as being corrupted usually due to a 2bit ECC memory or cache
failure.
This focuses on pages detected as corrupted in the background.
When the current CPU tries to consume corruption the currently
running process can just be killed directly instead. This implies
that if the error cannot be handled for some reason it's safe to
just ignore it because no corruption has been consumed yet. Instead
when that happens another machine check will happen.
Handles page cache pages in various states. The tricky part
here is that we can access any page asynchronous to other VM
users, because memory failures could happen anytime and anywhere,
possibly violating some of their assumptions. This is why this code
has to be extremely careful. Generally it tries to use normal locking
rules, as in get the standard locks, even if that means the
error handling takes potentially a long time.
Some of the operations here are somewhat inefficient and have non
linear algorithmic complexity, because the data structures have not
been optimized for this case. This is in particular the case
for the mapping from a vma to a process. Since this case is expected
to be rare we hope we can get away with this.
There are in principle two strategies to kill processes on poison:
- just unmap the data and wait for an actual reference before
killing
- kill as soon as corruption is detected.
Both have advantages and disadvantages and should be used
in different situations. Right now both are implemented and can
be switched with a new sysctl vm.memory_failure_early_kill
The default is early kill.
The patch does some rmap data structure walking on its own to collect
processes to kill. This is unusual because normally all rmap data structure
knowledge is in rmap.c only. I put it here for now to keep
everything together and rmap knowledge has been seeping out anyways
Includes contributions from Johannes Weiner, Chris Mason, Fengguang Wu,
Nick Piggin (who did a lot of great work) and others.
Cc: npiggin@suse.de
Cc: riel@redhat.com
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
The Committed_AS field can underflow in certain situations:
> # while true; do cat /proc/meminfo | grep _AS; sleep 1; done | uniq -c
> 1 Committed_AS: 18446744073709323392 kB
> 11 Committed_AS: 18446744073709455488 kB
> 6 Committed_AS: 35136 kB
> 5 Committed_AS: 18446744073709454400 kB
> 7 Committed_AS: 35904 kB
> 3 Committed_AS: 18446744073709453248 kB
> 2 Committed_AS: 34752 kB
> 9 Committed_AS: 18446744073709453248 kB
> 8 Committed_AS: 34752 kB
> 3 Committed_AS: 18446744073709320960 kB
> 7 Committed_AS: 18446744073709454080 kB
> 3 Committed_AS: 18446744073709320960 kB
> 5 Committed_AS: 18446744073709454080 kB
> 6 Committed_AS: 18446744073709320960 kB
Because NR_CPUS can be greater than 1000 and meminfo_proc_show() does
not check for underflow.
But NR_CPUS proportional isn't good calculation. In general,
possibility of lock contention is proportional to the number of online
cpus, not theorical maximum cpus (NR_CPUS).
The current kernel has generic percpu-counter stuff. using it is right
way. it makes code simplify and percpu_counter_read_positive() don't
make underflow issue.
Reported-by: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Eric B Munson <ebmunson@us.ibm.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: <stable@kernel.org> [All kernel versions]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix a number of issues with the per-MM VMA patch:
(1) Make mmap_pages_allocated an atomic_long_t, just in case this is used on
a NOMMU system with more than 2G pages. Makes no difference on a 32-bit
system.
(2) Report vma->vm_pgoff * PAGE_SIZE as a 64-bit value, not a 32-bit value,
lest it overflow.
(3) Move the allocation of the vm_area_struct slab back for fork.c.
(4) Use KMEM_CACHE() for both vm_area_struct and vm_region slabs.
(5) Use BUG_ON() rather than if () BUG().
(6) Make the default validate_nommu_regions() a static inline rather than a
#define.
(7) Make free_page_series()'s objection to pages with a refcount != 1 more
informative.
(8) Adjust the __put_nommu_region() banner comment to indicate that the
semaphore must be held for writing.
(9) Limit the number of warnings about munmaps of non-mmapped regions.
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Greg Ungerer <gerg@snapgear.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make VMAs per mm_struct as for MMU-mode linux. This solves two problems:
(1) In SYSV SHM where nattch for a segment does not reflect the number of
shmat's (and forks) done.
(2) In mmap() where the VMA's vm_mm is set to point to the parent mm by an
exec'ing process when VM_EXECUTABLE is specified, regardless of the fact
that a VMA might be shared and already have its vm_mm assigned to another
process or a dead process.
A new struct (vm_region) is introduced to track a mapped region and to remember
the circumstances under which it may be shared and the vm_list_struct structure
is discarded as it's no longer required.
This patch makes the following additional changes:
(1) Regions are now allocated with alloc_pages() rather than kmalloc() and
with no recourse to __GFP_COMP, so the pages are not composite. Instead,
each page has a reference on it held by the region. Anything else that is
interested in such a page will have to get a reference on it to retain it.
When the pages are released due to unmapping, each page is passed to
put_page() and will be freed when the page usage count reaches zero.
(2) Excess pages are trimmed after an allocation as the allocation must be
made as a power-of-2 quantity of pages.
(3) VMAs are added to the parent MM's R/B tree and mmap lists. As an MM may
end up with overlapping VMAs within the tree, the VMA struct address is
appended to the sort key.
(4) Non-anonymous VMAs are now added to the backing inode's prio list.
(5) Holes may be punched in anonymous VMAs with munmap(), releasing parts of
the backing region. The VMA and region structs will be split if
necessary.
(6) sys_shmdt() only releases one attachment to a SYSV IPC shared memory
segment instead of all the attachments at that addresss. Multiple
shmat()'s return the same address under NOMMU-mode instead of different
virtual addresses as under MMU-mode.
(7) Core dumping for ELF-FDPIC requires fewer exceptions for NOMMU-mode.
(8) /proc/maps is now the global list of mapped regions, and may list bits
that aren't actually mapped anywhere.
(9) /proc/meminfo gains a line (tagged "MmapCopy") that indicates the amount
of RAM currently allocated by mmap to hold mappable regions that can't be
mapped directly. These are copies of the backing device or file if not
anonymous.
These changes make NOMMU mode more similar to MMU mode. The downside is that
NOMMU mode requires some extra memory to track things over NOMMU without this
patch (VMAs are no longer shared, and there are now region structs).
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Mike Frysinger <vapier.adi@gmail.com>
Acked-by: Paul Mundt <lethal@linux-sh.org>