is_vmalloc_addr currently assumes that all vmalloc addresses
exist between VMALLOC_START and VMALLOC_END. This may not be
the case when interleaving vmalloc and lowmem. Update
is_vmalloc_addr() to check the actual vmalloc regions instead.
Correspondingly we need to ensure that VMALLOC_TOTAL accounts
for all the vmalloc regions when CONFIG_ENABLE_VMALLOC_SAVING
is enabled.
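A minimal sketch of the idea (not the exact patch), assuming the check
moves into mm/vmalloc.c where vmap_area_list and vmap_area_lock live:

int is_vmalloc_addr(const void *x)
{
    unsigned long addr = (unsigned long)x;
    struct vmap_area *va;
    int ret = 0;

    /* A bounds check against VMALLOC_START/VMALLOC_END is no longer
     * sufficient; walk the tracked vmalloc regions instead.
     */
    spin_lock(&vmap_area_lock);
    list_for_each_entry(va, &vmap_area_list, list) {
        if (addr >= va->va_start && addr < va->va_end) {
            ret = 1;
            break;
        }
    }
    spin_unlock(&vmap_area_lock);
    return ret;
}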
Change-Id: I5def3d6ae1a4de59ea36f095b8c73649a37b1f36
Signed-off-by: Susheel Khiani <skhiani@codeaurora.org>
Currently on 32 bit systems, virtual space above
PAGE_OFFSET is reserved for direct mapped lowmem
and part of virtual address space is reserved for
vmalloc. We want to optimize so as to have as
much direct mapped memory as possible, since there is a
penalty for mapping/unmapping highmem. Now, we may
have an image that is expected to have a lifetime of
the entire system and is reserved in a physical region
that would be part of direct mapped lowmem. The
physical memory which is thus reserved is never used
by Linux. This means that even though the system is
not actually accessing the virtual memory
corresponding to the reserved physical memory, we
are still losing that portion of direct mapped lowmem
space.
So by allowing lowmem to be non-contiguous, we can
give this unused virtual address space of the reserved
region back for use in vmalloc.
Change-Id: I980b3dfafac71884dcdcb8cd2e4a6363cde5746a
Signed-off-by: Susheel Khiani <skhiani@codeaurora.org>
Even though lowmem is accounted for in vmalloc space,
allocation comes only from the region bounded by
VMALLOC_START and VMALLOC_END. The kernel virtual area
can now allocate from any unmapped region starting
from PAGE_OFFSET.
Change-Id: I291b9eb443d3f7445fd979bd7b09e9241ff22ba3
Signed-off-by: Neeti Desai <neetid@codeaurora.org>
Signed-off-by: Susheel Khiani <skhiani@codeaurora.org>
There are places in the kernel, like the lowmemorykiller, which
invoke show_mem_call_notifiers from an atomic context.
So move from a blocking notifier to an atomic one. At present
the notifier callbacks do not call sleeping functions,
but care must be taken that this remains true in the future.
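A sketch of the conversion, assuming the show_mem_notifier head from
the showmem notifier framework patch below:

static ATOMIC_NOTIFIER_HEAD(show_mem_notifier); /* was BLOCKING_NOTIFIER_HEAD */

void show_mem_call_notifiers(void)
{
    /* Safe from atomic context; callbacks must not sleep. */
    atomic_notifier_call_chain(&show_mem_notifier, 0, NULL);
}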
Change-Id: I9668e67463ab8a6a60be55dbc86b88f45be8b041
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
It was found that a number of tasks were blocked in the reclaim path
(throttle_vm_writeout) for seconds, because of vmstat_diff not being
synced in time. Fix that by adding a new function
global_page_state_snapshot.
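A hedged sketch of the new helper, modeled on the existing
zone_page_state_snapshot(): fold in the per-cpu deltas that have not
yet been synced into the global counter.

static unsigned long global_page_state_snapshot(enum zone_stat_item item)
{
    long x = atomic_long_read(&vm_stat[item]);
#ifdef CONFIG_SMP
    struct zone *zone;
    int cpu;

    /* Add the not-yet-folded per-cpu diffs of every populated zone. */
    for_each_populated_zone(zone)
        for_each_online_cpu(cpu)
            x += per_cpu_ptr(zone->pageset, cpu)->vm_stat_diff[item];

    if (x < 0)
        x = 0;
#endif
    return x;
}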
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
Change-Id: Iec167635ad724a55c27bdbd49eb8686e7857216c
Commit "mm: vmscan: fix the page state calculation in too_many_isolated"
fixed an issue where a number of tasks were blocked in reclaim path
for seconds, because of vmstat_diff not being synced in time.
A similar problem can happen in isolate_migratepages_block, where
a similar calculation is performed. This patch fixes that.
Change-Id: Ie74f108ef770da688017b515fe37faea6f384589
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
At present any vmpressure value is scaled up if the pages are
reclaimed through direct reclaim. This can result in false
vmpressure values. Consider a case where a device is booted up
and most of the memory is occupied by file pages. kswapd will
make sure that the high watermark is maintained. Now when a sudden,
huge allocation request comes in, the system will definitely
have to get into direct reclaim. The vmpressure values can be very low,
but because of the allocstall accounting logic even these low values
will be scaled to values nearing 100. This can result in
unnecessary LMK kills for example. So define a tunable threshold
for vmpressure above which the allocstalls will be accounted.
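A sketch under assumed names (the tunable is called
allocstall_threshold here purely for illustration):

static unsigned int allocstall_threshold __read_mostly = 70;

static unsigned long vmpressure_account_stall(unsigned long pressure,
                                              unsigned long stall,
                                              unsigned long scanned)
{
    /* Below the threshold, direct-reclaim stalls do not scale the
     * reported pressure at all.
     */
    if (pressure < allocstall_threshold)
        return scanned;

    return scanned + stall;
}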
CRs-fixed: 893699
Change-Id: Idd7c6724264ac89f1f68f2e9d70a32390ffca3e5
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
There are a couple of issues with swapcache usage when ZRAM is used
as swap device.
1) Kernel does a swap readahead which can be around 6 to 8 pages
depending on total ram, which is not required for zram since
accesses are fast.
2) Kernel delays the freeing up of swapcache expecting a later hit,
which again is useless in the case of zram.
3) This is not related to swapcache, but zram usage itself.
As mentioned in (2) kernel delays freeing of swapcache, but along with
that it delays zram compressed page free also. i.e. there can be 2 copies,
though one is compressed.
This patch addresses these issues using two new flags
QUEUE_FLAG_FAST and SWP_FAST, to indicate that accesses to the device
will be fast and cheap, and instructs the swap layer to free up
swap space aggressively, and not to do readahead.
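A sketch of how the flags fit together; blk_queue_fast() is modeled
on the existing QUEUE_FLAG_* helpers:

#define blk_queue_fast(q)   test_bit(QUEUE_FLAG_FAST, &(q)->queue_flags)

/* At swapon time, tag the swap device when its queue is marked fast;
 * the readahead and swapcache paths then test SWP_FAST to skip
 * readahead and free swapcache eagerly.
 */
if (p->bdev && blk_queue_fast(bdev_get_queue(p->bdev)))
    p->flags |= SWP_FAST;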
Change-Id: I5d2d5176a5f9420300bb2f843f6ecbdb25ea80e4
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
The existing calculation of vmpressure takes into account only
the ratio of reclaimed to scanned pages, but not the time spent
or the difficulty in reclaiming those pages. For example, when there
are quite a number of file pages in the system, an allocation
request can be satisfied by reclaiming the file pages alone. If
such a reclaim is successful, the vmpressure value will remain low
irrespective of the time spent by the reclaim code to free up the
file pages. With a feature like lowmemorykiller, killing a task
can be faster than reclaiming the file pages alone. So if the
vmpressure values reflect the reclaim difficulty level, clients
can make a decision based on that, for example, to kill a task early.
This patch monitors the number of pages scanned in the direct
reclaim path and scales the vmpressure level according to that.
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
Change-Id: I6e643d29a9a1aa0814309253a8b690ad86ec0b13
Currently, vmpressure is tied to memcg and its events are
available only to userspace clients. This patch removes
the dependency on CONFIG_MEMCG and adds a mechanism for
in-kernel clients to subscribe for vmpressure events (in
fact raw vmpressure values are delivered instead of vmpressure
levels, to provide clients more flexibility to take actions
on custom pressure levels which are not currently defined
by vmpressure module).
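A sketch of the in-kernel interface; the function names are
assumptions:

static BLOCKING_NOTIFIER_HEAD(vmpressure_notifier);

int vmpressure_notifier_register(struct notifier_block *nb)
{
    return blocking_notifier_chain_register(&vmpressure_notifier, nb);
}

static void vmpressure_notify(unsigned long pressure)
{
    /* Raw 0-100 pressure values, not the LOW/MEDIUM/CRITICAL levels. */
    blocking_notifier_call_chain(&vmpressure_notifier, pressure, NULL);
}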
Change-Id: I38010f166546e8d7f12f5f355b5dbfd6ba04d587
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
Add a new config item, CONFIG_FORCE_ALLOC_FROM_DMA_ZONE, which
can be used to optionally force certain allocators to always
return memory from ZONE_DMA.
This option helps ensure that clients who require ZONE_DMA
memory are always using ZONE_DMA memory.
Change-Id: Id2d36214307789f27aa775c2bef2dab5047c4ff0
Signed-off-by: Liam Mark <lmark@codeaurora.org>
Currently we have kmemleak_stack_scan enabled by default.
This can hog the cpu with pre-emption disabled for a long
time starving other tasks.
Make this optional at compile time, since if required
we can always write to the sysfs entry and enable this option.
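A minimal sketch, with a hypothetical Kconfig symbol name:

/* Stack scanning defaults off when the new option is set; it can
 * still be re-enabled at runtime with
 * echo stack=on > /sys/kernel/debug/kmemleak
 */
static int kmemleak_stack_scan =
    !IS_ENABLED(CONFIG_DEBUG_KMEMLEAK_NO_STACK_SCAN);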
Change-Id: Ie30447861c942337c7ff25ac269b6025a527e8eb
Signed-off-by: Vignesh Radhakrishnan <vigneshr@codeaurora.org>
Signed-off-by: Prasad Sodagudi <psodagud@codeaurora.org>
When CONFIG_PAGE_POISONING is enabled, the pages are poisoned
after setting free page in KASan Shadow memory and KASan reports
the read after free warning. The same thing happens in the allocation
path. So change the order of the KASan alloc/free API calls so that
page poisoning happens while the pages are in alloc status in the
KASan shadow memory.
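The reordering, sketched on the free path (the allocation path is
changed symmetrically):

/* Before: the shadow was marked freed first, so the poisoning
 * memset in kernel_map_pages() tripped KASan:
 *
 *    kasan_free_pages(page, order);
 *    kernel_map_pages(page, 1 << order, 0);
 *
 * After: poison while the shadow still says "allocated", then mark
 * the range freed.
 */
kernel_map_pages(page, 1 << order, 0);
kasan_free_pages(page, order);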
The following KASan report is for reference.
==================================================================
BUG: KASan: use after free in memset+0x24/0x44 at addr ffffffc000000000
Write of size 4096 by task swapper/0
page:ffffffbac5000000 count:0 mapcount:0 mapping: (null) index:0x0
flags: 0x0()
page dumped because: kasan: bad access detected
CPU: 0 PID: 0 Comm: swapper Not tainted 3.18.0-g5a4a5d5-07242-g6938a8b-dirty #1
Hardware name: Qualcomm Technologies, Inc. MSM 8996 v2 + PMI8994 MTP (DT)
Call trace:
[<ffffffc000089ea4>] dump_backtrace+0x0/0x1c4
[<ffffffc00008a078>] show_stack+0x10/0x1c
[<ffffffc0010ecfd8>] dump_stack+0x74/0xc8
[<ffffffc00020faec>] kasan_report_error+0x2b0/0x408
[<ffffffc00020fd20>] kasan_report+0x34/0x40
[<ffffffc00020f138>] __asan_storeN+0x15c/0x168
[<ffffffc00020f374>] memset+0x20/0x44
[<ffffffc0002086e0>] kernel_map_pages+0x238/0x2a8
[<ffffffc0001ba738>] free_pages_prepare+0x21c/0x25c
[<ffffffc0001bc7e4>] __free_pages_ok+0x20/0xf0
[<ffffffc0001bd3bc>] __free_pages+0x34/0x44
[<ffffffc0001bd5d8>] __free_pages_bootmem+0xf4/0x110
[<ffffffc001ca9050>] free_all_bootmem+0x160/0x1f4
[<ffffffc001c97b30>] mem_init+0x70/0x1ec
[<ffffffc001c909f8>] start_kernel+0x2b8/0x4e4
[<ffffffc001c987dc>] kasan_early_init+0x154/0x160
Change-Id: Idbd3dc629be57ed55a383b069a735ae3ee7b9f05
Signed-off-by: Se Wang (Patrick) Oh <sewango@codeaurora.org>
Change the logic which determines the initial readahead window size
such that for small requests (one page) the initial window size
will be x4 the size of the original request, regardless of the
VM_MAX_READAHEAD value. This prevents a rapid ramp-up
that could be caused due to increasing VM_MAX_READAHEAD.
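A sketch of the changed policy, modeled on get_init_ra_size() in
mm/readahead.c:

static unsigned long get_init_ra_size(unsigned long size, unsigned long max)
{
    unsigned long newsize = roundup_pow_of_two(size);

    if (newsize == 1)
        newsize = newsize * 4;  /* x4 only for single-page requests */
    else if (newsize <= max / 4)
        newsize = newsize * 2;
    else
        newsize = max;

    return newsize;
}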
Change-Id: I93d59c515d7e6c6d62348790980ff7bd4f434997
Signed-off-by: Lee Susman <lsusman@codeaurora.org>
Memory watermarks were sometimes preventing CMA allocations
in low memory.
Change-Id: I550ec987cbd6bc6dadd72b4a764df20cd0758479
Signed-off-by: Liam Mark <lmark@codeaurora.org>
CMA allocations rely on being able to migrate pages out
quickly to fulfill the allocations. Most use cases for
movable allocations meet this requirement. File system
allocations may take an unacceptably long time to
migrate, which creates delays from CMA. Prevent CMA
pages from ending up on the per-cpu lists to avoid
code paths grabbing CMA pages on the fast path. CMA
pages can still be allocated as a fallback under tight
memory pressure.
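A sketch of the free-path change, against a 3.x-era
free_hot_cold_page() (exact signatures vary by kernel version):

void free_hot_cold_page(struct page *page, bool cold)
{
    struct zone *zone = page_zone(page);
    int migratetype = get_pageblock_migratetype(page);

    /* Divert CMA pages past the per-cpu lists, straight to buddy. */
    if (is_migrate_cma(migratetype)) {
        free_one_page(zone, page, page_to_pfn(page), 0, migratetype);
        return;
    }
    /* ... normal per-cpu list path ... */
}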
CRs-Fixed: 452508
Change-Id: I79a28f697275a2a1870caabae53c8ea345b4b47d
Signed-off-by: Laura Abbott <lauraa@codeaurora.org>
Bring back the is_cma_pageblock definition for determining if a
page is CMA or not.
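The restored helper, roughly (placement and config guards elided):

static inline bool is_cma_pageblock(struct page *page)
{
    return get_pageblock_migratetype(page) == MIGRATE_CMA;
}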
Change-Id: I39fd546e22e240b752244832c79514f109c8e84b
Signed-off-by: Laura Abbott <lauraa@codeaurora.org>
Allow the kswapd cpu affinity to be configured.
There can be power benefits on certain targets when limiting kswapd
to run only on certain cores.
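A sketch, assuming the affinity arrives as a Kconfig string (the
symbol name here is illustrative):

static void kswapd_set_affinity(struct task_struct *tsk)
{
    cpumask_var_t mask;

    if (!alloc_cpumask_var(&mask, GFP_KERNEL))
        return;

    /* e.g. CONFIG_KSWAPD_CPU_AFFINITY="0-3" */
    if (!cpulist_parse(CONFIG_KSWAPD_CPU_AFFINITY, mask) &&
        cpumask_intersects(mask, cpu_online_mask))
        set_cpus_allowed_ptr(tsk, mask);

    free_cpumask_var(mask);
}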
CRs-fixed: 752344
Change-Id: I8a83337ff313a7e0324361140398226a09f8be0f
Signed-off-by: Liam Mark <lmark@codeaurora.org>
[imaund@codeaurora.org: Resolved trivial context conflicts.]
Signed-off-by: Ian Maund <imaund@codeaurora.org>
A workaround was added earlier to move a page to active
list if swapping to devices like zram fails. But this
can result in try_to_free_swap being called from
shrink_page_list, without a properly locked page.
Lock the page when we indicate to activate a page
in pageout().
Add a check to ensure that the error is on swap, and
clear the error flag before moving the page to the
active list.
CRs-fixed: 760049
Change-Id: I77a8bbd6ed13efdec943298fe9448412feeac176
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
Ensure that shrinkers are given the option to completely drop
their caches even when their caches are smaller than the batch size.
This change helps improve memory headroom by ensuring that under
significant memory pressure shrinkers can drop all of their caches.
This change only attempts to more aggressively call the shrinkers
during background memory reclaim, in order to avoid hurting the
performance of direct memory reclaim.
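A sketch of the loop change in do_shrink_slab(), under the assumption
that only kswapd gets the final partial batch:

while (total_scan >= batch_size ||
       (current_is_kswapd() && total_scan > 0)) {
    unsigned long ret;
    unsigned long nr_to_scan = min(batch_size, total_scan);

    shrinkctl->nr_to_scan = nr_to_scan;
    ret = shrinker->scan_objects(shrinker, shrinkctl);
    if (ret == SHRINK_STOP)
        break;
    freed += ret;

    total_scan -= nr_to_scan;
    cond_resched();
}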
Change-Id: I8dbc29c054add639e4810e36fd2c8a063e5c52f3
Signed-off-by: Liam Mark <lmark@codeaurora.org>
If the SLUB_DEBUG_PANIC_ON Kconfig option is
selected, also panic for object and slab
errors to allow capturing relevant debug
data.
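A sketch of one way to wire this in, via the existing
object_err()/slab_err() reporting paths:

static void slab_panic(const char *cause)
{
    if (IS_ENABLED(CONFIG_SLUB_DEBUG_PANIC_ON))
        panic("%s\n", cause);
}

/* e.g. called at the end of object_err() / slab_err() */
slab_panic("object/slab error");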
Change-Id: Idc582ef48d3c0d866fa89cf8660ff0a5402f7e15
Signed-off-by: David Keitel <dkeitel@codeaurora.org>
Add the DEBUG_SLUB_PANIC_ON option to Kconfig, preventing
the existing defconfig option from being overwritten
by make config.
This will induce a panic if slab debug catches corruptions
within the padding of a given object.
The intention here is to induce collection of data
immediately after the corruption is detected, with
the goal of catching the possible source of the corruption.
Change-Id: Ide0102d0761022c643a761989360ae5c853870a8
Signed-off-by: David Keitel <dkeitel@codeaurora.org>
[imaund@codeaurora.org: Resolved trivial merge conflicts.]
Signed-off-by: Ian Maund <imaund@codeaurora.org>
[lmark@codeaurora.org: ensure change does not create
arch/arm64/configs/msm8994_defconfig file]
Signed-off-by: Liam Mark <lmark@codeaurora.org>
Allow other functions to dump the list of tasks.
Useful for when debugging memory leaks.
Change-Id: I76c33a118a9765b4c2276e8c76de36399c78dbf6
Signed-off-by: Liam Mark <lmark@codeaurora.org>
KSM is yet another framework which may obfuscate some memory
problems. Use the showmem notifier to show how KSM is being
used to give some insight into potential issues or non-issues.
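A sketch of the notifier, assuming it lives in mm/ksm.c next to the
counters:

static int ksm_show_mem_notifier(struct notifier_block *nb,
                                 unsigned long action, void *data)
{
    pr_info("ksm: pages_shared=%lu pages_sharing=%lu unshared=%lu\n",
            ksm_pages_shared, ksm_pages_sharing, ksm_pages_unshared);
    return 0;
}

static struct notifier_block ksm_show_mem_nb = {
    .notifier_call = ksm_show_mem_notifier,
};

/* registered once with show_mem_notifier_register(&ksm_show_mem_nb) */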
Change-Id: If82405dc33f212d085e6847f7c511fd4d0a32a10
Signed-off-by: Laura Abbott <lauraa@codeaurora.org>
Resiliency was added to slub for production systems, in an
attempt to repair corruptions and allow production environments
to continue to run.
In debug setups, this may not be desirable. Thus, rather than
attempting to restore corrupted bytes in poisoned zones, panic
to attempt to catch more context of what was going on in the
system at the time.
Add the CONFIG_SLUB_DEBUG_PANIC_ON defconfig option to allow
debug builds to turn on this panic option.
Change-Id: I01763e8eea40a4544e9b7e48c4e4d40840b6c82d
Signed-off-by: David Keitel <dkeitel@codeaurora.org>
The KSM thread that scans pages is scheduled on a definite timeout.
That wakes up the CPU from idle state and hence may affect power
consumption. Provide optional support for using a deferred timer,
which suits low-power use cases.
To enable deferred timers,
$ echo 1 > /sys/kernel/mm/ksm/deferred_timer
Change-Id: I07fe199f97fe1f72f9a9e1b0b757a3ac533719e8
Signed-off-by: Chintan Pandya <cpandya@codeaurora.org>
When performing memory reclaim, support treating anonymous and
file-backed pages equally.
Swapping anonymous pages out to memory can be efficient enough
to justify treating anonymous and file-backed pages equally.
CRs-Fixed: 648984
Change-Id: I6315b8557020d1e27a34225bb9cefbef1fb43266
Signed-off-by: Liam Mark <lmark@codeaurora.org>
Move pages that fail swapout to the LRU active list to reduce
pressure on swap device when swapping out is already failing.
This helps when using a pseudo swap device such as zram which
starts failing when memory is low.
Change-Id: Ib136cd0a744378aa93d837a24b9143ee818c80b3
Signed-off-by: Olav Haugan <ohaugan@codeaurora.org>
free_bootmem_late is currently set up to only be used in init
functions. Some clients need to use this function past initcalls.
The functions themselves have no restriction on being used later,
apart from the __init annotations, so remove the annotations.
Change-Id: I7c7e15cf2780a8843ebb4610da5b633c9abb0b3d
Signed-off-by: Laura Abbott <lauraa@codeaurora.org>
[abhimany@codeaurora.org: resolve minor conflict
and remove __init from nobootmem.c]
Signed-off-by: Abhimanyu Kapur <abhimany@codeaurora.org>
Drivers have a tendency to scribble on everything, including free
pages. Make life easier by marking free pages as read only when
on the buddy list and re-marking as read/write when allocating.
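A sketch of the mechanism, using the debug_pagealloc hook as the
attachment point (the arch must provide set_memory_ro/rw):

void kernel_map_pages(struct page *page, int numpages, int enable)
{
    unsigned long addr = (unsigned long)page_address(page);

    if (enable)
        set_memory_rw(addr, numpages);  /* page leaves the buddy list */
    else
        set_memory_ro(addr, numpages);  /* page joins the buddy list */
}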
Change-Id: I978ed2921394919917307b9c99217fdc22f82c59
Signed-off-by: Laura Abbott <lauraa@codeaurora.org>
(cherry picked from commit 752f5aecb0511c4d661dce2538c723675c1e6449)
There are many drivers in the kernel which can hold on
to lots of memory. It can be useful to dump out all those
drivers at key points in the kernel. Introduce a notifier
framework for dumping this information. When the notifiers
are called, drivers can dump out the state of any memory
they may be using.
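A sketch of the framework (this is the head that the atomic-notifier
patch above later converts):

static BLOCKING_NOTIFIER_HEAD(show_mem_notifier);

int show_mem_notifier_register(struct notifier_block *nb)
{
    return blocking_notifier_chain_register(&show_mem_notifier, nb);
}

int show_mem_notifier_unregister(struct notifier_block *nb)
{
    return blocking_notifier_chain_unregister(&show_mem_notifier, nb);
}

void show_mem_call_notifiers(void)
{
    blocking_notifier_call_chain(&show_mem_notifier, 0, NULL);
}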
Change-Id: Ifb2946964bf5d072552dd56d8d6dfdd794af6d84
Signed-off-by: Laura Abbott <lauraa@codeaurora.org>
Add a new function, memblock_overlaps_memory(), to check if a
region overlaps with a memory bank. This will be used by
peripheral loader code to detect when kernel memory would be
overwritten.
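A sketch of the helper, assuming it sits in mm/memblock.c next to the
region bookkeeping:

int __init_memblock memblock_overlaps_memory(phys_addr_t base,
                                             phys_addr_t size)
{
    unsigned long i;

    for (i = 0; i < memblock.memory.cnt; i++) {
        struct memblock_region *reg = &memblock.memory.regions[i];

        /* Classic interval-overlap test. */
        if (base < reg->base + reg->size && reg->base < base + size)
            return 1;
    }
    return 0;
}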
Change-Id: I851f8f416a0f36e85c0e19536b5209f7d4bd431c
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
(cherry picked from commit cc2753448d9f2adf48295f935a7eee36023ba8d3)
Signed-off-by: Josh Cartwright <joshc@codeaurora.org>
(cherry picked from commit https://lkml.org/lkml/2015/12/21/337)
ASLR only uses as few as 8 bits to generate the random offset for the
mmap base address on 32 bit architectures. This value was chosen to
prevent a poorly chosen value from dividing the address space in such
a way as to prevent large allocations. This may not be an issue on all
platforms. Allow the specification of a minimum number of bits so that
platforms desiring greater ASLR protection may determine where to place
the trade-off.
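A sketch of the 32-bit arch side, assuming the Kconfig knobs from the
referenced series (CONFIG_ARCH_MMAP_RND_BITS, clamped between the new
per-arch minimum and maximum):

int mmap_rnd_bits __read_mostly = CONFIG_ARCH_MMAP_RND_BITS;

unsigned long arch_mmap_rnd(void)
{
    unsigned long rnd;

    /* 8 bits -> 1MB of entropy with 4K pages; more if configured. */
    rnd = (unsigned long)get_random_int() & ((1UL << mmap_rnd_bits) - 1);
    return rnd << PAGE_SHIFT;
}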
Bug: 24047224
Signed-off-by: Daniel Cashman <dcashman@android.com>
Signed-off-by: Daniel Cashman <dcashman@google.com>
Change-Id: Ibf9ed3d4390e9686f5cc34f605d509a20d40e6c2
Userspace processes often have multiple allocators that each do
anonymous mmaps to get memory. When examining memory usage of
individual processes or systems as a whole, it is useful to be
able to break down the various heaps that were allocated by
each layer and examine their size, RSS, and physical memory
usage.
This patch adds a user pointer to the shared union in
vm_area_struct that points to a null terminated string inside
the user process containing a name for the vma. vmas that
point to the same address will be merged, but vmas that
point to equivalent strings at different addresses will
not be merged.
Userspace can set the name for a region of memory by calling
prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned long)name);
Setting the name to NULL clears it.
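For illustration, a minimal userspace sketch (the PR_SET_VMA
constants follow this patch's uapi additions):

#include <sys/mman.h>
#include <sys/prctl.h>

#ifndef PR_SET_VMA
#define PR_SET_VMA              0x53564d41
#define PR_SET_VMA_ANON_NAME    0
#endif

int main(void)
{
    size_t len = 1 << 20;
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    /* The region then appears as [anon:my heap] in /proc/self/maps. */
    prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME,
          (unsigned long)p, len, (unsigned long)"my heap");
    return 0;
}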
The names of named anonymous vmas are shown in /proc/pid/maps
as [anon:<name>] and in /proc/pid/smaps in a new "Name" field
that is only present for named vmas. If the userspace pointer
is no longer valid all or part of the name will be replaced
with "<fault>".
The idea to store a userspace pointer to reduce the complexity
within mm (at the expense of the complexity of reading
/proc/pid/mem) came from Dave Hansen. This results in no
runtime overhead in the mm subsystem other than comparing
the anon_name pointers when considering vma merging. The pointer
is stored in a union with fields that are only used on file-backed
mappings, so it does not increase memory usage.
Includes a fix from Jed Davis <jld@mozilla.com> for a typo in
prctl_set_vma_anon_name which could attempt to set the name
across two vmas at the same time and might thereby
corrupt the vma list. Fix it to use tmp instead of end to limit
the name setting to a single vma at a time.
Change-Id: I9aa7b6b5ef536cd780599ba4e2fba8ceebe8b59f
Signed-off-by: Dmitry Shmidt <dimitrysh@google.com>
Add a userspace visible knob to tell the VM to keep an extra amount
of memory free, by increasing the gap between each zone's min and
low watermarks.
This is useful for realtime applications that call system
calls and have a bound on the number of allocations that happen
in any short time period. In this application, extra_free_kbytes
would be left at an amount equal to or larger than the
maximum number of allocations that happen in any burst.
It may also be useful to reduce the memory use of virtual
machines (temporarily?), in a way that does not cause memory
fragmentation like ballooning does.
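A sketch of the watermark math in __setup_per_zone_wmarks(), with the
extra pages distributed across zones by size ("min" is the zone's
computed min watermark, as before):

u64 low = (u64)(extra_free_kbytes >> (PAGE_SHIFT - 10)) *
          zone->managed_pages;
do_div(low, vm_total_pages);

zone->watermark[WMARK_LOW]  = min_wmark_pages(zone) + low + (min >> 2);
zone->watermark[WMARK_HIGH] = min_wmark_pages(zone) + low + (min >> 1);

The knob itself would be read from /proc/sys/vm/extra_free_kbytes.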
[ccross]
Revived for use on old kernels where no other solution exists.
The tunable will be removed on kernels that do better at avoiding
direct reclaim.
Change-Id: I765a42be8e964bfd3e2886d1ca85a29d60c3bb3e
Signed-off-by: Rik van Riel<riel@redhat.com>
Signed-off-by: Colin Cross <ccross@android.com>
This patch adds a debugfs file called "shrinker" which, when read,
calls all the shrinkers in the system with nr_to_scan set to zero and
prints the result. These results are the number of objects the
shrinkers have available and can thus be used as an indication of the
total memory that would be available to the system if a shrink occurred.
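A sketch of the read handler, against the old ->shrink() API
(shrinker_list and shrinker_rwsem live in mm/vmscan.c):

static int debug_shrinker_show(struct seq_file *s, void *unused)
{
    struct shrinker *shrinker;
    struct shrink_control sc = {
        .gfp_mask = GFP_KERNEL,
        .nr_to_scan = 0,    /* query only, don't reclaim */
    };

    down_read(&shrinker_rwsem);
    list_for_each_entry(shrinker, &shrinker_list, list) {
        int num_objs = shrinker->shrink(shrinker, &sc);

        seq_printf(s, "%pf %d\n", shrinker->shrink, num_objs);
    }
    up_read(&shrinker_rwsem);
    return 0;
}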
Change-Id: Ied0ee7caff3d2fc1cb4bb839aaafee81b5b0b143
Signed-off-by: Rebecca Schultz Zavin <rebecca@android.com>
A spare array holding mem cgroup threshold events is kept around
to make sure we can always safely deregister an event and have an
array to store the new set of events in.
In the scenario where we're going from 1 to 0 registered events, the
pointer to the primary array containing 1 event is copied to the spare
slot, and then the spare slot is freed because no events are left.
However, it is freed before calling synchronize_rcu(), which means
readers may still be accessing threshold->primary after it is freed.
Fixed by only freeing after synchronize_rcu().
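A sketch of the reordered unregister path:

    /* Swap primary and spare array. */
    thresholds->spare = thresholds->primary;
    rcu_assign_pointer(thresholds->primary, new);

    /* Make sure nobody still uses the old primary... */
    synchronize_rcu();

    /* ...and only then free the spare if no events are left. */
    if (!new) {
        kfree(thresholds->spare);
        thresholds->spare = NULL;
    }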
Signed-off-by: Martijn Coenen <maco@google.com>
Refactor *allow_attach() handler to align it with the changes
from mainline commit 1f7dd3e5a6 "cgroup: fix handling of
multi-destination migration from subtree_control enabling".
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
Pass correct argument to subsys_cgroup_allow_attach(), which
expects 'struct cgroup_subsys_state *' argument but we pass
'struct cgroup *' instead which doesn't seem right.
This fixes following 'incompatible pointer type' compiler warning:
----------
CC mm/memcontrol.o
mm/memcontrol.c: In function ‘mem_cgroup_allow_attach’:
mm/memcontrol.c:5052:2: warning: passing argument 1 of ‘subsys_cgroup_allow_attach’ from incompatible pointer type [enabled by default]
In file included from include/linux/memcontrol.h:22:0,
from mm/memcontrol.c:29:
include/linux/cgroup.h:953:5: note: expected ‘struct cgroup_subsys_state *’ but argument is of type ‘struct cgroup *’
----------
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
Use the 'allow_attach' handler for the 'mem' cgroup to allow
non-root processes to add arbitrary processes to a 'mem' cgroup
if it has the CAP_SYS_NICE capability set.
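A sketch of the policy helper; the exact taskset iteration differs
across kernel versions:

static int subsys_cgroup_allow_attach(struct cgroup_subsys_state *css,
                                      struct cgroup_taskset *tset)
{
    const struct cred *cred = current_cred(), *tcred;
    struct task_struct *task;

    /* CAP_SYS_NICE holders may attach any task. */
    if (capable(CAP_SYS_NICE))
        return 0;

    cgroup_taskset_for_each(task, tset) {
        tcred = __task_cred(task);
        if (current != task &&
            !uid_eq(cred->euid, tcred->uid) &&
            !uid_eq(cred->euid, tcred->suid))
            return -EACCES;
    }
    return 0;
}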
Bug: 18260435
Change-Id: If7d37bf90c1544024c4db53351adba6a64966250
Signed-off-by: Rom Lemarchand <romlem@android.com>
NOT FOR STAGING
This patch re-adds the original shmem_set_file to mm/shmem.c
and converts ashmem.c back to using it.
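The re-added helper, roughly as it existed before its removal:

void shmem_set_file(struct vm_area_struct *vma, struct file *file)
{
    if (vma->vm_file)
        fput(vma->vm_file);
    vma->vm_file = file;
    vma->vm_ops = &shmem_vm_ops;
}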
CC: Brian Swetland <swetland@google.com>
CC: Colin Cross <ccross@android.com>
CC: Arve Hjønnevåg <arve@android.com>
CC: Dima Zavin <dima@android.com>
CC: Robert Love <rlove@google.com>
CC: Greg KH <greg@kroah.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
kernel test robot has reported the following crash:
BUG: unable to handle kernel NULL pointer dereference at 00000100
IP: [<c1074df6>] __queue_work+0x26/0x390
*pdpt = 0000000000000000 *pde = f000ff53f000ff53 *pde = f000ff53f000ff53
Oops: 0000 [#1] PREEMPT PREEMPT SMP SMP
CPU: 0 PID: 24 Comm: kworker/0:1 Not tainted 4.4.0-rc4-00139-g373ccbe #1
Workqueue: events vmstat_shepherd
task: cb684600 ti: cb7ba000 task.ti: cb7ba000
EIP: 0060:[<c1074df6>] EFLAGS: 00010046 CPU: 0
EIP is at __queue_work+0x26/0x390
EAX: 00000046 EBX: cbb37800 ECX: cbb37800 EDX: 00000000
ESI: 00000000 EDI: 00000000 EBP: cb7bbe68 ESP: cb7bbe38
DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
CR0: 8005003b CR2: 00000100 CR3: 01fd5000 CR4: 000006b0
Stack:
Call Trace:
__queue_delayed_work+0xa1/0x160
queue_delayed_work_on+0x36/0x60
vmstat_shepherd+0xad/0xf0
process_one_work+0x1aa/0x4c0
worker_thread+0x41/0x440
kthread+0xb0/0xd0
ret_from_kernel_thread+0x21/0x40
The reason is that start_shepherd_timer schedules the shepherd work item
which uses vmstat_wq (vmstat_shepherd) before setup_vmstat allocates
that workqueue so if the further initialization takes more than HZ we
might end up scheduling on a NULL vmstat_wq. This is really unlikely
but not impossible.
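A sketch of the fix: create the workqueue before the shepherd can
possibly be queued on it.

static void __init start_shepherd_timer(void)
{
    int cpu;

    for_each_possible_cpu(cpu)
        INIT_DEFERRABLE_WORK(per_cpu_ptr(&vmstat_work, cpu),
                             vmstat_update);

    /* Allocate vmstat_wq first... */
    vmstat_wq = alloc_workqueue("vmstat", WQ_FREEZABLE | WQ_MEM_RECLAIM, 0);
    /* ...then arm the shepherd. */
    schedule_delayed_work(&shepherd,
                          round_jiffies_relative(sysctl_stat_interval));
}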
Fixes: 373ccbe592 ("mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make any progress")
Reported-by: kernel test robot <ying.huang@linux.intel.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Tested-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: stable@vger.kernel.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mod_zone_page_state() takes a "delta" integer argument. delta contains
the number of pages that should be added or subtracted from a struct
zone's vm_stat field.
If a zone is larger than 8TB this will cause overflows. E.g. for a
zone with a size slightly larger than 8TB the line
mod_zone_page_state(zone, NR_ALLOC_BATCH, zone->managed_pages);
in mm/page_alloc.c:free_area_init_core() will result in a negative
result for the NR_ALLOC_BATCH entry within the zone's vm_stat, since 8TB
contain 0x8xxxxxxx pages which will be sign extended to a negative
value.
Fix this by changing the delta argument to long type.
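The widened signatures (sketch):

void mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
                         long delta);   /* was: int delta */
void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
                           long delta); /* was: int delta */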
This could fix an early boot problem seen on s390, where we have a 9TB
system with only one node. ZONE_DMA contains 2GB and ZONE_NORMAL the
rest. The system is trying to allocate a GFP_DMA page but ZONE_DMA is
completely empty, so it tries to reclaim pages in an endless loop.
This was seen on a heavily patched 3.10 kernel. One possible
explanation seems to be the overflows caused by mod_zone_page_state().
Unfortunately I did not have the chance to verify that this patch
actually fixes the problem, since I don't have access to the system
right now. However the overflow problem does exist anyway.
Given the description that a system with slightly less than 8TB does
work, this seems to be a candidate for the observed problem.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
test_pages_in_a_zone() does not account for the possibility of missing
sections in the given pfn range. pfn_valid_within always returns 1 when
CONFIG_HOLES_IN_ZONE is not set, allowing invalid pfns from missing
sections to pass the test, leading to a kernel oops.
Wrap an additional pfn loop with PAGES_PER_SECTION granularity to check
for missing sections before proceeding into the zone-check code.
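A sketch of the added outer loop (section-boundary alignment elided):

for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
    if (!present_section_nr(pfn_to_section_nr(pfn)))
        continue;   /* skip holes left by missing sections */

    /* ... existing per-page zone check over this section ... */
}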
This also prevents a crash from offlining memory devices with missing
sections. Despite this, it may be a good idea to keep the related patch
'[PATCH 3/3] drivers: memory: prohibit offlining of memory blocks with
missing sections' because missing sections in a memory block may lead to
other problems not covered by the scope of this fix.
Signed-off-by: Andrew Banman <abanman@sgi.com>
Acked-by: Alex Thorlton <athorlton@sgi.com>
Cc: Russ Anderson <rja@sgi.com>
Cc: Alex Thorlton <athorlton@sgi.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Greg KH <greg@kroah.com>
Cc: Seth Jennings <sjennings@variantweb.net>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Memory cgroup reclaim can be interrupted with mem_cgroup_iter_break()
once enough pages have been reclaimed, in which case, in contrast to a
full round-trip over a cgroup sub-tree, the current position stored in
mem_cgroup_reclaim_iter of the target cgroup does not get invalidated
and so is left holding the reference to the last scanned cgroup. If the
target cgroup does not get scanned again (we might have just reclaimed
the last page or all processes might exit and free their memory
voluntary), we will leak it, because there is nobody to put the
reference held by the iterator.
The problem is easy to reproduce by running the following command
sequence in a loop:
mkdir /sys/fs/cgroup/memory/test
echo 100M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
echo $$ > /sys/fs/cgroup/memory/test/cgroup.procs
memhog 150M
echo $$ > /sys/fs/cgroup/memory/cgroup.procs
rmdir test
The cgroups generated by it will never get freed.
This patch fixes this issue by making mem_cgroup_iter avoid taking
reference to the current position. In order not to hit use-after-free
bug while running reclaim in parallel with cgroup deletion, we make use
of ->css_released cgroup callback to clear references to the dying
cgroup in all reclaim iterators that might refer to it. This callback
is called right before scheduling rcu work which will free css, so if we
access iter->position from an rcu read section, we can be sure it won't
go away under us.
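A sketch of the hook (the helper name is an assumption):

static void mem_cgroup_css_released(struct cgroup_subsys_state *css)
{
    struct mem_cgroup *memcg = mem_cgroup_from_css(css);

    /* Drop this memcg from any reclaim iterator that points at it. */
    invalidate_reclaim_iterators(memcg);
}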
[hannes@cmpxchg.org: clean up css ref handling]
Fixes: 5ac8fb31ad ("mm: memcontrol: convert reclaim iterator to simple css refcounting")
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@vger.kernel.org> [3.19+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change the use of strncmp in zswap_pool_find_get() to strcmp.
The use of strncmp is no longer correct, now that zswap_zpool_type is
not an array; sizeof() will return the size of a pointer, which isn't
the right length to compare. We don't need to use strncmp anyway,
because the existing params and the passed in params are all guaranteed
to be null terminated, so strcmp should be used.
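The corrected comparison in zswap_pool_find_get(), sketched:

list_for_each_entry_rcu(pool, &zswap_pools, list) {
    /* strcmp, not strncmp: zswap_zpool_type is a pointer now, and
     * both strings are NUL-terminated anyway.
     */
    if (strcmp(zpool_get_type(pool->zpool), type))
        continue;
    /* ... */
}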
Signed-off-by: Dan Streetman <ddstreet@ieee.org>
Reported-by: Weijie Yang <weijie.yang@samsung.com>
Cc: Seth Jennings <sjennings@variantweb.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It's possible that an oom killed victim shares an ->mm with the init
process and thus oom_kill_process() would end up trying to kill init as
well.
This has been shown in practice:
Out of memory: Kill process 9134 (init) score 3 or sacrifice child
Killed process 9134 (init) total-vm:1868kB, anon-rss:84kB, file-rss:572kB
Kill process 1 (init) sharing same memory
...
Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009
And this will result in a kernel panic.
If a process is forked by init and selected for oom kill while still
sharing init_mm, then it's likely this system is in a recoverable state.
However, it's better not to try to kill init and allow the machine to
panic due to unkillable processes.
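A sketch of the guard in the loop that kills tasks sharing the
victim's ->mm:

for_each_process(p) {
    if (p->mm != mm || same_thread_group(p, victim))
        continue;
    if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p))
        continue;   /* never SIGKILL a kthread or init */
    if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
        continue;

    do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
}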
[rientjes@google.com: rewrote changelog]
[akpm@linux-foundation.org: fix inverted test, per Ben]
Signed-off-by: Chen Jie <chenjie6@huawei.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Li Zefan <lizefan@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dmitry Vyukov provides a little program, autogenerated by syzkaller,
which races a fault on a mapping of a sparse memfd object, against
truncation of that object below the fault address: run repeatedly for a
few minutes, it reliably generates shmem_evict_inode()'s
WARN_ON(inode->i_blocks).
(But there's nothing specific to memfd here, nor to the fstat which it
happened to use to generate the fault: though that looked suspicious,
since a shmem_recalc_inode() had been added there recently. The same
problem can be reproduced with open+unlink in place of memfd_create, and
with fstatfs in place of fstat.)
v3.7 commit 0f3c42f522 ("tmpfs: change final i_blocks BUG to WARNING")
explains one cause of such a warning (a race with shmem_writepage to
swap), and possible solutions; but we never took it further, and this
syzkaller incident turns out to have a different cause.
shmem_getpage_gfp()'s error recovery, when a freshly allocated page is
then found to be beyond eof, looks plausible - decrementing the alloced
count that was just before incremented - but in fact can go wrong, if a
racing thread (the truncator, for example) gets its shmem_recalc_inode()
in just after our delete_from_page_cache(). delete_from_page_cache()
decrements nrpages, that shmem_recalc_inode() will balance the books by
decrementing alloced itself, then our decrement of alloced takes it one
too low: leading to the WARNING when the object is finally evicted.
Once the new page has been exposed in the page cache,
shmem_getpage_gfp() must leave it to shmem_recalc_inode() itself to get
the accounting right in all cases (and not fall through from "trunc:" to
"decused:"). Adjust that error recovery block; and the reinitialization
of info and sbinfo can be removed too.
While we're here, fix shmem_writepage() to avoid the original issue: it
will be safe against a racing shmem_recalc_inode(), if it merely
increments swapped before the shmem_delete_from_page_cache() which
decrements nrpages (but it must then do its own shmem_recalc_inode()
before that, while still in balance, instead of after). (Aside: why do
we shmem_recalc_inode() here in the swap path? Because its raison d'etre
is to cope with clean sparse shmem pages being reclaimed behind our
back: so here when swapping is a good place to look for that case.) But
I've not now managed to reproduce this bug, even without the patch.
I don't see why I didn't do that earlier: perhaps inhibited by the
preference to eliminate shmem_recalc_inode() altogether. Driven by this
incident, I do now have a patch to do so at last; but still want to sit
on it for a bit, there's a couple of questions yet to be resolved.
Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dmitry Vyukov reported the following memory leak
unreferenced object 0xffff88002eaafd88 (size 32):
comm "a.out", pid 5063, jiffies 4295774645 (age 15.810s)
hex dump (first 32 bytes):
28 e9 4e 63 00 88 ff ff 28 e9 4e 63 00 88 ff ff (.Nc....(.Nc....
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
kmalloc include/linux/slab.h:458
region_chg+0x2d4/0x6b0 mm/hugetlb.c:398
__vma_reservation_common+0x2c3/0x390 mm/hugetlb.c:1791
vma_needs_reservation mm/hugetlb.c:1813
alloc_huge_page+0x19e/0xc70 mm/hugetlb.c:1845
hugetlb_no_page mm/hugetlb.c:3543
hugetlb_fault+0x7a1/0x1250 mm/hugetlb.c:3717
follow_hugetlb_page+0x339/0xc70 mm/hugetlb.c:3880
__get_user_pages+0x542/0xf30 mm/gup.c:497
populate_vma_page_range+0xde/0x110 mm/gup.c:919
__mm_populate+0x1c7/0x310 mm/gup.c:969
do_mlock+0x291/0x360 mm/mlock.c:637
SYSC_mlock2 mm/mlock.c:658
SyS_mlock2+0x4b/0x70 mm/mlock.c:648
Dmitry identified a potential memory leak in the routine region_chg,
where a region descriptor is not free'ed on an error path.
However, the root cause for the above memory leak resides in region_del.
In this specific case, a "placeholder" entry is created in region_chg.
The associated page allocation fails, and the placeholder entry is left
in the reserve map. This is "by design" as the entry should be deleted
when the map is released. The bug is in the region_del routine which is
used to delete entries within a specific range (and when the map is
released). region_del did not handle the case where a placeholder entry
exactly matched the start of the range to be deleted. In this
case, the entry would not be deleted and leaked. The fix is to take
these special placeholder entries into account in region_del.
The region_chg error path leak is also fixed.
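A sketch of the region_del() change: a placeholder entry has
from == to, and one sitting exactly at the start of the deleted range
must not be skipped.

list_for_each_entry_safe(rg, trg, head, link) {
    /* Skip regions entirely before the range, but treat a
     * placeholder at f (rg->from == rg->to == f) as in-range.
     */
    if (rg->to <= f && (rg->to != rg->from || rg->to != f))
        continue;
    /* ... existing delete/truncate logic ... */
}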
Fixes: feba16e25a ("mm/hugetlb: add region_del() to delete a specific range of entries")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: <stable@vger.kernel.org> [4.3+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>