android_kernel_oneplus_msm8998/mm at e85d624f7164b51b3a9c081d0d9d7e3b5187dab8 - evie/android_kernel_oneplus_msm8998 - Gay Catgirls Forgejo

Andrea Arcangeli 877813e010 mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings

commit ac5b2c18911ffe95c08d69273917f90212cf5659 upstream.

THP allocation might be really disruptive when allocated on NUMA system with the local node full or hard to reclaim. Stefan has posted an allocation stall report on 4.12 based SLES kernel which suggests the same issue:

kvm: page allocation stalls for 194572ms, order:9, mode:0x4740ca(__GFP_HIGHMEM|__GFP_IO|__GFP_FS|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE|__GFP_MOVABLE|__GFP_DIRECT_RECLAIM), nodemask=(null)
kvm cpuset=/ mems_allowed=0-1
CPU: 10 PID: 84752 Comm: kvm Tainted: G W 4.12.0+98-ph #1 SLE15 (unreleased)
Hardware name: Supermicro SYS-1029P-WTRT/X11DDW-NT, BIOS 2.0 12/05/2017
Call Trace:
 dump_stack+0x5c/0x84
 warn_alloc+0xe0/0x180
 __alloc_pages_slowpath+0x820/0xc90
 __alloc_pages_nodemask+0x1cc/0x210
 alloc_pages_vma+0x1e5/0x280
 do_huge_pmd_wp_page+0x83f/0xf00
 __handle_mm_fault+0x93d/0x1060
 handle_mm_fault+0xc6/0x1b0
 __do_page_fault+0x230/0x430
 do_page_fault+0x2a/0x70
 page_fault+0x7b/0x80
 [...]
Mem-Info:
active_anon:126315487 inactive_anon:1612476 isolated_anon:5
 active_file:60183 inactive_file:245285 isolated_file:0
 unevictable:15657 dirty:286 writeback:1 unstable:0
 slab_reclaimable:75543 slab_unreclaimable:2509111
 mapped:81814 shmem:31764 pagetables:370616 bounce:0
 free:32294031 free_pcp:6233 free_cma:0
Node 0 active_anon:254680388kB inactive_anon:1112760kB active_file:240648kB inactive_file:981168kB unevictable:13368kB isolated(anon):0kB isolated(file):0kB mapped:280240kB dirty:1144kB writeback:0kB shmem:95832kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 81225728kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
Node 1 active_anon:250583072kB inactive_anon:5337144kB active_file:84kB inactive_file:0kB unevictable:49260kB isolated(anon):20kB isolated(file):0kB mapped:47016kB dirty:0kB writeback:4kB shmem:31224kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 31897600kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no

The defrag mode is "madvise" and from the above report it is clear that the THP has been allocated for MADV_HUGEPAGE vma.

Andrea has identified that the main source of the problem is __GFP_THISNODE usage:

: The problem is that direct compaction combined with the NUMA
: __GFP_THISNODE logic in mempolicy.c is telling reclaim to swap very
: hard the local node, instead of failing the allocation if there's no
: THP available in the local node.
:
: Such logic was ok until __GFP_THISNODE was added to the THP allocation
: path even with MPOL_DEFAULT.
:
: The idea behind the __GFP_THISNODE addition, is that it is better to
: provide local memory in PAGE_SIZE units than to use remote NUMA THP
: backed memory. That largely depends on the remote latency though, on
: threadrippers for example the overhead is relatively low in my
: experience.
:
: The combination of __GFP_THISNODE and __GFP_DIRECT_RECLAIM results in
: extremely slow qemu startup with vfio, if the VM is larger than the
: size of one host NUMA node. This is because it will try very hard to
: unsuccessfully swapout get_user_pages pinned pages as result of the
: __GFP_THISNODE being set, instead of falling back to PAGE_SIZE
: allocations and instead of trying to allocate THP on other nodes (it
: would be even worse without vfio type1 GUP pins of course, except it'd
: be swapping heavily instead).

Fix this by removing __GFP_THISNODE for THP requests which are requesting the direct reclaim. This effectively reverts `5265047ac3` on the grounds that the zone/node reclaim was known to be disruptive due to premature reclaim when there was memory free. While it made sense at the time for HPC workloads without NUMA awareness on rare machines, it was ultimately harmful in the majority of cases. The existing behaviour is similar, if not as widespread, as it applies to a corner case but crucially, it cannot be tuned around like zone_reclaim_mode can. The default behaviour should always be to cause the least harm for the common case.

If there are specialised use cases out there that want zone_reclaim_mode in specific cases, then it can be built on top. Longterm we should consider a memory policy which allows for the node reclaim like behavior for the specific memory ranges which would allow a

[1] http://lkml.kernel.org/r/20180820032204.9591-1-aarcange@redhat.com

Mel said:

: Both patches look correct to me but I'm responding to this one because
: it's the fix. The change makes sense and moves further away from the
: severe stalling behaviour we used to see with both THP and zone reclaim
: mode.
:
: I put together a basic experiment with usemem configured to reference a
: buffer multiple times that is 80% the size of main memory on a 2-socket
: box with symmetric node sizes and defrag set to "always". The defrag
: setting is not the default but it would be functionally similar to
: accessing a buffer with madvise(MADV_HUGEPAGE). Usemem is configured to
: reference the buffer multiple times and while it's not an interesting
: workload, it would be expected to complete reasonably quickly as it fits
: within memory. The results were;
:
: usemem
:                         vanilla          noreclaim-v1
: Amean  Elapsd-1    42.78 (  0.00%)    26.87 ( 37.18%)
: Amean  Elapsd-3    27.55 (  0.00%)     7.44 ( 73.00%)
: Amean  Elapsd-4     5.72 (  0.00%)     5.69 (  0.45%)
:
: This shows the elapsed time in seconds for 1 thread, 3 threads and 4
: threads referencing buffers 80% the size of memory. With the patches
: applied, it's 37.18% faster for the single thread and 73% faster with 3
: threads. Note that 4 threads showing little difference does not indicate
: the problem is related to thread counts. It's simply the case that 4
: threads gets spread so their workload mostly fits in one node.
:
: The overall view from /proc/vmstat is more startling
:
:                                 4.19.0-rc1     4.19.0-rc1
:                                    vanilla noreclaim-v1r1
: Minor Faults                      35593425         708164
: Major Faults                        484088             36
: Swap Ins                           3772837              0
: Swap Outs                          3932295              0
:
: Massive amounts of swap in/out without the patch
:
: Direct pages scanned               6013214              0
: Kswapd pages scanned                     0              0
: Kswapd pages reclaimed                   0              0
: Direct pages reclaimed             4033009              0
:
: Lots of reclaim activity without the patch
:
: Kswapd efficiency                     100%           100%
: Kswapd velocity                      0.000          0.000
: Direct efficiency                      67%           100%
: Direct velocity                  11191.956          0.000
:
: Mostly from direct reclaim context as you'd expect without the patch.
:
: Page writes by reclaim         3932314.000          0.000
: Page writes file                        19              0
: Page writes anon                   3932295              0
: Page reclaim immediate               42336              0
:
: Writes from reclaim context is never good but the patch eliminates it.
:
: We should never have default behaviour to thrash the system for such a
: basic workload. If zone reclaim mode behaviour is ever desired but on a
: single task instead of a global basis then the sensible option is to build
: a mempolicy that enforces that behaviour.

This was a severe regression compared to previous kernels that made important workloads unusable and it started when __GFP_THISNODE was added to THP allocations under MADV_HUGEPAGE. It is not a significant risk to go to the previous behavior before __GFP_THISNODE was added, it worked like that for years.

This was simply an optimization to some lucky workloads that can fit in a single node, but it ended up breaking the VM for others that can't possibly fit in a single node, so going back is safe.

[mhocko@suse.com: rewrote the changelog based on the one from Andrea]
Link: http://lkml.kernel.org/r/20180925120326.24392-2-mhocko@kernel.org
Fixes: `5265047ac3` ("mm, thp: really limit transparent hugepage allocation to local node")
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Stefan Priebe <s.priebe@profihost.ag>
Debugged-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Mel Gorman <mgorman@techsingularity.net>
Tested-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: <stable@vger.kernel.org> [4.1+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

2018-11-21 09:27:42 +01:00
..
kasan	kasan: fix shadow_size calculation error in kasan_module_alloc	2018-08-24 13:26:58 +02:00
backing-dev.c	writeback: fix the wrong congested state variable definition	2018-04-08 11:51:56 +02:00
balloon_compaction.c	virtio_balloon: fix race between migration and ballooning	2016-03-03 15:07:18 -08:00
bootmem.c	bootmem: avoid freeing to bootmem after bootmem is done	2015-09-08 15:35:28 -07:00
cleancache.c	cleancache: remove limit on the number of cleancache enabled filesystems	2015-04-14 16:49:03 -07:00
cma.c	cma: fix calculation of aligned offset	2018-01-31 12:06:09 +01:00
cma.h	mm: cma: mark cma_bitmap_maxno() inline in header	2015-08-14 15:56:32 -07:00
cma_debug.c	mm/cma_debug: correct size input to bitmap function	2015-07-17 16:39:54 -07:00
compaction.c	mm/compaction: pass only pageblock aligned range to pageblock_pfn_to_page	2018-01-17 09:35:26 +01:00
debug-pagealloc.c	mm, hwpoison: fixup "mm: check the return value of lookup_page_ext for all call sites"	2017-11-24 11:26:29 +01:00
debug.c	mm: get rid of vmacache_flush_all() entirely	2018-09-19 22:49:00 +02:00
dmapool.c	mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd	2015-11-06 17:50:42 -08:00
early_ioremap.c	mm/early_ioremap: Fix boot hang with earlyprintk=efi,keep	2018-02-25 11:03:41 +01:00
fadvise.c	mm/fadvise.c: fix signed overflow UBSAN complaint	2018-09-15 09:40:38 +02:00
failslab.c	mm, page_alloc: rename __GFP_WAIT to __GFP_RECLAIM	2015-11-06 17:50:42 -08:00
filemap.c	mm: filemap: avoid unnecessary calls to lock_page when waiting for IO to complete during a read	2018-05-26 08:48:54 +02:00
frame_vector.c	mm: fix docbook comment for get_vaddr_frames()	2015-11-05 19:34:48 -08:00
frontswap.c	frontswap: allow multiple backends	2015-06-24 17:49:45 -07:00
gup.c	mm: do not bug_on on incorrect length in __mm_populate()	2018-11-21 09:27:41 +01:00
highmem.c	mm/highmem: make kmap cache coloring aware	2014-08-06 18:01:22 -07:00
huge_memory.c	mremap: properly flush TLB before releasing the page	2018-11-10 07:41:42 -08:00
hugetlb.c	hugetlbfs: dirty pages as they are added to pagecache	2018-11-21 09:27:35 +01:00
hugetlb_cgroup.c	mm: make compound_head() robust	2015-11-06 17:50:42 -08:00
hwpoison-inject.c	hwpoison: use page_cgroup_ino for filtering by memcg	2015-09-10 13:29:01 -07:00
init-mm.c	mm: Add a user_ns owner to mm_struct and fix ptrace permission checks	2017-01-06 11:16:11 +01:00
internal.h	mm, mprotect: flush TLB if potentially racing with a parallel reclaim leaving stale TLB entries	2017-08-11 09:08:50 -07:00
interval_tree.c	mm: replace vma->sharead.linear with vma->shared	2015-02-10 14:30:31 -08:00
Kconfig	mm: don't allow deferred pages with NEED_PER_CPU_KM	2018-05-26 08:48:55 +02:00
Kconfig.debug	mm/debug_pagealloc: remove obsolete Kconfig options	2015-01-08 15:10:52 -08:00
kmemcheck.c	mm/slab_common: move kmem_cache definition to internal header	2014-10-09 22:25:50 -04:00
kmemleak-test.c
kmemleak.c	mm/kmemleak.c: wait for scan completion before disabling free	2018-05-30 07:49:06 +02:00
ksm.c	mm/ksm: fix interaction with THP	2018-05-30 07:49:08 +02:00
list_lru.c	mm/list_lru.c: fix list_lru_count_node() to be race free	2017-07-21 07:44:56 +02:00
maccess.c	mm/maccess.c: actually return -EFAULT from strncpy_from_unsafe	2015-11-05 19:34:48 -08:00
madvise.c	mm: madvise(MADV_DODUMP): allow hugetlbfs pages	2018-10-10 08:52:11 +02:00
Makefile	media updates for v4.3-rc1	2015-09-11 16:42:39 -07:00
memblock.c	mm: consider memblock reservations for deferred memory initialization sizing	2017-06-14 13:16:26 +02:00
memcontrol.c	mm: memcg: fix use after free in mem_cgroup_iter()	2018-07-25 10:18:16 +02:00
memory-failure.c	hwpoison, memcg: forcibly uncharge LRU pages	2018-01-31 12:06:09 +01:00
memory.c	mm/tlb: Remove tlb_remove_table() non-concurrent condition	2018-09-09 20:04:34 +02:00
memory_hotplug.c	base/memory, hotplug: fix a kernel oops in show_valid_zones()	2017-02-09 08:02:47 +01:00
mempolicy.c	mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings	2018-11-21 09:27:42 +01:00
mempool.c	mm/mempool: avoid KASAN marking mempool poison checks as use-after-free	2017-08-12 19:29:09 -07:00
memtest.c	memtest: remove unused header files	2015-09-08 15:35:28 -07:00
migrate.c	Sanitize 'move_pages()' permission checks	2017-08-24 17:02:36 -07:00
mincore.c	mm/mincore: use offset_in_page macro	2015-11-05 19:34:48 -08:00
mlock.c	mlock: fix mlock count can not decrease in race condition	2017-06-07 12:06:01 +02:00
mm_init.c	mm: meminit: remove mminit_verify_page_links	2015-06-30 19:44:56 -07:00
mmap.c	mm: do not bug_on on incorrect length in __mm_populate()	2018-11-21 09:27:41 +01:00
mmu_context.c	mm/mmu_context, sched/core: Fix mmu_context.h assumption	2017-12-25 14:22:09 +01:00
mmu_notifier.c	mmu-notifier: add clear_young callback	2015-09-10 13:29:01 -07:00
mmzone.c	mm: microoptimize zonelist operations	2015-02-11 17:06:02 -08:00
mprotect.c	x86/speculation/l1tf: Disallow non privileged high MMIO PROT_NONE mappings	2018-08-15 17:42:10 +02:00
mremap.c	mremap: properly flush TLB before releasing the page	2018-11-10 07:41:42 -08:00
msync.c	mm/msync: use offset_in_page macro	2015-11-05 19:34:48 -08:00
nobootmem.c	mm: page_alloc: pass PFN to __free_pages_bootmem	2015-06-30 19:44:55 -07:00
nommu.c	mm/nommu.c: drop unlikely inside BUG_ON()	2015-11-05 19:34:48 -08:00
oom_kill.c	mm/oom_kill.c: avoid attempting to kill init sharing same memory	2015-12-12 10:15:34 -08:00
page-writeback.c	writeback: safer lock nesting	2018-04-24 09:32:12 +02:00
page_alloc.c	mm, page_alloc: do not break __GFP_THISNODE by zonelist reset	2018-07-11 16:03:51 +02:00
page_counter.c	mm: page_counter: let page_counter_try_charge() return bool	2015-11-05 19:34:48 -08:00
page_ext.c	mm/page_ext.c: check if page_ext is not prepared	2017-11-24 08:32:25 +01:00
page_idle.c	mm: introduce idle page tracking	2015-09-10 13:29:01 -07:00
page_io.c	fs: use helper bio_add_page() instead of open coding on bi_io_vec	2015-08-13 12:32:00 -06:00
page_isolation.c	mm: fix invalid node in alloc_migrate_target()	2016-04-20 15:41:53 +09:00
page_owner.c	mm: check the return value of lookup_page_ext for all call sites	2017-11-24 08:32:25 +01:00
pagewalk.c	mm/pagewalk.c: report holes in hugetlb ranges	2017-11-24 08:32:25 +01:00
percpu-km.c	percpu: implmeent pcpu_nr_empty_pop_pages and chunk->nr_populated	2014-09-02 14:46:05 -04:00
percpu-vm.c	percpu: move region iterations out of pcpu_[de]populate_chunk()	2014-09-02 14:46:02 -04:00
percpu.c	percpu: include linux/sched.h for cond_resched()	2018-05-16 10:06:46 +02:00
pgtable-generic.c	mm,thp: khugepaged: call pte flush at the time of collapse	2016-02-25 12:01:23 -08:00
process_vm_access.c	ptrace: use fsuid, fsgid, effective creds for fs access checks	2016-02-25 12:01:16 -08:00
quicklist.c
readahead.c	mm, fs: introduce mapping_gfp_constraint()	2015-11-06 17:50:42 -08:00
rmap.c	mm/rmap: batched invalidations should use existing api	2017-12-25 14:22:09 +01:00
shmem.c	mm: shmem.c: Correctly annotate new inodes for lockdep	2018-09-29 03:08:52 -07:00
slab.c	mm, slab: reschedule cache_reap() on the same CPU	2018-04-24 09:32:05 +02:00
slab.h	slab/slub: adjust kmem_cache_alloc_bulk API	2015-11-22 11:58:44 -08:00
slab_common.c	slub: do not merge cache if slub_debug contains a never-merge flag	2017-10-21 17:09:05 +02:00
slob.c	slab/slub: adjust kmem_cache_alloc_bulk API	2015-11-22 11:58:44 -08:00
slub.c	slub: make ->cpu_partial unsigned int	2018-10-10 08:52:08 +02:00
sparse-vmemmap.c
sparse.c
swap.c	mm: make compound_head() robust	2015-11-06 17:50:42 -08:00
swap_cgroup.c	mm, swap_cgroup: reschedule when neeed in swap_cgroup_swapoff()	2017-07-05 14:37:15 +02:00
swap_state.c	mm: swap: zswap: maybe_preload & refactoring	2015-09-08 15:35:28 -07:00
swapfile.c	x86/speculation/l1tf: Limit swap file size to MAX_PA/2	2018-08-15 17:42:10 +02:00
truncate.c	fs: add i_blocksize()	2017-06-14 13:16:24 +02:00
userfaultfd.c	userfaultfd: avoid mmap_sem read recursion in mcopy_atomic	2015-09-04 16:54:41 -07:00
util.c	proc read mm's {arg,env}_{start,end} with mmap semaphore taken.	2018-05-26 08:48:55 +02:00
vmacache.c	mm: get rid of vmacache_flush_all() entirely	2018-09-19 22:49:00 +02:00
vmalloc.c	mm: vmalloc: avoid racy handling of debugobjects in vunmap	2018-08-06 16:24:30 +02:00
vmpressure.c	mm: vmpressure: fix sending wrong events on underflow	2017-03-12 06:37:25 +01:00
vmscan.c	mm: fix the NULL mapping case in __isolate_lru_page()	2018-06-06 16:46:23 +02:00
vmstat.c	mm/vmstat.c: fix outdated vmstat_text	2018-10-20 09:52:34 +02:00
workingset.c	mm: workingset: fix crash in shadow node shrinker caused by replace_page_cache_page()	2016-10-28 03:01:34 -04:00
zbud.c	mm: zsmalloc: constify struct zs_pool name	2015-11-06 17:50:42 -08:00
zpool.c	mm: zsmalloc: constify struct zs_pool name	2015-11-06 17:50:42 -08:00
zsmalloc.c	zsmalloc: fix zs_can_compact() integer overflow	2016-05-18 17:06:44 -07:00
zswap.c	zswap: re-check zswap_is_full() after do zswap_shrink()	2018-09-05 09:18:36 +02:00