android_kernel_oneplus_msm8998/mm
Vlastimil Babka 18be75139f mm, kswapd: replace kswapd compaction with waking up kcompactd
Similarly to direct reclaim/compaction, kswapd attempts to combine
reclaim and compaction to attempt making memory allocation of given
order available.

The details differ from direct reclaim e.g. in having high watermark as
a goal.  The code involved in kswapd's reclaim/compaction decisions has
evolved to be quite complex.

Testing reveals that it doesn't actually work in at least one scenario,
and closer inspection suggests that it could be greatly simplified
without compromising on the goal (make high-order page available) or
efficiency (don't reclaim too much).  The simplification relieas of
doing all compaction in kcompactd, which is simply woken up when high
watermarks are reached by kswapd's reclaim.

The scenario where kswapd compaction doesn't work was found with mmtests
test stress-highalloc configured to attempt order-9 allocations without
direct reclaim, just waking up kswapd.  There was no compaction attempt
from kswapd during the whole test.  Some added instrumentation shows
what happens:

 - balance_pgdat() sets end_zone to Normal, as it's not balanced
 - reclaim is attempted on DMA zone, which sets nr_attempted to 99, but
   it cannot reclaim anything, so sc.nr_reclaimed is 0
 - for zones DMA32 and Normal, kswapd_shrink_zone uses testorder=0, so
   it merely checks if high watermarks were reached for base pages.
   This is true, so no reclaim is attempted.  For DMA, testorder=0
   wasn't used, as compaction_suitable() returned COMPACT_SKIPPED
 - even though the pgdat_needs_compaction flag wasn't set to false, no
   compaction happens due to the condition sc.nr_reclaimed >
   nr_attempted being false (as 0 < 99)
 - priority-- due to nr_reclaimed being 0, repeat until priority reaches
   0 pgdat_balanced() is false as only the small zone DMA appears
   balanced (curiously in that check, watermark appears OK and
   compaction_suitable() returns COMPACT_PARTIAL, because a lower
   classzone_idx is used there)

Now, even if it was decided that reclaim shouldn't be attempted on the
DMA zone, the scenario would be the same, as (sc.nr_reclaimed=0 >
nr_attempted=0) is also false.  The condition really should use >= as
the comment suggests.  Then there is a mismatch in the check for setting
pgdat_needs_compaction to false using low watermark, while the rest uses
high watermark, and who knows what other subtlety.  Hopefully this
demonstrates that this is unsustainable.

Luckily we can simplify this a lot.  The reclaim/compaction decisions
make sense for direct reclaim scenario, but in kswapd, our primary goal
is to reach high watermark in order-0 pages.  Afterwards we can attempt
compaction just once.  Unlike direct reclaim, we don't reclaim extra
pages (over the high watermark), the current code already disallows it
for good reasons.

After this patch, we simply wake up kcompactd to process the pgdat,
after we have either succeeded or failed to reach the high watermarks in
kswapd, which goes to sleep.  We pass kswapd's order and classzone_idx,
so kcompactd can apply the same criteria to determine which zones are
worth compacting.  Note that we use the classzone_idx from
wakeup_kswapd(), not balanced_classzone_idx which can include higher
zones that kswapd tried to balance too, but didn't consider them in
pgdat_balanced().

Since kswapd now cannot create high-order pages itself, we need to
adjust how it determines the zones to be balanced.  The key element here
is adding a "highorder" parameter to zone_balanced, which, when set to
false, makes it consider only order-0 watermark instead of the desired
higher order (this was done previously by kswapd_shrink_zone(), but not
elsewhere).  This false is passed for example in pgdat_balanced().
Importantly, wakeup_kswapd() uses true to make sure kswapd and thus
kcompactd are woken up for a high-order allocation failure.

The last thing is to decide what to do with pageblock_skip bitmap
handling.  Compaction maintains a pageblock_skip bitmap to record
pageblocks where isolation recently failed.  This bitmap can be reset by
three ways:

1) direct compaction is restarting after going through the full deferred cycle

2) kswapd goes to sleep, and some other direct compaction has previously
   finished scanning the whole zone and set zone->compact_blockskip_flush.
   Note that a successful direct compaction clears this flag.

3) compaction was invoked manually via trigger in /proc

The case 2) is somewhat fuzzy to begin with, but after introducing
kcompactd we should update it.  The check for direct compaction in 1),
and to set the flush flag in 2) use current_is_kswapd(), which doesn't
work for kcompactd.  Thus, this patch adds bool direct_compaction to
compact_control to use in 2).  For the case 1) we remove the check
completely - unlike the former kswapd compaction, kcompactd does use the
deferred compaction functionality, so flushing tied to restarting from
deferred compaction makes sense here.

Note that when kswapd goes to sleep, kcompactd is woken up, so it will
see the flushed pageblock_skip bits.  This is different from when the
former kswapd compaction observed the bits and I believe it makes more
sense.  Kcompactd can afford to be more thorough than a direct
compaction trying to limit allocation latency, or kswapd whose primary
goal is to reclaim.

For testing, I used stress-highalloc configured to do order-9
allocations with GFP_NOWAIT|__GFP_HIGH|__GFP_COMP, so they relied just
on kswapd/kcompactd reclaim/compaction (the interfering kernel builds in
phases 1 and 2 work as usual):

stress-highalloc
                        4.5-rc1+before          4.5-rc1+after
                             -nodirect              -nodirect
Success 1 Min          1.00 (  0.00%)         5.00 (-66.67%)
Success 1 Mean         1.40 (  0.00%)         6.20 (-55.00%)
Success 1 Max          2.00 (  0.00%)         7.00 (-16.67%)
Success 2 Min          1.00 (  0.00%)         5.00 (-66.67%)
Success 2 Mean         1.80 (  0.00%)         6.40 (-52.38%)
Success 2 Max          3.00 (  0.00%)         7.00 (-16.67%)
Success 3 Min         34.00 (  0.00%)        62.00 (  1.59%)
Success 3 Mean        41.80 (  0.00%)        63.80 (  1.24%)
Success 3 Max         53.00 (  0.00%)        65.00 (  2.99%)

User                          3166.67        3181.09
System                        1153.37        1158.25
Elapsed                       1768.53        1799.37

                            4.5-rc1+before   4.5-rc1+after
                                 -nodirect    -nodirect
Direct pages scanned                32938        32797
Kswapd pages scanned              2183166      2202613
Kswapd pages reclaimed            2152359      2143524
Direct pages reclaimed              32735        32545
Percentage direct scans                1%           1%
THP fault alloc                       579          612
THP collapse alloc                    304          316
THP splits                              0            0
THP fault fallback                    793          778
THP collapse fail                      11           16
Compaction stalls                    1013         1007
Compaction success                     92           67
Compaction failures                   920          939
Page migrate success               238457       721374
Page migrate failure                23021        23469
Compaction pages isolated          504695      1479924
Compaction migrate scanned         661390      8812554
Compaction free scanned          13476658     84327916
Compaction cost                       262          838

After this patch we see improvements in allocation success rate
(especially for phase 3) along with increased compaction activity.  The
compaction stalls (direct compaction) in the interfering kernel builds
(probably THP's) also decreased somewhat thanks to kcompactd activity,
yet THP alloc successes improved a bit.

Note that elapsed and user time isn't so useful for this benchmark,
because of the background interference being unpredictable.  It's just
to quickly spot some major unexpected differences.  System time is
somewhat more useful and that didn't increase.

Also (after adjusting mmtests' ftrace monitor):

Time kswapd awake               2547781     2269241
Time kcompactd awake                  0      119253
Time direct compacting           939937      557649
Time kswapd compacting                0           0
Time kcompactd compacting             0      119099

The decrease of overal time spent compacting appears to not match the
increased compaction stats.  I suspect the tasks get rescheduled and
since the ftrace monitor doesn't see that, the reported time is wall
time, not CPU time.  But arguably direct compactors care about overall
latency anyway, whether busy compacting or waiting for CPU doesn't
matter.  And that latency seems to almost halved.

It's also interesting how much time kswapd spent awake just going
through all the priorities and failing to even try compacting, over and
over.

We can also configure stress-highalloc to perform both direct
reclaim/compaction and wakeup kswapd/kcompactd, by using
GFP_KERNEL|__GFP_HIGH|__GFP_COMP:

stress-highalloc
                        4.5-rc1+before         4.5-rc1+after
                               -direct               -direct
Success 1 Min          4.00 (  0.00%)        9.00 (-50.00%)
Success 1 Mean         8.00 (  0.00%)       10.00 (-19.05%)
Success 1 Max         12.00 (  0.00%)       11.00 ( 15.38%)
Success 2 Min          4.00 (  0.00%)        9.00 (-50.00%)
Success 2 Mean         8.20 (  0.00%)       10.00 (-16.28%)
Success 2 Max         13.00 (  0.00%)       11.00 (  8.33%)
Success 3 Min         75.00 (  0.00%)       74.00 (  1.33%)
Success 3 Mean        75.60 (  0.00%)       75.20 (  0.53%)
Success 3 Max         77.00 (  0.00%)       76.00 (  0.00%)

User                          3344.73       3246.04
System                        1194.24       1172.29
Elapsed                       1838.04       1836.76

                            4.5-rc1+before  4.5-rc1+after
                                   -direct     -direct
Direct pages scanned               125146      120966
Kswapd pages scanned              2119757     2135012
Kswapd pages reclaimed            2073183     2108388
Direct pages reclaimed             124909      120577
Percentage direct scans                5%          5%
THP fault alloc                       599         652
THP collapse alloc                    323         354
THP splits                              0           0
THP fault fallback                    806         793
THP collapse fail                      17          16
Compaction stalls                    2457        2025
Compaction success                    906         518
Compaction failures                  1551        1507
Page migrate success              2031423     2360608
Page migrate failure                32845       40852
Compaction pages isolated         4129761     4802025
Compaction migrate scanned       11996712    21750613
Compaction free scanned         214970969   344372001
Compaction cost                      2271        2694

In this scenario, this patch doesn't change the overall success rate as
direct compaction already tries all it can.  There's however significant
reduction in direct compaction stalls (that is, the number of
allocations that went into direct compaction).  The number of successes
(i.e.  direct compaction stalls that ended up with successful
allocation) is reduced by the same number.  This means the offload to
kcompactd is working as expected, and direct compaction is reduced
either due to detecting contention, or compaction deferred by kcompactd.
In the previous version of this patchset there was some apparent
reduction of success rate, but the changes in this version (such as
using sync compaction only), new baseline kernel, and/or averaging
results from 5 executions (my bet), made this go away.

Ftrace-based stats seem to roughly agree:

Time kswapd awake               2532984     2326824
Time kcompactd awake                  0      257916
Time direct compacting           864839      735130
Time kswapd compacting                0           0
Time kcompactd compacting             0      257585

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Git-commit: accf62422b3a67fce8ce086aa81c8300ddbf42be
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Change-Id: I5cc784ff0fb1b28bdcfa8d4ef9bd33f585ee4662
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
2017-02-21 12:37:21 +05:30
..
kasan UBSAN: run-time undefined behavior sanity checker 2016-03-22 11:09:57 -07:00
backing-dev.c block: fix bdi vs gendisk lifetime mismatch 2016-08-20 18:09:24 +02:00
balloon_compaction.c virtio_balloon: fix race between migration and ballooning 2016-03-03 15:07:18 -08:00
bootmem.c mm: Remove __init annotations from free_bootmem_late 2016-03-22 11:03:23 -07:00
cleancache.c cleancache: remove limit on the number of cleancache enabled filesystems 2015-04-14 16:49:03 -07:00
cma.c mm: cma: check the max limit for cma allocation 2016-11-16 05:53:10 -08:00
cma.h mm: cma: mark cma_bitmap_maxno() inline in header 2015-08-14 15:56:32 -07:00
cma_debug.c mm/cma_debug: correct size input to bitmap function 2015-07-17 16:39:54 -07:00
compaction.c mm, kswapd: replace kswapd compaction with waking up kcompactd 2017-02-21 12:37:21 +05:30
debug-pagealloc.c debug-pagealloc: Panic on pagealloc corruption 2016-10-28 10:55:48 +05:30
debug.c mm: add WasActive page flag 2016-05-31 15:27:11 -07:00
dmapool.c mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd 2015-11-06 17:50:42 -08:00
early_ioremap.c mm/early_ioremap: use offset_in_page macro 2015-11-05 19:34:48 -08:00
fadvise.c writeback: implement and use inode_congested() 2015-06-02 08:33:35 -06:00
failslab.c mm, page_alloc: rename __GFP_WAIT to __GFP_RECLAIM 2015-11-06 17:50:42 -08:00
filemap.c mm: vmstat: add pageoutclean 2016-03-25 16:03:59 -07:00
frame_vector.c mm: fix docbook comment for get_vaddr_frames() 2015-11-05 19:34:48 -08:00
frontswap.c frontswap: allow multiple backends 2015-06-24 17:49:45 -07:00
gup.c mm: remove gup_flags FOLL_WRITE games from __get_user_pages() 2016-12-09 17:43:44 -08:00
highmem.c mm/highmem: make kmap cache coloring aware 2014-08-06 18:01:22 -07:00
huge_memory.c Revert "Merge remote-tracking branch 'msm-4.4/tmp-510d0a3f' into msm-4.4" 2016-08-26 14:34:05 -07:00
hugetlb.c hugetlb: fix nr_pmds accounting with shared page tables 2016-09-07 08:32:35 +02:00
hugetlb_cgroup.c mm: make compound_head() robust 2015-11-06 17:50:42 -08:00
hwpoison-inject.c hwpoison: use page_cgroup_ino for filtering by memcg 2015-09-10 13:29:01 -07:00
init-mm.c
internal.h mm, kswapd: replace kswapd compaction with waking up kcompactd 2017-02-21 12:37:21 +05:30
interval_tree.c mm: replace vma->sharead.linear with vma->shared 2015-02-10 14:30:31 -08:00
Kconfig mm: Support address range reclaim 2016-06-22 14:44:20 -07:00
Kconfig.debug mm/debug_pagealloc: ask users for default setting of debug_pagealloc 2016-04-29 14:40:10 -07:00
kmemcheck.c mm/slab_common: move kmem_cache definition to internal header 2014-10-09 22:25:50 -04:00
kmemleak-test.c mm/kmemleak-test.c: use pr_fmt for logging 2014-06-06 16:08:18 -07:00
kmemleak.c kmemleak : Make kmemleak_stack_scan optional using config 2016-03-22 11:03:47 -07:00
ksm.c mm: ksm: avoid trageted reclaim of ksm pages 2016-06-29 15:12:23 -07:00
list_lru.c memcg: simplify and inline __mem_cgroup_from_kmem 2015-11-05 19:34:48 -08:00
maccess.c x86: remove more uaccess_32.h complexity 2016-08-27 11:23:38 +08:00
madvise.c mm: add a field to store names for private anonymous memory 2016-02-16 13:54:13 -08:00
Makefile Merge branch 'v4.4-16.09-android-tmp' into lsk-v4.4-16.09-android 2016-12-16 13:52:17 -08:00
memblock.c mm/memblock: disable local irqs while late memblock changes 2016-05-31 15:26:50 -07:00
memcontrol.c Merge "Merge branch 'v4.4-16.09-android-tmp' into lsk-v4.4-16.09-android" 2016-12-19 17:04:51 -08:00
memory-failure.c mm: Enhance per process reclaim to consider shared pages 2016-06-22 14:43:57 -07:00
memory.c Merge remote-tracking branch 'msm-4.4/tmp-510d0a3f' into msm-4.4 2016-10-21 18:00:55 -07:00
memory_hotplug.c mm, memory hotplug: small cleanup in online_pages() 2017-02-21 12:37:10 +05:30
mempolicy.c mm: add a field to store names for private anonymous memory 2016-02-16 13:54:13 -08:00
mempool.c mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd 2015-11-06 17:50:42 -08:00
memtest.c lib: memtest: Add MEMTEST_ENABLE_DEFAULT option 2016-05-05 15:05:52 -07:00
migrate.c Merge remote-tracking branch 'msm4.4/tmp-da9a92f' into msm-4.4 2016-10-28 10:48:35 -07:00
mincore.c mm/mincore: use offset_in_page macro 2015-11-05 19:34:48 -08:00
mlock.c Merge remote-tracking branch 'lsk-44/linux-linaro-lsk-v4.4' into 44rc2 2016-03-23 20:51:00 -07:00
mm_init.c mm: meminit: remove mminit_verify_page_links 2015-06-30 19:44:56 -07:00
mmap.c Revert "arm64: Add support for app specific settings" 2016-08-15 15:08:53 -07:00
mmu_context.c sched/mm: call finish_arch_post_lock_switch in idle_task_exit and use_mm 2014-02-21 08:50:17 +01:00
mmu_notifier.c mmu-notifier: add clear_young callback 2015-09-10 13:29:01 -07:00
mmzone.c mm: microoptimize zonelist operations 2015-02-11 17:06:02 -08:00
mprotect.c mm: add a field to store names for private anonymous memory 2016-02-16 13:54:13 -08:00
mremap.c mm/mremap: use offset_in_page macro 2015-11-05 19:34:48 -08:00
msync.c mm/msync: use offset_in_page macro 2015-11-05 19:34:48 -08:00
nobootmem.c mm: Remove __init annotations from free_bootmem_late 2016-03-22 11:03:23 -07:00
nommu.c mm/nommu.c: drop unlikely inside BUG_ON() 2015-11-05 19:34:48 -08:00
oom_kill.c mm, oom: make dump_tasks public 2016-03-22 11:03:30 -07:00
page-writeback.c Merge remote-tracking branch 'msm4.4/tmp-da9a92f' into msm-4.4 2016-10-28 10:48:35 -07:00
page_alloc.c mm, compaction: introduce kcompactd 2017-02-21 12:36:57 +05:30
page_counter.c mm: page_counter: let page_counter_try_charge() return bool 2015-11-05 19:34:48 -08:00
page_ext.c mm: introduce idle page tracking 2015-09-10 13:29:01 -07:00
page_idle.c mm: introduce idle page tracking 2015-09-10 13:29:01 -07:00
page_io.c fs: use helper bio_add_page() instead of open coding on bi_io_vec 2015-08-13 12:32:00 -06:00
page_isolation.c mm: Inform KASAN when allocating pages during isolation 2016-11-23 17:18:03 -08:00
page_owner.c mm/page_owner: ask users about default setting of PAGE_OWNER 2016-05-03 15:53:55 -07:00
pagewalk.c mm/pagewalk.c: prevent positive return value of walk_page_test() from being passed to callers 2015-03-25 16:20:30 -07:00
percpu-km.c percpu: implmeent pcpu_nr_empty_pop_pages and chunk->nr_populated 2014-09-02 14:46:05 -04:00
percpu-vm.c percpu: move region iterations out of pcpu_[de]populate_chunk() 2014-09-02 14:46:02 -04:00
percpu.c percpu: fix synchronization between synchronous map extension and chunk destruction 2016-07-27 09:47:33 -07:00
pgtable-generic.c mm,thp: khugepaged: call pte flush at the time of collapse 2016-02-25 12:01:23 -08:00
process_reclaim.c lowmemorykiller: Introduce sysfs node for ALMK and PPR adj threshold 2016-12-17 16:16:17 +05:30
process_vm_access.c ptrace: use fsuid, fsgid, effective creds for fs access checks 2016-02-25 12:01:16 -08:00
quicklist.c
readahead.c mm: change initial readahead window size calculation 2016-03-22 11:03:39 -07:00
rmap.c mm: Enhance per process reclaim to consider shared pages 2016-06-22 14:43:57 -07:00
shmem.c Merge branch 'linux-linaro-lsk-v4.4' into linux-linaro-lsk-v4.4-android 2016-07-29 21:38:37 +01:00
showmem.c mm: showmem: make the notifiers atomic 2016-03-22 11:03:55 -07:00
slab.c mm: SLAB hardened usercopy support 2016-08-27 11:23:38 +08:00
slab.h slab/slub: adjust kmem_cache_alloc_bulk API 2015-11-22 11:58:44 -08:00
slab_common.c mm: memcontrol: fix cgroup creation failure after many small jobs 2016-08-16 09:30:51 +02:00
slob.c slab/slub: adjust kmem_cache_alloc_bulk API 2015-11-22 11:58:44 -08:00
slub.c Merge branch 'v4.4-16.09-android-tmp' into lsk-v4.4-16.09-android 2016-12-16 13:52:17 -08:00
sparse-vmemmap.c
sparse.c mm: use macros from compiler.h instead of __attribute__((...)) 2014-04-07 16:35:54 -07:00
swap.c mm: make compound_head() robust 2015-11-06 17:50:42 -08:00
swap_cgroup.c mm: page_cgroup: rename file to mm/swap_cgroup.c 2014-12-10 17:41:09 -08:00
swap_ratio.c mm: swap_ratio: bail out if there aren't any other swap device 2016-05-31 15:23:38 -07:00
swap_state.c lowmemorykiller: Don't count swap cache pages twice 2016-04-13 11:11:01 -07:00
swapfile.c mm: swap: swap ratio support 2016-03-23 21:19:04 -07:00
truncate.c mm + fs: extends support for cache dropping 2016-03-23 21:24:12 -07:00
usercopy.c UPSTREAM: usercopy: remove page-spanning test for now 2016-09-14 14:44:29 +05:30
userfaultfd.c userfaultfd: avoid mmap_sem read recursion in mcopy_atomic 2015-09-04 16:54:41 -07:00
util.c proc: revert /proc/<pid>/maps [stack:TID] annotation 2016-09-15 08:27:46 +02:00
vmacache.c mm/vmacache: inline vmacache_valid_mm() 2015-11-05 19:34:48 -08:00
vmalloc.c mm: Update is_vmalloc_addr to account for vmalloc savings 2016-03-22 11:03:59 -07:00
vmpressure.c mm: vmpressure: make vmpressure window variable 2016-12-16 20:25:45 +05:30
vmscan.c mm, kswapd: replace kswapd compaction with waking up kcompactd 2017-02-21 12:37:21 +05:30
vmstat.c mm, compaction: introduce kcompactd 2017-02-21 12:36:57 +05:30
workingset.c list_lru: add helpers to isolate items 2015-02-12 18:54:10 -08:00
zbud.c mm: zbud: fix the locking scenarios with zcache 2016-08-25 11:49:45 +05:30
zcache.c mm: zcache: fix merge issues 2016-06-07 11:57:44 -07:00
zpool.c mm: zsmalloc: constify struct zs_pool name 2015-11-06 17:50:42 -08:00
zsmalloc.c Revert "Merge remote-tracking branch 'msm-4.4/tmp-510d0a3f' into msm-4.4" 2016-08-26 14:34:05 -07:00
zswap.c Revert "Merge remote-tracking branch 'msm-4.4/tmp-510d0a3f' into msm-4.4" 2016-08-26 14:34:05 -07:00