Commit graph

10073 commits

Author SHA1 Message Date
Arnd Bergmann
9d92396f8a mm: memcontrol: avoid unused function warning
am: 5fad174344

Change-Id: I6734d095131126260931a16a670c66ec58ba4896
2017-03-18 11:21:33 +00:00
Arnd Bergmann
5fad174344 mm: memcontrol: avoid unused function warning
commit 358c07fcc3b60ab08d77f1684de8bd81bcf49a1a upstream.

A bugfix in v4.8-rc2 introduced a harmless warning when
CONFIG_MEMCG_SWAP is disabled but CONFIG_MEMCG is enabled:

  mm/memcontrol.c:4085:27: error: 'mem_cgroup_id_get_online' defined but not used [-Werror=unused-function]
   static struct mem_cgroup *mem_cgroup_id_get_online(struct mem_cgroup *memcg)

This moves the function inside of the #ifdef block that hides the
calling function, to avoid the warning.

Fixes: 1f47b61fb407 ("mm: memcontrol: fix swap counter leak on swapout from offline cgroup")
Link: http://lkml.kernel.org/r/20160824113733.2776701-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-03-18 19:09:57 +08:00
Minchan Kim
656fedaf74 mm: do not access page->mapping directly on page_endio
am: c5c893e7c4

Change-Id: Ic37f4022408ccc77d7702ebc4be606bd94ad20e1
2017-03-12 08:15:36 +00:00
Vinayak Menon
5cc4831e1b mm: vmpressure: fix sending wrong events on underflow
am: 66f43a5768

Change-Id: I914c9c65e90dc02251788277937daca32fd59818
2017-03-12 08:15:27 +00:00
Gavin Shan
32125704fe mm/page_alloc: fix nodes for reclaim in fast path
am: 612e4679b8

Change-Id: I134dc72c598a798d4f3dead4421c269ae7bb3791
2017-03-12 08:15:17 +00:00
Minchan Kim
c5c893e7c4 mm: do not access page->mapping directly on page_endio
commit dd8416c47715cf324c9a16f13273f9fda87acfed upstream.

With rw_page, page_endio is used for completing IO on a page and it
propagates write error to the address space if the IO fails.  The
problem is it accesses page->mapping directly which might be okay for
file-backed pages but it shouldn't for anonymous page.  Otherwise, it
can corrupt one of field from anon_vma under us and system goes panic
randomly.

swap_writepage
  bdev_writepage
    ops->rw_page

I encountered the BUG during developing new zram feature and it was
really hard to figure it out because it made random crash, somtime
mmap_sem lockdep, sometime other places where places never related to
zram/zsmalloc, and not reproducible with some configuration.

When I consider how that bug is subtle and people do fast-swap test with
brd, it's worth to add stable mark, I think.

Fixes: dd6bd0d9c7 ("swap: use bdev_read_page() / bdev_write_page()")
Signed-off-by: Minchan Kim <minchan@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-03-12 06:37:25 +01:00
Vinayak Menon
66f43a5768 mm: vmpressure: fix sending wrong events on underflow
commit e1587a4945408faa58d0485002c110eb2454740c upstream.

At the end of a window period, if the reclaimed pages is greater than
scanned, an unsigned underflow can result in a huge pressure value and
thus a critical event.  Reclaimed pages is found to go higher than
scanned because of the addition of reclaimed slab pages to reclaimed in
shrink_node without a corresponding increment to scanned pages.

Minchan Kim mentioned that this can also happen in the case of a THP
page where the scanned is 1 and reclaimed could be 512.

Link: http://lkml.kernel.org/r/1486641577-11685-1-git-send-email-vinmenon@codeaurora.org
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Anton Vorontsov <anton.vorontsov@linaro.org>
Cc: Shiraz Hashim <shashim@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-03-12 06:37:25 +01:00
Gavin Shan
612e4679b8 mm/page_alloc: fix nodes for reclaim in fast path
commit e02dc017c3032dcdce1b993af0db135462e1b4b7 upstream.

When @node_reclaim_node isn't 0, the page allocator tries to reclaim
pages if the amount of free memory in the zones are below the low
watermark.  On Power platform, none of NUMA nodes are scanned for page
reclaim because no nodes match the condition in zone_allows_reclaim().
On Power platform, RECLAIM_DISTANCE is set to 10 which is the distance
of Node-A to Node-A.  So the preferred node even won't be scanned for
page reclaim.

   __alloc_pages_nodemask()
   get_page_from_freelist()
      zone_allows_reclaim()

Anton proposed the test code as below:

   # cat alloc.c
      :
   int main(int argc, char *argv[])
   {
	void *p;
	unsigned long size;
	unsigned long start, end;

	start = time(NULL);
	size = strtoul(argv[1], NULL, 0);
	printf("To allocate %ldGB memory\n", size);

	size <<= 30;
	p = malloc(size);
	assert(p);
	memset(p, 0, size);

	end = time(NULL);
	printf("Used time: %ld seconds\n", end - start);
	sleep(3600);
	return 0;
   }

The system I use for testing has two NUMA nodes.  Both have 128GB
memory.  In below scnario, the page caches on node#0 should be reclaimed
when it encounters pressure to accommodate request of allocation.

   # echo 2 > /proc/sys/vm/zone_reclaim_mode; \
     sync; \
     echo 3 > /proc/sys/vm/drop_caches; \
   # taskset -c 0 cat file.32G > /dev/null; \
     grep FilePages /sys/devices/system/node/node0/meminfo
     Node 0 FilePages:       33619712 kB
   # taskset -c 0 ./alloc 128
   # grep FilePages /sys/devices/system/node/node0/meminfo
     Node 0 FilePages:       33619840 kB
   # grep MemFree /sys/devices/system/node/node0/meminfo
     Node 0 MemFree:          186816 kB

With the patch applied, the pagecache on node-0 is reclaimed when its
free memory is running out.  It's the expected behaviour.

   # echo 2 > /proc/sys/vm/zone_reclaim_mode; \
     sync; \
     echo 3 > /proc/sys/vm/drop_caches
   # taskset -c 0 cat file.32G > /dev/null; \
     grep FilePages /sys/devices/system/node/node0/meminfo
     Node 0 FilePages:       33605568 kB
   # taskset -c 0 ./alloc 128
   # grep FilePages /sys/devices/system/node/node0/meminfo
     Node 0 FilePages:        1379520 kB
   # grep MemFree /sys/devices/system/node/node0/meminfo
     Node 0 MemFree:           317120 kB

Fixes: 5f7a75acdb ("mm: page_alloc: do not cache reclaim distances")
Link: http://lkml.kernel.org/r/1486532455-29613-1-git-send-email-gwshan@linux.vnet.ibm.com
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Anton Blanchard <anton@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-03-12 06:37:25 +01:00
Todd Kjos
837de638dc Merge branch 'upstream-linux-4.4.y' into android-4.4 2017-03-02 13:53:48 -08:00
Tejun Heo
de5634875b block: fix double-free in the failure path of cgwb_bdi_init()
commit 5f478e4ea5c5560b4e40eb136991a09f9389f331 upstream.

When !CONFIG_CGROUP_WRITEBACK, bdi has single bdi_writeback_congested
at bdi->wb_congested.  cgwb_bdi_init() allocates it with kzalloc() and
doesn't do further initialization.  This usually works fine as the
reference count gets bumped to 1 by wb_init() and the put from
wb_exit() releases it.

However, when wb_init() fails, it puts the wb base ref automatically
freeing the wb and the explicit kfree() in cgwb_bdi_init() error path
ends up trying to free the same pointer the second time causing a
double-free.

Fix it by explicitly initilizing the refcnt to 1 and putting the base
ref from cgwb_bdi_destroy().

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Fixes: a13f35e871 ("writeback: don't embed root bdi_writeback_congested in bdi_writeback")
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-02-26 11:07:51 +01:00
Dmitry Shmidt
5edfa05a10 This is the 4.4.48 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlicFCgACgkQONu9yGCS
 aT4TLg//QVqQvdkxyy0lKQfOxmo4RSErmpFstgkvuVgucGh6Akvh8OV9hHJKabjK
 RUn3BNASoWfQF+G1vn7EQWcTGDgJhF/P39DvMu3zvpRbSYMMeX7og9iDnoNn2WtG
 l89l+5YfQG7Y8eJWj1mnTW2ul9pUxJFg4j2rjmcLhfgKPvJPCn+cpU2XKUxpj7gM
 yd/nbVuQlMFW6qfEES1W1RbDEOQ1KWJgdupsMEgodRxb/dlg8KldBQFmv1fGcrA6
 5jFqWzsQQ7AyfMWIRDBm9mJlHuvdoGCEGkyTbsZoSyuN72/cyfPSfTZPInpi09bb
 l0sod1nzcZsuQVJzaQHTKlvpMEduIDQVxy2/pNW/pKnGAS++fkK+uJCsu0mz+6+8
 zntaPdVoboiwwoK5dgP27vgWpYpw2QoCpPqWno7NIVNZfUcWWng3NS49goN+ytvY
 m1i1ih4KU1bMqMrT0qZugQwHHqaE9IJ8xyDMdXc86cMH1ylTo8ZnOOyGxRKKLOW1
 nVs4aQT2i7E9yQ8TjVJplLxtU3t/Q3D1qqPr5U70XJyEgT5X4/V0mXJaRRWXAzXP
 2IBJOLznqwbwuIHV8ocp7i76qtpVqbJkpMx2NhB0tFP0XjffqpZvv0v8aBTAdBS2
 060nyG8fZad6L++tWVODt7nd7gkD4NN/I8BqD0XzXx6zbOJexqA=
 =GUZe
 -----END PGP SIGNATURE-----

Merge tag 'v4.4.48' into android-4.4.y

This is the 4.4.48 stable release
2017-02-09 10:59:15 -08:00
Toshi Kani
87ebcc534d base/memory, hotplug: fix a kernel oops in show_valid_zones()
commit a96dfddbcc04336bbed50dc2b24823e45e09e80c upstream.

Reading a sysfs "memoryN/valid_zones" file leads to the following oops
when the first page of a range is not backed by struct page.
show_valid_zones() assumes that 'start_pfn' is always valid for
page_zone().

 BUG: unable to handle kernel paging request at ffffea017a000000
 IP: show_valid_zones+0x6f/0x160

This issue may happen on x86-64 systems with 64GiB or more memory since
their memory block size is bumped up to 2GiB.  [1] An example of such
systems is desribed below.  0x3240000000 is only aligned by 1GiB and
this memory block starts from 0x3200000000, which is not backed by
struct page.

 BIOS-e820: [mem 0x0000003240000000-0x000000603fffffff] usable

Since test_pages_in_a_zone() already checks holes, fix this issue by
extending this function to return 'valid_start' and 'valid_end' for a
given range.  show_valid_zones() then proceeds with the valid range.

[1] 'Commit bdee237c03 ("x86: mm: Use 2GB memory block size on
    large-memory x86-64 systems")'

Link: http://lkml.kernel.org/r/20170127222149.30893-3-toshi.kani@hpe.com
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Zhang Zhen <zhenzhang.zhang@huawei.com>
Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: <stable@vger.kernel.org>	[4.4+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-02-09 08:02:47 +01:00
Michal Hocko
4025ab36c8 mm, fs: check for fatal signals in do_generic_file_read()
commit 5abf186a30a89d5b9c18a6bf93a2c192c9fd52f6 upstream.

do_generic_file_read() can be told to perform a large request from
userspace.  If the system is under OOM and the reading task is the OOM
victim then it has an access to memory reserves and finishing the full
request can lead to the full memory depletion which is dangerous.  Make
sure we rather go with a short read and allow the killed task to
terminate.

Link: http://lkml.kernel.org/r/20170201092706.9966-3-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-02-09 08:02:46 +01:00
Toshi Kani
e86a876957 mm/memory_hotplug.c: check start_pfn in test_pages_in_a_zone()
commit deb88a2a19e85842d79ba96b05031739ec327ff4 upstream.

Patch series "fix a kernel oops when reading sysfs valid_zones", v2.

A sysfs memory file is created for each 2GiB memory block on x86-64 when
the system has 64GiB or more memory.  [1] When the start address of a
memory block is not backed by struct page, i.e.  a memory range is not
aligned by 2GiB, reading its 'valid_zones' attribute file leads to a
kernel oops.  This issue was observed on multiple x86-64 systems with
more than 64GiB of memory.  This patch-set fixes this issue.

Patch 1 first fixes an issue in test_pages_in_a_zone(), which does not
test the start section.

Patch 2 then fixes the kernel oops by extending test_pages_in_a_zone()
to return valid [start, end).

Note for stable kernels: The memory block size change was made by commit
bdee237c03 ("x86: mm: Use 2GB memory block size on large-memory x86-64
systems"), which was accepted to 3.9.  However, this patch-set depends
on (and fixes) the change to test_pages_in_a_zone() made by commit
5f0f2887f4 ("mm/memory_hotplug.c: check for missing sections in
test_pages_in_a_zone()"), which was accepted to 4.4.

So, I recommend that we backport it up to 4.4.

[1] 'Commit bdee237c03 ("x86: mm: Use 2GB memory block size on
    large-memory x86-64 systems")'

This patch (of 2):

test_pages_in_a_zone() does not check 'start_pfn' when it is aligned by
section since 'sec_end_pfn' is set equal to 'pfn'.  Since this function
is called for testing the range of a sysfs memory file, 'start_pfn' is
always aligned by section.

Fix it by properly setting 'sec_end_pfn' to the next section pfn.

Also make sure that this function returns 1 only when the range belongs
to a zone.

Link: http://lkml.kernel.org/r/20170127222149.30893-2-toshi.kani@hpe.com
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Cc: Andrew Banman <abanman@sgi.com>
Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-02-09 08:02:45 +01:00
Dan Streetman
7aeb95ceb8 zswap: disable changing params if init fails
commit d7b028f56a971a2e4d8d7887540a144eeefcd4ab upstream.

Add zswap_init_failed bool that prevents changing any of the module
params, if init_zswap() fails, and set zswap_enabled to false.  Change
'enabled' param to a callback, and check zswap_init_failed before
allowing any change to 'enabled', 'zpool', or 'compressor' params.

Any driver that is built-in to the kernel will not be unloaded if its
init function returns error, and its module params remain accessible for
users to change via sysfs.  Since zswap uses param callbacks, which
assume that zswap has been initialized, changing the zswap params after
a failed initialization will result in WARNING due to the param
callbacks expecting a pool to already exist.  This prevents that by
immediately exiting any of the param callbacks if initialization failed.

This was reported here:
  https://marc.info/?l=linux-mm&m=147004228125528&w=4

And fixes this WARNING:
  [  429.723476] WARNING: CPU: 0 PID: 5140 at mm/zswap.c:503 __zswap_pool_current+0x56/0x60

The warning is just noise, and not serious.  However, when init fails,
zswap frees all its percpu dstmem pages and its kmem cache.  The kmem
cache might be serious, if kmem_cache_alloc(NULL, gfp) has problems; but
the percpu dstmem pages are definitely a problem, as they're used as
temporary buffer for compressed pages before copying into place in the
zpool.

If the user does get zswap enabled after an init failure, then zswap
will likely Oops on the first page it tries to compress (or worse, start
corrupting memory).

Fixes: 90b0fc26d5 ("zswap: change zpool/compressor at runtime")
Link: http://lkml.kernel.org/r/20170124200259.16191-2-ddstreet@ieee.org
Signed-off-by: Dan Streetman <dan.streetman@canonical.com>
Reported-by: Marcin Miroslaw <marcin@mejor.pl>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-02-09 08:02:45 +01:00
Dmitry Shmidt
c8da41f0dc This is the 4.4.46 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAliRjsUACgkQONu9yGCS
 aT53lxAAzSDKqsc4eiGDuRW6A+hWUveObsOsAldQz8PLIhIEh/NiQPNsYisuWA+3
 K5nOwSU4E7LiFYemN/9wGGctq5aPtC7nWyDe43Xdeek0B3keu/KSqOhbOBywuAGr
 pdfEPcyIzgbzgIrygn5g8RUNCcgNkw5IYgmaiz2RCqNnjeQItAQQdm37svz/XDGt
 D6i0MaOHEkGfk3Z3ty4PJWSa/+Gd61M4OYTWiWWCY9gf1CCjQA59RlGhynm6bj+C
 Zq58tsnEzL8bOz9PDHoXKyrLc43mLhn7Q4Nhxs+rT00Yl7yxE/aFm9vd0EMIP3/V
 HUVobz4RmVyopvJPKqcPlD063BAYEm9BpxxXhc3tJNP3AaCjJodvKGkYKFKWPfG5
 6h0agCRCfeCS81y8JKeccw0tVBsD8ChuzO95tSHNulCWk/pN/7OD52b0jiuCMTFs
 lato8AKn2Ygtip/MWkK43yiuPd4qYjtFBSKx6x5we+Vj1HZz2DVjM8prkIqkfzM9
 FpgoyLOPAKXKbECzyURngLsXCckneS96ErxL/a/L/0/2u8Zv4RR9y9jcE38eAhZv
 IGhq8txZXnHD1Qyd6u0nbTeaLDswN06P1stsIUX8fR3qUa5QnlrhG9LzY9wihxlk
 oloLWuhIsUussgO07xgD/9NM8V7goN/DkQ+MCTShXxsNqb87WY0=
 =pPNg
 -----END PGP SIGNATURE-----

Merge tag 'v4.4.46' into android-4.4.y

This is the 4.4.46 stable release
2017-02-01 12:55:09 -08:00
David Rientjes
d072189321 mm, memcg: do not retry precharge charges
commit 3674534b775354516e5c148ea48f51d4d1909a78 upstream.

When memory.move_charge_at_immigrate is enabled and precharges are
depleted during move, mem_cgroup_move_charge_pte_range() will attempt to
increase the size of the precharge.

Prevent precharges from ever looping by setting __GFP_NORETRY.  This was
probably the intention of the GFP_KERNEL & ~__GFP_NORETRY, which is
pointless as written.

Fixes: 0029e19ebf ("mm: memcontrol: remove explicit OOM parameter in charge path")
Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1701130208510.69402@chino.kir.corp.google.com
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-02-01 08:30:54 +01:00
Vlastimil Babka
f11e8bf8e9 mm/mempolicy.c: do not put mempolicy before using its nodemask
commit d51e9894d27492783fc6d1b489070b4ba66ce969 upstream.

Since commit be97a41b29 ("mm/mempolicy.c: merge alloc_hugepage_vma to
alloc_pages_vma") alloc_pages_vma() can potentially free a mempolicy by
mpol_cond_put() before accessing the embedded nodemask by
__alloc_pages_nodemask().  The commit log says it's so "we can use a
single exit path within the function" but that's clearly wrong.  We can
still do that when doing mpol_cond_put() after the allocation attempt.

Make sure the mempolicy is not freed prematurely, otherwise
__alloc_pages_nodemask() can end up using a bogus nodemask, which could
lead e.g.  to premature OOM.

Fixes: be97a41b29 ("mm/mempolicy.c: merge alloc_hugepage_vma to alloc_pages_vma")
Link: http://lkml.kernel.org/r/20170118141124.8345-1-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-02-01 08:30:52 +01:00
Dmitry Shmidt
e9a82a4cbe This is the 4.4.45 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAliJpBoACgkQONu9yGCS
 aT54KRAAm2BjHOgU3FlM/mTal6ZVNIPKS/Xy9W0YXdQ+9URDKWNb0fwuqWAsf7LP
 n6ozLIB2n8FNlMWro7VHVNXKiUtw3BSRcjNamMm61XQcR1g0xY4iW6uhtpoTblAG
 PdeK3WAUfROxJEAxciFSTqfPKgSDQeaQRDSG10KTP5qIAPQM0T0/VU+20K0w7Cbf
 UZEJaGDOZS0XIRvNOak2DvQQxeXzwfvY5JTdx/MBOHw6e1MPfndeuhRFDJrIeOZC
 hKaG1ipkMQANcftHWTmJQ0gZEZMgVokqDtyQO3hqyrqLgVChM24j6mD7KvguCfPQ
 +ixC5oDQzBMQnp2uienP6FbDg1BZjHxO2R8z0vscXk++QtB3Mjxk8LBKZqeA636k
 E1fuGCrRf6Ec/0d7loMqOOO4KCUxOu+0JuhmlvmQDtrtGvQa5Qqd5WEF8ecOm6Y+
 5yKI11P5yiFANEkz4ysfTlyEltvIxp4Psu0YBrnVM6x5vNYEnr9wuGdikL21FI6F
 kS2FRB9+u2H4n2qNz7PGMt0tPub/F34W7RvD/zII4wqRrFz3wtw3UufAGgiT6X2n
 EIye5DErGfDcpHJ13kKYd7kCXl1u1y8tsBISRqYxl1sqshIZis0ktsb3ZtE5NMXF
 Qbh72lvpUU78E452ER1XDmk6keb98zUWbOtlBfbqJZ4iVpQ4GGY=
 =lShl
 -----END PGP SIGNATURE-----

Merge tag 'v4.4.45' into android-4.4.y

This is the 4.4.45 stable release
2017-01-26 13:42:20 -08:00
Mike Kravetz
1a46e6ecf8 mm/hugetlb.c: fix reservation race when freeing surplus pages
commit e5bbc8a6c992901058bc09e2ce01d16c111ff047 upstream.

return_unused_surplus_pages() decrements the global reservation count,
and frees any unused surplus pages that were backing the reservation.

Commit 7848a4bf51 ("mm/hugetlb.c: add cond_resched_lock() in
return_unused_surplus_pages()") added a call to cond_resched_lock in the
loop freeing the pages.

As a result, the hugetlb_lock could be dropped, and someone else could
use the pages that will be freed in subsequent iterations of the loop.
This could result in inconsistent global hugetlb page state, application
api failures (such as mmap) failures or application crashes.

When dropping the lock in return_unused_surplus_pages, make sure that
the global reservation count (resv_huge_pages) remains sufficiently
large to prevent someone else from claiming pages about to be freed.

Analyzed by Paul Cassella.

Fixes: 7848a4bf51 ("mm/hugetlb.c: add cond_resched_lock() in return_unused_surplus_pages()")
Link: http://lkml.kernel.org/r/1483991767-6879-1-git-send-email-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reported-by: Paul Cassella <cassella@cray.com>
Suggested-by: Michal Hocko <mhocko@kernel.org>
Cc: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-19 20:17:19 +01:00
Dmitry Shmidt
f103e3b0d8 This is the 4.4.43 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlh7bhIACgkQONu9yGCS
 aT5KKRAAw7baMz//gshbaXZuZZHJjqB+rBekdnzgBMBo4P2OJwiuFi7N27dRxiaO
 6uFAB5BUlFoc16AExAnmQJIiWB8lWeAt8S20RBLaiGGQ0iPTr4W7bsVH4Tk3zEaF
 gjCt3Tv8kzbno64lWk02xDilkxFO09y3ZtiMVkleUDpI1DRm5iAF11j+C42OG1Ox
 U1QPsjCoWJyZ9Ta7SEyoQsuJcU32Wl0IW1VAroqfYAJJF5yLOxGoJQfWsiyvwEjQ
 VQg+Yd2LlJkHjuOp4lSAaYjNrCjvV91KwcwOocyI2iw69vyyCQpbKeg50wA1+jBO
 2+b0WKTIYSA6EruAivIj0646UqnzzpUGf9DfeH2NIApO7PvTGWaIWk5uvheOf3Vz
 yVviVGYdedtMXixdzHVXgRVZQThlhLe2D5bvYB0bFInDrY8LlMZJVwjrbJuVQaUy
 u0eguKvOIXSsUwtDOLCEKKh7bH1605JXVm0yUAYRmTPbRjs8LQHu0kPpS70L5tYI
 MaftvgPFyLev88cDns+VjnJxm1cOHrSRyLigM4ArCrZdNs8EKPScFeV3bKcR2Gwi
 u05MdpwagOMSFqKdPFhiGYjjcpAeieeAOkmMro9C1KvIRhVt83cAlbP6L9R0PYSK
 n/wfpvrcbDKl0vcAPVscw1iM590WbRPGGrqlDGv+ak4cjsCb8ro=
 =kCbR
 -----END PGP SIGNATURE-----

Merge tag 'v4.4.43' into android-4.4.y

This is the 4.4.43 stable release
2017-01-17 12:44:14 -08:00
Oliver O'Halloran
e21901d7a5 mm/init: fix zone boundary creation
commit 90cae1fe1c3540f791d5b8e025985fa5e699b2bb upstream.

As a part of memory initialisation the architecture passes an array to
free_area_init_nodes() which specifies the max PFN of each memory zone.
This array is not necessarily monotonic (due to unused zones) so this
array is parsed to build monotonic lists of the min and max PFN for each
zone.  ZONE_MOVABLE is special cased here as its limits are managed by
the mm subsystem rather than the architecture.  Unfortunately, this
special casing is broken when ZONE_MOVABLE is the not the last zone in
the zone list.  The core of the issue is:

	if (i == ZONE_MOVABLE)
		continue;
	arch_zone_lowest_possible_pfn[i] =
		arch_zone_highest_possible_pfn[i-1];

As ZONE_MOVABLE is skipped the lowest_possible_pfn of the next zone will
be set to zero.  This patch fixes this bug by adding explicitly tracking
where the next zone should start rather than relying on the contents
arch_zone_highest_possible_pfn[].

Thie is low priority.  To get bitten by this you need to enable a zone
that appears after ZONE_MOVABLE in the zone_type enum.  As far as I can
tell this means running a kernel with ZONE_DEVICE or ZONE_CMA enabled,
so I can't see this affecting too many people.

I only noticed this because I've been fiddling with ZONE_DEVICE on
powerpc and 4.6 broke my test kernel.  This bug, in conjunction with the
changes in Taku Izumi's kernelcore=mirror patch (d91749c1dda71) and
powerpc being the odd architecture which initialises max_zone_pfn[] to
~0ul instead of 0 caused all of system memory to be placed into
ZONE_DEVICE at boot, followed a panic since device memory cannot be used
for kernel allocations.  I've already submitted a patch to fix the
powerpc specific bits, but I figured this should be fixed too.

Link: http://lkml.kernel.org/r/1462435033-15601-1-git-send-email-oohall@gmail.com
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Cc: Anton Blanchard <anton@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-15 13:41:36 +01:00
Dmitry Shmidt
712517177d This is the 4.4.40 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlhvboUACgkQONu9yGCS
 aT6eKxAAptTMEtfLi+wtyTgwAW/0bIMJ/jj57d06Q8jZa80MsSoYBysoZfDrTZjh
 i2/vXh8dqR6AcZQeURt6OJVRbSIw5H7qPQSVfmrQbCocrGwk6eu5Yrv85J1SuB5c
 Ad2NaYZ+H+UYdwKH0xzjWXgcgnZ9PSChXE85hLTGtM9J64AeZZ0aQDNQXAdpphLg
 UUDpoglD5+oeHywCLQ/H68rjSytmLLjmLTeRgK5hIIMefnsNx/eo3O7cZZmdPC3r
 81E+dvPhch42a/rLdRkb/1msNXNRptUQkYQ8+anFn9fvNn8M8ZfPmu9GhUF001Jl
 R3V2Q/I2hxGoMyrxpHYVx8puyYqzTMuVpanScmaJbq7TA33LV8+VAyENLIdNHWTf
 6ZI2MBxwLt+hkqJPSfdMKCyXB4DGVqRmiy5LqRbb2/Dp8xHlVTYLYO+rq3mXveuF
 8PeRm3MhhYfGQXcq9sbUlv0hNe2SxBJm7j8QPj6uHJN/EOWtAtmg8Y+b+V3DKYNp
 2+Zoz4qir8S5CkofMyYJRPrjn5clc0iwcfnLo57VzxsPO3Y3nNxjAtnGPjyndt3f
 8LKjrCUzncQVCLRCaS9xnMKC1qwKlY3MGi40qjHepFfjBKzgfZKotE7pdeU7v8Vk
 JLIgvHzo1AguCRvoPgP6sqPZ0ilFrJz7u7eK4sQh1pqBVRoiANI=
 =p6OP
 -----END PGP SIGNATURE-----

Merge tag 'v4.4.40' into android-4.4.y

This is the 4.4.40 stable release
2017-01-09 10:12:25 -08:00
Shaohua Li
14d8e5cae0 mm/vmscan.c: set correct defer count for shrinker
commit 5f33a0803bbd781de916f5c7448cbbbbc763d911 upstream.

Our system uses significantly more slab memory with memcg enabled with
the latest kernel.  With 3.10 kernel, slab uses 2G memory, while with
4.6 kernel, 6G memory is used.  The shrinker has problem.  Let's see we
have two memcg for one shrinker.  In do_shrink_slab:

1. Check cg1.  nr_deferred = 0, assume total_scan = 700.  batch size
   is 1024, then no memory is freed.  nr_deferred = 700

2. Check cg2.  nr_deferred = 700.  Assume freeable = 20, then
   total_scan = 10 or 40.  Let's assume it's 10.  No memory is freed.
   nr_deferred = 10.

The deferred share of cg1 is lost in this case.  kswapd will free no
memory even run above steps again and again.

The fix makes sure one memcg's deferred share isn't lost.

Link: http://lkml.kernel.org/r/2414be961b5d25892060315fbb56bb19d81d0c07.1476227351.git.shli@fb.com
Signed-off-by: Shaohua Li <shli@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-06 11:16:14 +01:00
Eric W. Biederman
03eed7afbc mm: Add a user_ns owner to mm_struct and fix ptrace permission checks
commit bfedb589252c01fa505ac9f6f2a3d5d68d707ef4 upstream.

During exec dumpable is cleared if the file that is being executed is
not readable by the user executing the file.  A bug in
ptrace_may_access allows reading the file if the executable happens to
enter into a subordinate user namespace (aka clone(CLONE_NEWUSER),
unshare(CLONE_NEWUSER), or setns(fd, CLONE_NEWUSER).

This problem is fixed with only necessary userspace breakage by adding
a user namespace owner to mm_struct, captured at the time of exec, so
it is clear in which user namespace CAP_SYS_PTRACE must be present in
to be able to safely give read permission to the executable.

The function ptrace_may_access is modified to verify that the ptracer
has CAP_SYS_ADMIN in task->mm->user_ns instead of task->cred->user_ns.
This ensures that if the task changes it's cred into a subordinate
user namespace it does not become ptraceable.

The function ptrace_attach is modified to only set PT_PTRACE_CAP when
CAP_SYS_PTRACE is held over task->mm->user_ns.  The intent of
PT_PTRACE_CAP is to be a flag to note that whatever permission changes
the task might go through the tracer has sufficient permissions for
it not to be an issue.  task->cred->user_ns is always the same
as or descendent of mm->user_ns.  Which guarantees that having
CAP_SYS_PTRACE over mm->user_ns is the worst case for the tasks
credentials.

To prevent regressions mm->dumpable and mm->user_ns are not considered
when a task has no mm.  As simply failing ptrace_may_attach causes
regressions in privileged applications attempting to read things
such as /proc/<pid>/stat

Acked-by: Kees Cook <keescook@chromium.org>
Tested-by: Cyrill Gorcunov <gorcunov@openvz.org>
Fixes: 8409cca705 ("userns: allow ptrace from non-init user namespaces")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-06 11:16:11 +01:00
Dmitry Shmidt
97b84c9650 This is the 4.4.37 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlhI+pgACgkQONu9yGCS
 aT4cfg/9EvPNVXZaO12eBDOXExA9SJKysG7kxVq2Ks6Z4hpKOkbvTv3enbUM7P37
 eN+G2v5zythE7MRR4MfF1pFcUDDLury3yAdZSfghRNpU85LwKLJmPYh74WVl9SSj
 wNoAG27gbAidVg5aYNVa3vkt+kmNkM8zNJ79Sk7mfqyGbC25GBkqQ/e2oFnGtZfN
 KbQ/J0G9hvKhFc8q6xY/Oz/PZbuf1kB7W2YuuQq1eLvNf99PErh1UAthL7C0HKDZ
 p6BMCIoE/Q3X4iXhGIpr0NJLr7/ke+fwjl3+ACwfj/xJ5vew/mj+tT8xxZvUGtaA
 jtP432UtRkoofpsgZ6pFUhRcWxErsd5ikAM7xfrzsO2vPE+XGnMODms2A0G7J8kC
 rXF4QtUR9i92HTroTG32EhIGDH1M0ZqeM5Hy+8iMsMidIuemHIORZdva/ZPSR7Lk
 7rtD2Ye2r54oDib97d8MdMIlsQxS2gUF8Yh2NxquK6UCbBdXYzgMo0XlM6sMYrur
 Nt1u9IYa4hVnSbFPh4pmm7MBwNkkyUKrI/Jc/4tr0f3yGOS/xKq3thTxsWUdxoRi
 Gs7nkJx9vPZdB0gmd4wA4Lvjk4VCtyXQI3apMsSvZpG0IOnjKQRJ+6ExN7ZC1Tgf
 WeWRfMS6SfoRLnbPFumGpjp5OJloCxvMcRQtvfiTy3eMSR++Y0c=
 =rmG9
 -----END PGP SIGNATURE-----

Merge tag 'v4.4.37' into android-4.4.y

This is the 4.4.37 stable release

Change-Id: Ic6753a5a223abc02c4fe5205642d4f904de2e5b8
2016-12-08 15:08:27 -08:00
Dmitry Vyukov
995761627d kasan: update kasan_global for gcc 7
commit 045d599a286bc01daa3510d59272440a17b23c2e upstream.

kasan_global struct is part of compiler/runtime ABI.  gcc revision
241983 has added a new field to kasan_global struct.  Update kernel
definition of kasan_global struct to include the new field.

Without this patch KASAN is broken with gcc 7.

Link: http://lkml.kernel.org/r/1479219743-28682-1-git-send-email-dvyukov@google.com
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-12-08 07:15:24 +01:00
Dmitry Shmidt
84dc474f3c This is the 4.4.34 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCAAGBQJYMrk7AAoJEDjbvchgkmk+MuIP/ApODvom/RcmEk2cyStEbedT
 wrFgzAXF5diBNayg2sVDoNjCVRhY7dYZN1WI1ItR929gkGHDxkwk0LmRsv5K9fLK
 jflWzZc9OrChYiJKPt9x4xuvDBXJAhtT4D4czZI1ZJhfrYhq0EDkxQB23J5WCsl/
 aCVBqe57fEf5C2crsPdS8V9UmPyYgrtV74PB/EMkTLbYlPU50AuWOjUWrVPMbMTf
 lbslyN82OD/09SuYhsKLw87CAOsJPVunSOE99w+jk83j6p2nCVVKFFxvFkON6rjw
 zZHCZ657oXYM0jeoSN/y+KuP4TTlYkl3LTD6EekzWDj9GtcDCOMnWqIA128a7gpJ
 /ME7piqf1/ypfUaUyQa6eM1U0QYHsRXnJ6jnuRAJ54qQlk4FTjRLPhQbVKqejdVb
 +5vDRXL3GhFPEc5zg8x0j+sTlVj/spcqQDx71t2G9UFKHdh4IhLJuUUySeA/BeTh
 bTimCMD6oG+q+WQnKB8oNSFokbmIabvj/pGLkdAw2Iji/P6JEsMzLIHQhZVSPcUb
 oqWIj5ZJPkad6Xs1kpaJUoD1o+GUxwnzXjCZ20uxHhQ2DChpta8mFE/W6IgN3QvH
 Kj7MWpxZohu37te0XaegAZoJAimMAO9YWYSpiMIcYfN6+3xxExN90aHBlIk1NHlu
 NTaNcHKOEgeoV0E+X3oK
 =jOxx
 -----END PGP SIGNATURE-----

Merge tag 'v4.4.34' into android-4.4.y

This is the 4.4.34 stable release

Change-Id: Ic90323945584a7173f54595e0482d26fafd10174
2016-11-21 10:56:25 -08:00
Jann Horn
5c54f79ad2 swapfile: fix memory corruption via malformed swapfile
commit dd111be69114cc867f8e826284559bfbc1c40e37 upstream.

When root activates a swap partition whose header has the wrong
endianness, nr_badpages elements of badpages are swabbed before
nr_badpages has been checked, leading to a buffer overrun of up to 8GB.

This normally is not a security issue because it can only be exploited
by root (more specifically, a process with CAP_SYS_ADMIN or the ability
to modify a swap file/partition), and such a process can already e.g.
modify swapped-out memory of any other userspace process on the system.

Link: http://lkml.kernel.org/r/1477949533-2509-1-git-send-email-jann@thejh.net
Signed-off-by: Jann Horn <jann@thejh.net>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-11-18 10:48:34 +01:00
Dmitry Shmidt
324e88de4a This is the 4.4.32 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCAAGBQJYKq+PAAoJEDjbvchgkmk+W3sQAKHJ6dI10P/sFTe4AlGoRGNr
 ZtCwGwwolBoD/NtXa2HCovc9ofIU4zWYXl5P+kbHtKV/ZB4q5+m7Q5bpWh4TQFUy
 9TKho6aywF9uXpAEV99qKYvAOIq5EgJXdgrhCRTYBBR9+uR3+B1cUJhxpyD6htw4
 H7ABpmihWjij0o9YYAin7y/O+8jeqnuNLPUoCek1Emf0cn7G5keMg8Lli0WCz7jM
 JdKOjbvaYscgvb4BqTKqtg5NneC3GoeNp43Kvz4LbmcPw1yT5N8sHswqlSio4U2U
 Sxyvtj0RxoSoAus2UR62pTGDu1TrSHxWEWpYpqa77hr1/TpBY7put1OldFmUfu1B
 voQUI05Ox74RT9pl5c8DGnXH8Zyiu6a7Fpj6EdWbWxtbIgvWCLaDHniEY1WKR6cj
 Bmil/zjGyDtzANJBasC9NJHF8yd+/vxNfn5n0eAz6Xp94MIdOGPIQle+NATG5osN
 0b/NLit64B2F6Djijkv1vV9V7x1oYqIYVG6f1BoVtRXCjhcx9PnkskXcP+1SKUhH
 xOTXLt6rGNaTj+T2/41VJUtZ6eiZj+0GZMXILu5SIEdKiRiGLfsLHX117OK3ZhYT
 PFzzzWZoC2FOL/ldp/K6ncPZV0oHn3yfQa3T97jGI1LbsYkXXyQkW5PNwqGccbUc
 xvhEAPDvBxDlfcgqWMaw
 =DC+B
 -----END PGP SIGNATURE-----

Merge tag 'v4.4.32' into android-4.4.y

This is the 4.4.32 stable release

Change-Id: I5028402eadfcf055ac44a5e67abc6da75b2068b3
2016-11-15 17:02:38 -08:00
Stephen Rothwell
26a5f0596f mm/cma: silence warnings due to max() usage
commit badbda53e505089062e194c614e6f23450bc98b2 upstream.

pageblock_order can be (at least) an unsigned int or an unsigned long
depending on the kernel config and architecture, so use max_t(unsigned
long, ...) when comparing it.

fixes these warnings:

In file included from include/asm-generic/bug.h:13:0,
                 from arch/powerpc/include/asm/bug.h:127,
                 from include/linux/bug.h:4,
                 from include/linux/mmdebug.h:4,
                 from include/linux/mm.h:8,
                 from include/linux/memblock.h:18,
                 from mm/cma.c:28:
mm/cma.c: In function 'cma_init_reserved_mem':
include/linux/kernel.h:748:17: warning: comparison of distinct pointer types lacks a cast
  (void) (&_max1 == &_max2);                   ^
mm/cma.c:186:27: note: in expansion of macro 'max'
  alignment = PAGE_SIZE << max(MAX_ORDER - 1, pageblock_order);
                           ^
mm/cma.c: In function 'cma_declare_contiguous':
include/linux/kernel.h:748:17: warning: comparison of distinct pointer types lacks a cast
  (void) (&_max1 == &_max2);                   ^
include/linux/kernel.h:747:9: note: in definition of macro 'max'
  typeof(y) _max2 = (y);            ^
mm/cma.c:270:29: note: in expansion of macro 'max'
   (phys_addr_t)PAGE_SIZE << max(MAX_ORDER - 1, pageblock_order));
                             ^
include/linux/kernel.h:748:17: warning: comparison of distinct pointer types lacks a cast
  (void) (&_max1 == &_max2);                   ^
include/linux/kernel.h:747:21: note: in definition of macro 'max'
  typeof(y) _max2 = (y);                        ^
mm/cma.c:270:29: note: in expansion of macro 'max'
   (phys_addr_t)PAGE_SIZE << max(MAX_ORDER - 1, pageblock_order));
                             ^

[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/20160526150748.5be38a4f@canb.auug.org.au
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-11-10 16:36:36 +01:00
Johannes Weiner
299991298b mm: memcontrol: do not recurse in direct reclaim
commit 89a2848381b5fcd9c4d9c0cd97680e3b28730e31 upstream.

On 4.0, we saw a stack corruption from a page fault entering direct
memory cgroup reclaim, calling into btrfs_releasepage(), which then
tried to allocate an extent and recursed back into a kmem charge ad
nauseam:

  [...]
  btrfs_releasepage+0x2c/0x30
  try_to_release_page+0x32/0x50
  shrink_page_list+0x6da/0x7a0
  shrink_inactive_list+0x1e5/0x510
  shrink_lruvec+0x605/0x7f0
  shrink_zone+0xee/0x320
  do_try_to_free_pages+0x174/0x440
  try_to_free_mem_cgroup_pages+0xa7/0x130
  try_charge+0x17b/0x830
  memcg_charge_kmem+0x40/0x80
  new_slab+0x2d9/0x5a0
  __slab_alloc+0x2fd/0x44f
  kmem_cache_alloc+0x193/0x1e0
  alloc_extent_state+0x21/0xc0
  __clear_extent_bit+0x2b5/0x400
  try_release_extent_mapping+0x1a3/0x220
  __btrfs_releasepage+0x31/0x70
  btrfs_releasepage+0x2c/0x30
  try_to_release_page+0x32/0x50
  shrink_page_list+0x6da/0x7a0
  shrink_inactive_list+0x1e5/0x510
  shrink_lruvec+0x605/0x7f0
  shrink_zone+0xee/0x320
  do_try_to_free_pages+0x174/0x440
  try_to_free_mem_cgroup_pages+0xa7/0x130
  try_charge+0x17b/0x830
  mem_cgroup_try_charge+0x65/0x1c0
  handle_mm_fault+0x117f/0x1510
  __do_page_fault+0x177/0x420
  do_page_fault+0xc/0x10
  page_fault+0x22/0x30

On later kernels, kmem charging is opt-in rather than opt-out, and that
particular kmem allocation in btrfs_releasepage() is no longer being
charged and won't recurse and overrun the stack anymore.

But it's not impossible for an accounted allocation to happen from the
memcg direct reclaim context, and we needed to reproduce this crash many
times before we even got a useful stack trace out of it.

Like other direct reclaimers, mark tasks in memcg reclaim PF_MEMALLOC to
avoid recursing into any other form of direct reclaim.  Then let
recursive charges from PF_MEMALLOC contexts bypass the cgroup limit.

Link: http://lkml.kernel.org/r/20161025141050.GA13019@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-11-10 16:36:32 +01:00
Alexander Polakov
9fa32e04f8 mm/list_lru.c: avoid error-path NULL pointer deref
commit 1bc11d70b5db7c6bb1414b283d7f09b1fe1ac0d0 upstream.

As described in https://bugzilla.kernel.org/show_bug.cgi?id=177821:

After some analysis it seems to be that the problem is in alloc_super().
In case list_lru_init_memcg() fails it goes into destroy_super(), which
calls list_lru_destroy().

And in list_lru_init() we see that in case memcg_init_list_lru() fails,
lru->node is freed, but not set NULL, which then leads list_lru_destroy()
to believe it is initialized and call memcg_destroy_list_lru().
memcg_destroy_list_lru() in turn can access lru->node[i].memcg_lrus,
which is NULL.

[akpm@linux-foundation.org: add comment]
Signed-off-by: Alexander Polakov <apolyakov@beget.ru>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-11-10 16:36:32 +01:00
Dmitry Shmidt
5577070b6b This is the 4.4.30 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCAAGBQJYF/ZyAAoJEDjbvchgkmk+6RcP+wewepsJZuBDDT0hNVglGDoQ
 GeD5/Ck6sMoRMGLSwM8IxHdC0wBml8fFj0DUX103UK+LK2CF8NSzLOgYCM4g7E9P
 E5Ye+SfSeefl47Is3hbW2YesrLjswEYht8zqytHehc7Uz6UAYwxHvAlpmbRvpI8k
 l+sbTnqovTBoXw3S/tAGpOm3hSeb7Qw1lVKlOtnYAk7e7OotqKbQjbbKKzI8Orbg
 uhs5UV2iw6AQY2mHZssilUiw5iEg/wqw+Ksjxos69thbT5oHrhAfx+grdOAWU9//
 m1YHh6ljZb/FXTrQBbt82R96HhlOKCKCj88dPfsHtIBfVOLWSacv9nc0PC8Xt3Uy
 xDq8DgP7OJRPVTNGuFFL1TuReWFtoYsDcxdQDyivn61XOPXYbKuXI4L5PJaSKQ85
 w4cbD0iOsk6MyfZR2wzj/yESzcJQz40A0sAOxGjak0ZgnGL6LrkKgcZhEBRMR6cs
 zl4TF6R2bCbSDSpnSxHVmCKU+rzENoXql6IE/9ExMNMbo1t0NDrlB+Okojj+n2oR
 wZOHxEPxAzeZVNx4pBiuVlRarSZ54TS0miLROnCRUfbwn4chDMOEbB+i5NwFzey2
 2yfqV5cNUAy0bzbz/Zw5DcF+FNkYadgPdDHZLGu0w6HJ8//V1gZKg/6Ca3Cc/iK0
 WNtHNZUOpCFLPesosNpt
 =+QqD
 -----END PGP SIGNATURE-----

Merge tag 'v4.4.30' into android-4.4.y

This is the 4.4.30 stable release
2016-11-07 10:23:59 -08:00
Gerald Schaefer
b5784d4209 mm/hugetlb: fix memory offline with hugepage size > memory block size
commit 2247bb335ab9c40058484cac36ea74ee652f3b7b upstream.

Patch series "mm/hugetlb: memory offline issues with hugepages", v4.

This addresses several issues with hugepages and memory offline.  While
the first patch fixes a panic, and is therefore rather important, the
last patch is just a performance optimization.

The second patch fixes a theoretical issue with reserved hugepages,
while still leaving some ugly usability issue, see description.

This patch (of 3):

dissolve_free_huge_pages() will either run into the VM_BUG_ON() or a
list corruption and addressing exception when trying to set a memory
block offline that is part (but not the first part) of a "gigantic"
hugetlb page with a size > memory block size.

When no other smaller hugetlb page sizes are present, the VM_BUG_ON()
will trigger directly.  In the other case we will run into an addressing
exception later, because dissolve_free_huge_page() will not work on the
head page of the compound hugetlb page which will result in a NULL
hstate from page_hstate().

To fix this, first remove the VM_BUG_ON() because it is wrong, and then
use the compound head page in dissolve_free_huge_page().  This means
that an unused pre-allocated gigantic page that has any part of itself
inside the memory block that is going offline will be dissolved
completely.  Losing an unused gigantic hugepage is preferable to failing
the memory offline, for example in the situation where a (possibly
faulty) memory DIMM needs to go offline.

Changes for v4.4 stable:
  - make it apply w/o commit c1470b33 "mm/hugetlb: fix incorrect
    hugepages count during mem hotplug"

Fixes: c8721bbb ("mm: memory-hotplug: enable memory hotplug to handle hugepage")
Link: http://lkml.kernel.org/r/20160926172811.94033-2-gerald.schaefer@de.ibm.com
Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Rui Teng <rui.teng@linux.vnet.ibm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-10-31 04:13:58 -06:00
Dmitry Shmidt
c302df26cb This is the 4.4.28 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCAAGBQJYEwP8AAoJEDjbvchgkmk+a+AQALRSaM0ngEiS/y8adwUhRKn/
 C1wimkHCPZaV6vrqRnp1VuvidhPI1YFQrAbcESMegpo1n87jdl2CcwcpHdYvhBrt
 W5a0ezIyugsE6Olgrf+gtwBl0mfpB7ZW2h2/M0yZYskjyRLDBKS5EwEVbX7Y0BEB
 OwdDJJw707U36fPgPEWzzOBDa/DBy+QNYeflzCbsLWX+dCMQ+pjrF7tTT5/oOZoO
 +er1LgO5onAc9kooOqbv8QapfsRD1zGQHjb9QvjYRvONz1VeggfgsywNWiGJ8lS2
 lyqoT+6jODpvuFwRNimb7+EPdZ2siFoTYHbdmSKOE479T8uNPZMP/cFmRt8YDTyl
 c7bmrE/igOH8wcgJniIuZz9BJm/ElT6a+gijI/u0I2ygj1SBKeIa9sThRTCGnmQM
 X2iQ9zK1YdCAQ8PKKt965/AnjnLXojg0NvKoMcMjCzRFtQ7B77hw38KXzq8rY9w9
 mThOivQm3InZK2fRURT4HaBzoc2mhGaaK4HtfUQV0g+kky8MmFbibkKo8dUVEryN
 Vjm2EYjbgbc9idVxkeaMVA4VE1XpNL4CUhrmsK0nOFHncNNuxbk9LVfvdc9y9O65
 ypRiYDGXjhkvRNvTCvOB+pcWLiW/WVeM+bCeZzlFX0fqRWCGjEROHgGio9GbPlnM
 S+jgnP/F/s/szlR4+U8g
 =kv4U
 -----END PGP SIGNATURE-----

Merge tag 'v4.4.28' into android-4.4.y

This is the 4.4.28 stable release
2016-10-28 10:44:19 -07:00
Johannes Weiner
ddafc88008 mm: filemap: fix mapping->nrpages double accounting in fuse
commit 3ddf40e8c31964b744ff10abb48c8e36a83ec6e7 upstream.

Commit 22f2ac51b6d6 ("mm: workingset: fix crash in shadow node shrinker
caused by replace_page_cache_page()") switched replace_page_cache() from
raw radix tree operations to page_cache_tree_insert() but didn't take
into account that the latter function, unlike the raw radix tree op,
handles mapping->nrpages.  As a result, that counter is bumped for each
page replacement rather than balanced out even.

The mapping->nrpages counter is used to skip needless radix tree walks
when invalidating, truncating, syncing inodes without pages, as well as
statistics for userspace.  Since the error is positive, we'll do more
page cache tree walks than necessary; we won't miss a necessary one.
And we'll report more buffer pages to userspace than there are.  The
error is limited to fuse inodes.

Fixes: 22f2ac51b6d6 ("mm: workingset: fix crash in shadow node shrinker caused by replace_page_cache_page()")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-10-28 03:01:34 -04:00
Johannes Weiner
f84311d7cd mm: workingset: fix crash in shadow node shrinker caused by replace_page_cache_page()
commit 22f2ac51b6d643666f4db093f13144f773ff3f3a upstream.

Antonio reports the following crash when using fuse under memory pressure:

  kernel BUG at /build/linux-a2WvEb/linux-4.4.0/mm/workingset.c:346!
  invalid opcode: 0000 [#1] SMP
  Modules linked in: all of them
  CPU: 2 PID: 63 Comm: kswapd0 Not tainted 4.4.0-36-generic #55-Ubuntu
  Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3904 04/27/2013
  task: ffff88040cae6040 ti: ffff880407488000 task.ti: ffff880407488000
  RIP: shadow_lru_isolate+0x181/0x190
  Call Trace:
    __list_lru_walk_one.isra.3+0x8f/0x130
    list_lru_walk_one+0x23/0x30
    scan_shadow_nodes+0x34/0x50
    shrink_slab.part.40+0x1ed/0x3d0
    shrink_zone+0x2ca/0x2e0
    kswapd+0x51e/0x990
    kthread+0xd8/0xf0
    ret_from_fork+0x3f/0x70

which corresponds to the following sanity check in the shadow node
tracking:

  BUG_ON(node->count & RADIX_TREE_COUNT_MASK);

The workingset code tracks radix tree nodes that exclusively contain
shadow entries of evicted pages in them, and this (somewhat obscure)
line checks whether there are real pages left that would interfere with
reclaim of the radix tree node under memory pressure.

While discussing ways how fuse might sneak pages into the radix tree
past the workingset code, Miklos pointed to replace_page_cache_page(),
and indeed there is a problem there: it properly accounts for the old
page being removed - __delete_from_page_cache() does that - but then
does a raw raw radix_tree_insert(), not accounting for the replacement
page.  Eventually the page count bits in node->count underflow while
leaving the node incorrectly linked to the shadow node LRU.

To address this, make sure replace_page_cache_page() uses the tracked
page insertion code, page_cache_tree_insert().  This fixes the page
accounting and makes sure page-containing nodes are properly unlinked
from the shadow node LRU again.

Also, make the sanity checks a bit less obscure by using the helpers for
checking the number of pages and shadows in a radix tree node.

[mhocko@suse.com: backport for 4.4]
Fixes: 449dd6984d ("mm: keep page cache radix tree nodes in check")
Link: http://lkml.kernel.org/r/20160919155822.29498-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Antonio SJ Musumeci <trapexit@spawn.link>
Debugged-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-10-28 03:01:34 -04:00
Johannes Weiner
b52b7b5a5c mm: filemap: don't plant shadow entries without radix tree node
commit d3798ae8c6f3767c726403c2ca6ecc317752c9dd upstream.

When the underflow checks were added to workingset_node_shadow_dec(),
they triggered immediately:

  kernel BUG at ./include/linux/swap.h:276!
  invalid opcode: 0000 [#1] SMP
  Modules linked in: isofs usb_storage fuse xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_REJECT nf_reject_ipv6
   soundcore wmi acpi_als pinctrl_sunrisepoint kfifo_buf tpm_tis industrialio acpi_pad pinctrl_intel tpm_tis_core tpm nfsd auth_rpcgss nfs_acl lockd grace sunrpc dm_crypt
  CPU: 0 PID: 20929 Comm: blkid Not tainted 4.8.0-rc8-00087-gbe67d60ba944 #1
  Hardware name: System manufacturer System Product Name/Z170-K, BIOS 1803 05/06/2016
  task: ffff8faa93ecd940 task.stack: ffff8faa7f478000
  RIP: page_cache_tree_insert+0xf1/0x100
  Call Trace:
    __add_to_page_cache_locked+0x12e/0x270
    add_to_page_cache_lru+0x4e/0xe0
    mpage_readpages+0x112/0x1d0
    blkdev_readpages+0x1d/0x20
    __do_page_cache_readahead+0x1ad/0x290
    force_page_cache_readahead+0xaa/0x100
    page_cache_sync_readahead+0x3f/0x50
    generic_file_read_iter+0x5af/0x740
    blkdev_read_iter+0x35/0x40
    __vfs_read+0xe1/0x130
    vfs_read+0x96/0x130
    SyS_read+0x55/0xc0
    entry_SYSCALL_64_fastpath+0x13/0x8f
  Code: 03 00 48 8b 5d d8 65 48 33 1c 25 28 00 00 00 44 89 e8 75 19 48 83 c4 18 5b 41 5c 41 5d 41 5e 5d c3 0f 0b 41 bd ef ff ff ff eb d7 <0f> 0b e8 88 68 ef ff 0f 1f 84 00
  RIP  page_cache_tree_insert+0xf1/0x100

This is a long-standing bug in the way shadow entries are accounted in
the radix tree nodes. The shrinker needs to know when radix tree nodes
contain only shadow entries, no pages, so node->count is split in half
to count shadows in the upper bits and pages in the lower bits.

Unfortunately, the radix tree implementation doesn't know of this and
assumes all entries are in node->count. When there is a shadow entry
directly in root->rnode and the tree is later extended, the radix tree
implementation will copy that entry into the new node and and bump its
node->count, i.e. increases the page count bits. Once the shadow gets
removed and we subtract from the upper counter, node->count underflows
and triggers the warning. Afterwards, without node->count reaching 0
again, the radix tree node is leaked.

Limit shadow entries to when we have actual radix tree nodes and can
count them properly. That means we lose the ability to detect refaults
from files that had only the first page faulted in at eviction time.

Fixes: 449dd6984d ("mm: keep page cache radix tree nodes in check")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-and-tested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-10-28 03:01:32 -04:00
Dmitry Shmidt
59fc70469a Merge remote-tracking branch 'common/android-4.4' into android-4.4.y
Change-Id: I8c5ec371d8b612f6880b2428893bec89d7da71f6
2016-10-21 13:50:06 -07:00
Dmitry Shmidt
b8fa4a3ee5 This is the 4.4.26 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCAAGBQJYCHnGAAoJEDjbvchgkmk+/tAP/RvQZ+Ft5ItRdlAxrXYgTFda
 PXpyUiCPC8XnrzBClXRoA+EiERIpckC5THdU3uUok4BWuALsgrhxK8YxfHXWLQWe
 2cqn19QPiFMZFz6E1sktWtwk63FH4uyfSXQ7/Roun5oQR4Ltg+Q6wRWlf33ox03t
 v8WVv880M4Q2pbp08USCDD8/EcWj3BNK2qFb05RD6pyW9jgIToSXEFoxtKsRiwDv
 QvUTXRkQOfo6Ui9zQTPzQR/ltahb6W/78fEiqu0G1p+uLkTqwusV33t4T64F+829
 s7MTMb7vnQWwuWKEcWy+CMjMwwUfrZHi6gl0PcBxHatFW/AOADixag0PYZWEku9/
 ok9Ltx63ZaSXvGy9odzIgjB/VIjdQXfPwRGOnNojJIvRDkb/zP3fCCglH9kXIwG2
 W5tXCOf43i1a6G3DyEkNEfCk66WYhuwyulGRRpo5RjsiPXnLiKbX9rqVh+WA3ucZ
 jtnZRFKGu2AsXes4tp/aip46zu73BL3wkKhI08z5ipUoV1PDEUb0fxhw0ZkEI/J7
 +tKotIQXdc190qtQNdkBwYAxujlzwhVYJAb9dwgYpWweVL5cEGIp5+sV5isJ5u4M
 0eD9pLVKsgG/BocnNve/h2heRNtTCq34vQrGEnh437NCrZbMGiRhwnI8HRCstqhr
 ECbkLL5cDnwtRiieBYSj
 =RoKA
 -----END PGP SIGNATURE-----

Merge tag 'v4.4.26' into android-4.4.y

This is the 4.4.26 stable release
2016-10-21 13:40:45 -07:00
Linus Torvalds
1294d35588 mm: remove gup_flags FOLL_WRITE games from __get_user_pages()
commit 19be0eaffa3ac7d8eb6784ad9bdbc7d67ed8e619 upstream.

This is an ancient bug that was actually attempted to be fixed once
(badly) by me eleven years ago in commit 4ceb5db975 ("Fix
get_user_pages() race for write access") but that was then undone due to
problems on s390 by commit f33ea7f404 ("fix get_user_pages bug").

In the meantime, the s390 situation has long been fixed, and we can now
fix it by checking the pte_dirty() bit properly (and do it better).  The
s390 dirty bit was implemented in abf09bed3c ("s390/mm: implement
software dirty bits") which made it into v3.9.  Earlier kernels will
have to look at the page state itself.

Also, the VM has become more scalable, and what used a purely
theoretical race back then has become easier to trigger.

To fix it, we introduce a new internal FOLL_COW flag to mark the "yes,
we already did a COW" rather than play racy games with FOLL_WRITE that
is very fundamental, and then use the pte dirty flag to validate that
the FOLL_COW flag is still valid.

Reported-and-tested-by: Phil "not Paul" Oester <kernel@linuxace.com>
Acked-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Michal Hocko <mhocko@suse.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Willy Tarreau <w@1wt.eu>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: Greg Thelen <gthelen@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-10-20 10:00:47 +02:00
Dmitry Torokhov
e78f134a78 CHROMIUM: remove Android's cgroup generic permissions checks
The implementation is utterly broken, resulting in all processes being
allows to move tasks between sets (as long as they have access to the
"tasks" attribute), and upstream is heading towards checking only
capability anyway, so let's get rid of this code.

BUG=b:31790445,chromium:647994
TEST=Boot android container, examine logcat

Change-Id: I2f780a5992c34e52a8f2d0b3557fc9d490da2779
Signed-off-by: Dmitry Torokhov <dtor@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/394967
Reviewed-by: Ricky Zhou <rickyz@chromium.org>
Reviewed-by: John Stultz <john.stultz@linaro.org>
2016-10-17 22:55:19 -07:00
Dmitry Shmidt
14de94f03d This is the 4.4.24 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCAAGBQJX96H2AAoJEDjbvchgkmk+MqcQAJuhiwLmCsXRKXGujGByPi5P
 vk+mnkt8o2UpamvT4KVRnWQrJuN8EHDHg29esGXKHV9Ahdmw/UfnXvbz3P0auet9
 GvMi4rKZpL3vD/dcMxshQchRKF7SwUbNNMkyJv1WYCjLex7W1LU/NOQV5VLx21i6
 9E/R/ARrazGhrGqanfm4NhIZYOR9QAWCrsc8pbJiE2OSty1HbQCLapA8gWg2PSnz
 sTlH0BF9tJ2kKSnyYjXM1Xb1zHbZuj83qEELhSnXsGK71Sq/8jIH9a5SiUwSDtdt
 szGp+vLODPqIMYa01qyLtFA2tvkusvKDUps8vtZ5mp9t38u2R8TDA+CAbz6w19mb
 C0d9abvZ9l1pIe/96OdgZkdSmGG2DC5hxnk3eaxhsyHn6RkIXfB9igH5+Fk7r/nm
 Yq15xxOIu6DCcuoQesUcAHoIR2961kbo/ZnnUEy2hRsqvR3/21X7qW3oj68uTdnB
 QQtMc1jq32toaZFk21ojLDtxKAlVqVHuslQ0hsMMgKtADZAveWpqZj408aNlPOi8
 CFHoEAxYXCQItOhRCoQeC1mljahvhEBI9N+5Zbpf30q5imLKen9hQphytbKABEWN
 CVJ6h6YndrdnlN7cS/AQ62+SNDk4kLmeMomgXfB701WTJ1cvI6eW4q6WUSS+54DZ
 q+brnDATt0K3nUmsrpGM
 =JaSn
 -----END PGP SIGNATURE-----

Merge tag 'v4.4.24' into android-4.4.y

This is the 4.4.24 stable release
2016-10-14 13:34:43 -07:00
Dmitry Shmidt
fed8e608c3 Merge remote-tracking branch 'common/android-4.4' into android-4.4.y
Change-Id: I203e905e0a63db40a5bb8ee85fcac1e128736331
2016-10-10 13:10:27 -07:00
Dmitry Shmidt
09f6247a9c This is the 4.4.23 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCAAGBQJX7iBiAAoJEDjbvchgkmk+aIQQAIAZ97gsrZInLRZaLJCMS6Me
 4zZRry3pUDtrLkBglerFiKrJTG/mFzasJxyyHuvNU++C9Nu8GdIkslnZ6/g+BO4P
 xaX4PLeM4nCq33f8R5QX5dfM8qaCwWEdD01xK17Agrfcw8nljomPu3B1o8HnaFhb
 jZbmQ9I2yIpDivNorbHZAWZWV3fmk4brDbO/X60X6k4nn42ZSp5f2M2NlcirzR9/
 to5ZVEY51nrShXCJcoaNEMMd/lxPrsv1j5rI+WYibDlOJ4RTEy/UK+yJFgZqAXL9
 mou/A9D0p0uKAKH85s/5wpjvQ/7QsFRasW1HM1nEd8B3TqS2Xi9k+nYAD6S0HvE4
 IPKwPTpV9J+7ZixWSE4lorpHKZhhla+DVP09ZEZwJQlrxs/sGPQw2EgdhY5Kid1J
 Bd7dyxIUieF5sDFJnwYnsCGdtJaSaKW/Kpscz5q70bI2h8SugZYdIBpJdHTTe5cX
 vvfy+JaChpdcTLTgWh4XvgtWsabS2W2hFH6uBIkhy9hjhiflotBEG6WFOo3vrEC/
 lTqRx9AphBb9fW8hIIKNhI8gKEsAF7xzZ7/YHounGrBXCiJTbiogyysvNHkebHfd
 LbWtwMTSYrNNsu4ixiobofGu29PEDQW/i/emkUlF9jbKIL09bSGRaGet/qQOY6EJ
 WoHyxZAZCT+3xGuvhTcj
 =5gCn
 -----END PGP SIGNATURE-----

Merge tag 'v4.4.23' into android-4.4.y

This is the 4.4.23 stable release
2016-10-10 12:43:03 -07:00
zhong jiang
a2fc10c36f mm,ksm: fix endless looping in allocating memory when ksm enable
commit 5b398e416e880159fe55eefd93c6588fa072cd66 upstream.

I hit the following hung task when runing a OOM LTP test case with 4.1
kernel.

Call trace:
[<ffffffc000086a88>] __switch_to+0x74/0x8c
[<ffffffc000a1bae0>] __schedule+0x23c/0x7bc
[<ffffffc000a1c09c>] schedule+0x3c/0x94
[<ffffffc000a1eb84>] rwsem_down_write_failed+0x214/0x350
[<ffffffc000a1e32c>] down_write+0x64/0x80
[<ffffffc00021f794>] __ksm_exit+0x90/0x19c
[<ffffffc0000be650>] mmput+0x118/0x11c
[<ffffffc0000c3ec4>] do_exit+0x2dc/0xa74
[<ffffffc0000c46f8>] do_group_exit+0x4c/0xe4
[<ffffffc0000d0f34>] get_signal+0x444/0x5e0
[<ffffffc000089fcc>] do_signal+0x1d8/0x450
[<ffffffc00008a35c>] do_notify_resume+0x70/0x78

The oom victim cannot terminate because it needs to take mmap_sem for
write while the lock is held by ksmd for read which loops in the page
allocator

ksm_do_scan
	scan_get_next_rmap_item
		down_read
		get_next_rmap_item
			alloc_rmap_item   #ksmd will loop permanently.

There is no way forward because the oom victim cannot release any memory
in 4.1 based kernel.  Since 4.6 we have the oom reaper which would solve
this problem because it would release the memory asynchronously.
Nevertheless we can relax alloc_rmap_item requirements and use
__GFP_NORETRY because the allocation failure is acceptable as ksm_do_scan
would just retry later after the lock got dropped.

Such a patch would be also easy to backport to older stable kernels which
do not have oom_reaper.

While we are at it add GFP_NOWARN so the admin doesn't have to be alarmed
by the allocation failure.

Link: http://lkml.kernel.org/r/1474165570-44398-1-git-send-email-zhongjiang@huawei.com
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Suggested-by: Hugh Dickins <hughd@google.com>
Suggested-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-10-07 15:23:40 +02:00
Tejun Heo
3b41e21ffa UPSTREAM: percpu: fix synchronization between synchronous map extension and chunk destruction
(cherry picked from commit 6710e594f71ccaad8101bc64321152af7cd9ea28)

For non-atomic allocations, pcpu_alloc() can try to extend the area
map synchronously after dropping pcpu_lock; however, the extension
wasn't synchronized against chunk destruction and the chunk might get
freed while extension is in progress.

This patch fixes the bug by putting most of non-atomic allocations
under pcpu_alloc_mutex to synchronize against pcpu_balance_work which
is responsible for async chunk management including destruction.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Reported-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Cc: stable@vger.kernel.org # v3.18+
Fixes: 1a4d76076c ("percpu: implement asynchronous chunk population")
Change-Id: I8800962e658e78eac866fff4a4e00294c58a3dec
Bug: 31596597
2016-10-06 21:45:33 -07:00
Tejun Heo
077d045666 UPSTREAM: percpu: fix synchronization between chunk->map_extend_work and chunk destruction
(cherry picked from commit 4f996e234dad488e5d9ba0858bc1bae12eff82c3)

Atomic allocations can trigger async map extensions which is serviced
by chunk->map_extend_work.  pcpu_balance_work which is responsible for
destroying idle chunks wasn't synchronizing properly against
chunk->map_extend_work and may end up freeing the chunk while the work
item is still in flight.

This patch fixes the bug by rolling async map extension operations
into pcpu_balance_work.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Reported-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Cc: stable@vger.kernel.org # v3.18+
Fixes: 9c824b6a17 ("percpu: make sure chunk->map array has available space")
Change-Id: I8f4aaf7fe0bc0e9f353d41e0a7840c40d6a32117
Bug: 31596597
2016-10-06 21:45:31 -07:00
Hugh Dickins
7af3e4e129 mm: delete unnecessary and unsafe init_tlb_ubc()
commit b385d21f27d86426472f6ae92a231095f7de2a8d upstream.

init_tlb_ubc() looked unnecessary to me: tlb_ubc is statically
initialized with zeroes in the init_task, and copied from parent to
child while it is quiescent in arch_dup_task_struct(); so I went to
delete it.

But inserted temporary debug WARN_ONs in place of init_tlb_ubc() to
check that it was always empty at that point, and found them firing:
because memcg reclaim can recurse into global reclaim (when allocating
biosets for swapout in my case), and arrive back at the init_tlb_ubc()
in shrink_node_memcg().

Resetting tlb_ubc.flush_required at that point is wrong: if the upper
level needs a deferred TLB flush, but the lower level turns out not to,
we miss a TLB flush.  But fortunately, that's the only part of the
protocol that does not nest: with the initialization removed, cpumask
collects bits from upper and lower levels, and flushes TLB when needed.

Fixes: 72b252aed5 ("mm: send one IPI per CPU to TLB flush all entries after unmapping pages")
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-09-30 10:18:38 +02:00