android_kernel_oneplus_msm8998/fs
Mike Kravetz aba029c8e7 hugetlbfs: fix races and page leaks during migration
commit cb6acd01e2e43fd8bad11155752b7699c3d0fb76 upstream.

hugetlb pages should only be migrated if they are 'active'.  The
routines set/clear_page_huge_active() modify the active state of hugetlb
pages.

When a new hugetlb page is allocated at fault time, set_page_huge_active
is called before the page is locked.  Therefore, another thread could
race and migrate the page while it is being added to page table by the
fault code.  This race is somewhat hard to trigger, but can be seen by
strategically adding udelay to simulate worst case scheduling behavior.
Depending on 'how' the code races, various BUG()s could be triggered.

To address this issue, simply delay the set_page_huge_active call until
after the page is successfully added to the page table.

Hugetlb pages can also be leaked at migration time if the pages are
associated with a file in an explicitly mounted hugetlbfs filesystem.
For example, consider a two node system with 4GB worth of huge pages
available.  A program mmaps a 2G file in a hugetlbfs filesystem.  It
then migrates the pages associated with the file from one node to
another.  When the program exits, huge page counts are as follows:

  node0
  1024    free_hugepages
  1024    nr_hugepages

  node1
  0       free_hugepages
  1024    nr_hugepages

  Filesystem                         Size  Used Avail Use% Mounted on
  nodev                              4.0G  2.0G  2.0G  50% /var/opt/hugepool

That is as expected.  2G of huge pages are taken from the free_hugepages
counts, and 2G is the size of the file in the explicitly mounted
filesystem.  If the file is then removed, the counts become:

  node0
  1024    free_hugepages
  1024    nr_hugepages

  node1
  1024    free_hugepages
  1024    nr_hugepages

  Filesystem                         Size  Used Avail Use% Mounted on
  nodev                              4.0G  2.0G  2.0G  50% /var/opt/hugepool

Note that the filesystem still shows 2G of pages used, while there
actually are no huge pages in use.  The only way to 'fix' the filesystem
accounting is to unmount the filesystem

If a hugetlb page is associated with an explicitly mounted filesystem,
this information in contained in the page_private field.  At migration
time, this information is not preserved.  To fix, simply transfer
page_private from old to new page at migration time if necessary.

There is a related race with removing a huge page from a file and
migration.  When a huge page is removed from the pagecache, the
page_mapping() field is cleared, yet page_private remains set until the
page is actually freed by free_huge_page().  A page could be migrated
while in this state.  However, since page_mapping() is not set the
hugetlbfs specific routine to transfer page_private is not called and we
leak the page count in the filesystem.

To fix that, check for this condition before migrating a huge page.  If
the condition is detected, return EBUSY for the page.

Link: http://lkml.kernel.org/r/74510272-7319-7372-9ea6-ec914734c179@oracle.com
Link: http://lkml.kernel.org/r/20190212221400.3512-1-mike.kravetz@oracle.com
Fixes: bcc5422230 ("mm: hugetlb: introduce page_huge_active")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: <stable@vger.kernel.org>
[mike.kravetz@oracle.com: v2]
  Link: http://lkml.kernel.org/r/7534d322-d782-8ac6-1c8d-a8dc380eb3ab@oracle.com
[mike.kravetz@oracle.com: update comment and changelog]
  Link: http://lkml.kernel.org/r/420bcfd6-158b-38e4-98da-26d0cd85bd01@oracle.com
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-03-23 08:44:23 +01:00
..
9p v9fs_dir_readdir: fix double-free on p9stat_read error 2018-12-01 09:46:33 +01:00
adfs
affs affs_lookup(): close a race with affs_remove_link() 2018-05-30 07:48:51 +02:00
afs afs: Fix afs_kill_pages() 2017-12-20 10:04:56 +01:00
autofs4 autofs: fix autofs_sbi() does not check super block type 2018-09-19 22:49:00 +02:00
befs
bfs bfs: add sanity check at bfs_fill_super() 2018-12-01 09:46:33 +01:00
btrfs btrfs: wait on ordered extents on abort cleanup 2019-01-26 09:42:50 +01:00
cachefiles fscache, cachefiles: remove redundant variable 'cache' 2018-12-17 21:55:12 +01:00
ceph ceph: avoid repeatedly adding inode to mdsc->snap_flush_list 2019-03-23 08:44:15 +01:00
cifs cifs: Limit memory used by lock request calls to a page 2019-02-20 10:13:21 +01:00
coda coda: fix 'kernel memory exposure attempt' in fsync 2017-11-24 08:32:25 +01:00
configfs configfs: replace strncpy with memcpy 2018-11-21 09:27:44 +01:00
cramfs Cramfs: fix abad comparison when wrap-arounds occur 2018-11-21 09:27:37 +01:00
debugfs debugfs: fix debugfs_rename parameter checking 2019-02-20 10:13:18 +01:00
devpts devpts: clean up interface to pty drivers 2016-08-16 09:30:49 +02:00
dlm dlm: Don't swamp the CPU with callbacks queued during recovery 2019-02-20 10:13:04 +01:00
ecryptfs do d_instantiate/unlock_new_inode combinations safely 2018-05-30 07:48:52 +02:00
efivarfs efi: Make efivarfs entries immutable by default 2016-03-03 15:07:09 -08:00
efs
exofs fs/exofs: fix potential memory leak in mount option parsing 2018-11-27 16:08:00 +01:00
exportfs exportfs: do not read dentry after free 2018-12-17 21:55:10 +01:00
ext2 ext2: fix potential use after free 2018-12-13 09:21:27 +01:00
ext4 ext4: fix a potential fiemap/page fault deadlock w/ inline_data 2019-01-16 22:16:12 +01:00
f2fs f2fs: fix wrong return value of f2fs_acl_create 2019-02-20 10:13:06 +01:00
fat fs/fat/fatent.c: add cond_resched() to fat_count_free_clusters() 2018-11-10 07:41:40 -08:00
freevxfs
fscache fscache: fix race between enablement and dropping of object 2018-12-17 21:55:11 +01:00
fuse fuse: handle zero sized retrieve correctly 2019-02-20 10:13:16 +01:00
gfs2 gfs2: Revert "Fix loop in gfs2_rbm_find" 2019-02-06 19:43:07 +01:00
hfs hfs: do not free node before using 2018-12-17 21:55:12 +01:00
hfsplus hfsplus: do not free node before using 2018-12-17 21:55:12 +01:00
hostfs hostfs: Freeing an ERR_PTR in hostfs_fill_sb_common() 2016-09-30 10:18:39 +02:00
hpfs hpfs: implement the show_options method 2016-06-01 12:15:54 -07:00
hugetlbfs hugetlbfs: fix races and page leaks during migration 2019-03-23 08:44:23 +01:00
isofs isofs: fix timestamps beyond 2027 2017-11-30 08:37:20 +00:00
jbd2 jbd2: fix use after free in jbd2_log_do_checkpoint() 2018-11-21 09:27:34 +01:00
jffs2 jffs2: Fix use of uninitialized delayed_work, lockdep breakage 2019-01-26 09:42:53 +01:00
jfs jfs: Fix inconsistency between memory allocation and ea_buf->max_size 2018-08-09 12:19:28 +02:00
kernfs kernfs: Replace strncpy with memcpy 2018-12-13 09:21:29 +01:00
lockd lockd: fix access beyond unterminated strings in prints 2018-11-21 09:27:36 +01:00
logfs mm, fs: introduce mapping_gfp_constraint() 2015-11-06 17:50:42 -08:00
minix
ncpfs ncpfs: fix build warning of strncpy 2019-03-23 08:44:21 +01:00
nfs NFS: nfs_compare_mount_options always compare auth flavors. 2019-02-20 10:13:12 +01:00
nfs_common lockd: fix "list_add double add" caused by legacy signal interface 2018-02-03 17:04:28 +01:00
nfsd nfsd4: fix crash on writing v4_end_grace before nfsd startup 2019-02-20 10:13:07 +01:00
nilfs2 do d_instantiate/unlock_new_inode combinations safely 2018-05-30 07:48:52 +02:00
nls
notify fanotify: fix logic of events on child 2018-04-24 09:32:11 +02:00
ntfs mm, fs: introduce mapping_gfp_constraint() 2015-11-06 17:50:42 -08:00
ocfs2 ocfs2: don't clear bh uptodate for block read 2019-02-20 10:13:13 +01:00
omfs
openpromfs
overlayfs ovl: proper cleanup of workdir 2018-09-15 09:40:41 +02:00
proc proc: Remove empty line in /proc/self/status 2019-01-26 09:42:49 +01:00
pstore pstore/ram: Do not treat empty buffers as valid 2019-01-26 09:42:53 +01:00
qnx4
qnx6
quota fs/quota: Fix spectre gadget in do_quotactl 2018-09-09 20:04:36 +02:00
ramfs mm, fs: obey gfp_mapping for add_to_page_cache() 2015-10-16 11:42:28 -07:00
reiserfs reiserfs: propagate errors from fill_with_dentries() properly 2018-11-27 16:08:00 +01:00
romfs romfs: use different way to generate fsid for BLOCK or MTD 2017-06-17 06:39:38 +02:00
squashfs squashfs: more metadata hardenings 2018-08-06 16:24:42 +02:00
sysfs scsi: sysfs: Introduce sysfs_{un,}break_active_protection() 2018-09-05 09:18:40 +02:00
sysv sysv: return 'err' instead of 0 in __sysv_write_inode 2018-12-17 21:55:09 +01:00
tracefs tracefs: Fix refcount imbalance in start_creating() 2015-11-04 22:13:45 -05:00
ubifs ubifs: Check for name being NULL while mounting 2018-10-13 09:11:34 +02:00
udf udf: Fix BUG on corrupted inode 2019-02-20 10:13:09 +01:00
ufs do d_instantiate/unlock_new_inode combinations safely 2018-05-30 07:48:52 +02:00
xfs xfs: don't fail when converting shortform attr to long form during ATTR_REPLACE 2019-01-26 09:42:52 +01:00
aio.c aio: fix spectre gadget in lookup_ioctx 2018-12-21 14:09:50 +01:00
anon_inodes.c
attr.c vfs: move permission checking into notify_change() for utimes(NULL) 2016-10-22 12:26:56 +02:00
bad_inode.c
binfmt_aout.c
binfmt_elf.c fs, elf: make sure to page align bss in load_elf_library 2018-11-21 09:27:41 +01:00
binfmt_elf_fdpic.c libnvdimm for 4.4: 2015-11-10 12:07:22 -08:00
binfmt_em86.c
binfmt_flat.c
binfmt_misc.c fs/binfmt_misc.c: do not allow offset overflow 2018-07-03 11:21:26 +02:00
binfmt_script.c Revert "exec: load_script: don't blindly truncate shebang string" 2019-02-20 10:13:20 +01:00
block_dev.c fs/block_dev: always invalidate cleancache in invalidate_bdev() 2017-05-20 14:27:01 +02:00
buffer.c fs: add i_blocksize() 2017-06-14 13:16:24 +02:00
char_dev.c
compat.c
compat_binfmt_elf.c binfmt_elf: compat: avoid unused function warning 2018-02-25 11:03:51 +01:00
compat_ioctl.c fs: compat: Remove warning from COMPATIBLE_IOCTL 2018-04-08 11:51:57 +02:00
coredump.c coredump: Ensure proper size of sparse core files 2017-07-05 14:37:20 +02:00
dax.c dax: disable pmd mappings 2015-11-16 23:54:45 -08:00
dcache.c fs/dcache: Fix incorrect nr_dentry_unused accounting in shrink_dcache_sb() 2019-02-06 19:43:07 +01:00
dcookies.c
direct-io.c direct-io: Prevent NULL pointer access in submit_page_section 2017-10-18 09:20:42 +02:00
drop_caches.c
eventfd.c
eventpoll.c fs/epoll: drop ovflist branch prediction 2019-02-20 10:13:14 +01:00
exec.c mm: replace get_user_pages() write/force parameters with gup_flags 2018-12-17 21:55:16 +01:00
fcntl.c fs/fcntl: f_setown, avoid undefined behaviour 2018-01-31 12:06:11 +01:00
fhandle.c fs/coredump: prevent fsuid=0 dumps into user-controlled directories 2016-04-12 09:08:58 -07:00
file.c vfs: clear remainder of 'full_fds_bits' in dup_fd() 2015-11-05 23:05:32 -08:00
file_table.c
filesystems.c
fs-writeback.c bdi: Fix oops in wb_workfn() 2018-05-16 10:06:51 +02:00
fs_pin.c
fs_struct.c
inode.c Fix up non-directory creation in SGID directories 2018-07-17 11:31:43 +02:00
internal.h
ioctl.c
Kconfig dax: disable pmd mappings 2015-11-16 23:54:45 -08:00
Kconfig.binfmt
libfs.c
locks.c locks: don't check for race with close when setting OFD lock 2018-01-17 09:35:27 +01:00
Makefile ext4: promote ext4 over ext2 in the default probe order 2015-10-15 10:33:21 -04:00
mbcache.c
mount.h mnt: In propgate_umount handle visiting mounts in any order 2017-07-21 07:44:57 +02:00
mpage.c fs: add i_blocksize() 2017-06-14 13:16:24 +02:00
namei.c namei: allow restricted O_CREAT of FIFOs and regular files 2018-12-01 09:46:41 +01:00
namespace.c mount: Prevent MNT_DETACH from disconnecting locked mounts 2018-11-21 09:27:44 +01:00
no-block.c
nsfs.c nsfs: mark dentry with DCACHE_RCUACCESS 2018-02-16 20:09:43 +01:00
open.c fs: completely ignore unknown open flags 2017-07-15 11:57:44 +02:00
pipe.c pipe: cap initial pipe capacity according to pipe-max-size limit 2018-05-26 08:48:51 +02:00
pnode.c mnt: Make propagate_umount less slow for overlapping mount propagation trees 2017-07-21 07:44:58 +02:00
pnode.h mnt: Add a per mount namespace limit on the number of mounts 2017-04-30 05:49:28 +02:00
posix_acl.c tmpfs: clear S_ISGID when setting posix ACLs 2017-01-26 08:23:47 +01:00
proc_namespace.c vfs: show_vfsstat: do not ignore errors from show_devname method 2016-04-12 09:08:55 -07:00
read_write.c fs: add the fsnotify call to vfs_iter_write 2019-02-06 19:43:06 +01:00
readdir.c
select.c fs/select: add vmalloc fallback for select(2) 2018-01-31 12:06:09 +01:00
seq_file.c Make file credentials available to the seqfile interfaces 2017-08-06 19:19:42 -07:00
signalfd.c
splice.c vfs: fix uninitialized flags in splice_to_pipe() 2017-02-23 17:43:09 +01:00
stack.c
stat.c ufs: restore maintaining ->i_blocks 2017-06-14 13:16:24 +02:00
statfs.c
super.c fs: don't scan the inode cache before SB_BORN is set 2019-02-06 19:43:08 +01:00
sync.c fs/sync.c: make sync_file_range(2) use WB_SYNC_NONE writeback 2015-11-06 17:50:42 -08:00
timerfd.c timerfd: Protect the might cancel mechanism proper 2017-05-08 07:46:01 +02:00
userfaultfd.c userfaultfd: shmem: __do_fault requires VM_FAULT_NOPAGE 2017-12-20 10:04:53 +01:00
utimes.c vfs: move permission checking into notify_change() for utimes(NULL) 2016-10-22 12:26:56 +02:00
xattr.c getxattr: use correct xattr length 2018-09-09 20:04:36 +02:00