android_kernel_oneplus_msm8998/fs
Greg Thelen 6f051f8986 writeback: safer lock nesting
commit 2e898e4c0a3897ccd434adac5abb8330194f527b upstream.

lock_page_memcg()/unlock_page_memcg() use spin_lock_irqsave/restore() if
the page's memcg is undergoing move accounting, which occurs when a
process leaves its memcg for a new one that has
memory.move_charge_at_immigrate set.

unlocked_inode_to_wb_begin,end() use spin_lock_irq/spin_unlock_irq() if
the given inode is switching writeback domains.  Switches occur when
enough writes are issued from a new domain.

This existing pattern is thus suspicious:
    lock_page_memcg(page);
    unlocked_inode_to_wb_begin(inode, &locked);
    ...
    unlocked_inode_to_wb_end(inode, locked);
    unlock_page_memcg(page);

If both inode switch and process memcg migration are both in-flight then
unlocked_inode_to_wb_end() will unconditionally enable interrupts while
still holding the lock_page_memcg() irq spinlock.  This suggests the
possibility of deadlock if an interrupt occurs before unlock_page_memcg().

    truncate
    __cancel_dirty_page
    lock_page_memcg
    unlocked_inode_to_wb_begin
    unlocked_inode_to_wb_end
    <interrupts mistakenly enabled>
                                    <interrupt>
                                    end_page_writeback
                                    test_clear_page_writeback
                                    lock_page_memcg
                                    <deadlock>
    unlock_page_memcg

Due to configuration limitations this deadlock is not currently possible
because we don't mix cgroup writeback (a cgroupv2 feature) and
memory.move_charge_at_immigrate (a cgroupv1 feature).

If the kernel is hacked to always claim inode switching and memcg
moving_account, then this script triggers lockup in less than a minute:

  cd /mnt/cgroup/memory
  mkdir a b
  echo 1 > a/memory.move_charge_at_immigrate
  echo 1 > b/memory.move_charge_at_immigrate
  (
    echo $BASHPID > a/cgroup.procs
    while true; do
      dd if=/dev/zero of=/mnt/big bs=1M count=256
    done
  ) &
  while true; do
    sync
  done &
  sleep 1h &
  SLEEP=$!
  while true; do
    echo $SLEEP > a/cgroup.procs
    echo $SLEEP > b/cgroup.procs
  done

The deadlock does not seem possible, so it's debatable if there's any
reason to modify the kernel.  I suggest we should to prevent future
surprises.  And Wang Long said "this deadlock occurs three times in our
environment", so there's more reason to apply this, even to stable.
Stable 4.4 has minor conflicts applying this patch.  For a clean 4.4 patch
see "[PATCH for-4.4] writeback: safer lock nesting"
https://lkml.org/lkml/2018/4/11/146

Wang Long said "this deadlock occurs three times in our environment"

[gthelen@google.com: v4]
  Link: http://lkml.kernel.org/r/20180411084653.254724-1-gthelen@google.com
[akpm@linux-foundation.org: comment tweaks, struct initialization simplification]
Change-Id: Ibb773e8045852978f6207074491d262f1b3fb613
Link: http://lkml.kernel.org/r/20180410005908.167976-1-gthelen@google.com
Fixes: 682aa8e1a6 ("writeback: implement unlocked_inode_to_wb transaction and use it for stat updates")
Signed-off-by: Greg Thelen <gthelen@google.com>
Reported-by: Wang Long <wanglong19@meituan.com>
Acked-by: Wang Long <wanglong19@meituan.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: <stable@vger.kernel.org>	[v4.2+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[natechancellor: Applied to 4.4 based on Greg's backport on lkml.org]
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-24 09:32:12 +02:00
..
9p fs/9p: Compare qid.path in v9fs_test_inode 2017-11-30 08:37:22 +00:00
adfs
affs affs: fix remount failure when there are no options changed 2016-06-07 18:14:32 -07:00
afs afs: Fix afs_kill_pages() 2017-12-20 10:04:56 +01:00
autofs4 autofs: mount point create should honour passed in mode 2018-04-24 09:32:11 +02:00
befs
bfs
btrfs btrfs: fix incorrect error return ret being passed to mapping_set_error 2018-04-13 19:50:06 +02:00
cachefiles FS-Cache: Add missing initialization of ret in cachefiles_write_page() 2015-11-16 20:38:43 -05:00
ceph ceph: drop negative child dentries before try pruning inode's alias 2017-12-20 10:04:52 +01:00
cifs SMB2: Fix share type handling 2018-04-13 19:50:04 +02:00
coda coda: fix 'kernel memory exposure attempt' in fsync 2017-11-24 08:32:25 +01:00
configfs configfs: Fix race between create_link and configfs_rmdir 2017-06-26 07:13:08 +02:00
cramfs
debugfs dentry name snapshots 2017-08-06 19:19:42 -07:00
devpts devpts: clean up interface to pty drivers 2016-08-16 09:30:49 +02:00
dlm dlm: avoid double-free on error path in dlm_device_{register,unregister} 2017-09-13 14:09:45 -07:00
ecryptfs eCryptfs: use after free in ecryptfs_release_messaging() 2017-11-30 08:37:20 +00:00
efivarfs efi: Make efivarfs entries immutable by default 2016-03-03 15:07:09 -08:00
efs
exofs osd fs: __r4w_get_page rely on PageUptodate for uptodate 2015-12-12 10:15:34 -08:00
exportfs
ext2 ext2: Don't clear SGID when inheriting ACLs 2018-01-31 12:06:11 +01:00
ext4 ext4: bugfix for mmaped pages in mpage_release_unused_pages() 2018-04-24 09:32:11 +02:00
f2fs f2fs: relax node version check for victim data in gc 2018-03-22 09:23:22 +01:00
fat fat: fix using uninitialized fields of fat_inode/fsinfo_inode 2017-03-15 09:57:15 +08:00
freevxfs
fscache FS-Cache: fix dereference of NULL user_key_payload 2017-10-27 10:23:18 +02:00
fuse fuse: fix READDIRPLUS skipping an entry 2017-11-02 09:40:49 +01:00
gfs2 GFS2: Take inode off order_write list when setting jdata flag 2017-12-20 10:04:59 +01:00
hfs
hfsplus posix_acl: Clear SGID bit when setting file permissions 2016-10-31 04:13:58 -06:00
hostfs hostfs: Freeing an ERR_PTR in hostfs_fill_sb_common() 2016-09-30 10:18:39 +02:00
hpfs hpfs: implement the show_options method 2016-06-01 12:15:54 -07:00
hugetlbfs mm: larger stack guard gap, between vmas 2017-06-26 07:13:11 +02:00
isofs isofs: fix timestamps beyond 2027 2017-11-30 08:37:20 +00:00
jbd2 jbd2: if the journal is aborted then don't allow update of the log tail 2018-04-24 09:32:07 +02:00
jffs2 jffs2_kill_sb(): deal with failed allocations 2018-04-24 09:32:11 +02:00
jfs fs: add i_blocksize() 2017-06-14 13:16:24 +02:00
kernfs kernfs: fix regression in kernfs_fop_write caused by wrong type 2018-02-16 20:09:42 +01:00
lockd lockd: fix lockd shutdown race 2018-04-13 19:50:02 +02:00
logfs mm, fs: introduce mapping_gfp_constraint() 2015-11-06 17:50:42 -08:00
minix
ncpfs staging: ncpfs: memory corruption in ncp_read_kernel() 2018-03-28 18:40:15 +02:00
nfs pNFS/flexfiles: missing error code in ff_layout_alloc_lseg() 2018-04-13 19:50:10 +02:00
nfs_common lockd: fix "list_add double add" caused by legacy signal interface 2018-02-03 17:04:28 +01:00
nfsd nfsd4: permit layoutget of executable-only files 2018-03-24 10:58:48 +01:00
nilfs2 nilfs2: fix race condition that causes file system corruption 2017-11-30 08:37:20 +00:00
nls
notify fanotify: fix logic of events on child 2018-04-24 09:32:11 +02:00
ntfs mm, fs: introduce mapping_gfp_constraint() 2015-11-06 17:50:42 -08:00
ocfs2 Revert "ocfs2: should wait dio before inode lock in ocfs2_setattr()" 2017-12-09 18:42:43 +01:00
omfs
openpromfs
overlayfs ovl: filter trusted xattr for non-admin 2018-04-13 19:50:14 +02:00
proc fs/proc: Stop trying to report thread stacks 2018-04-08 11:52:00 +02:00
pstore pstore: Use dynamic spinlock initializer 2017-08-06 19:19:43 -07:00
qnx4
qnx6
quota quota: Check for register_shrinker() failure. 2018-02-03 17:04:28 +01:00
ramfs
reiserfs fs/reiserfs/journal.c: add missing resierfs_warning() arg 2018-04-24 09:32:05 +02:00
romfs romfs: use different way to generate fsid for BLOCK or MTD 2017-06-17 06:39:38 +02:00
squashfs squashfs: xattr simplifications 2015-11-13 20:34:33 -05:00
sysfs sysfs: be careful of error returns from ops->show() 2017-04-12 12:38:33 +02:00
sysv fix sysvfs symlinks 2015-11-23 21:11:08 -05:00
tracefs tracefs: Fix refcount imbalance in start_creating() 2015-11-04 22:13:45 -05:00
ubifs ubifs: Check ubifs_wbuf_sync() return code 2018-04-24 09:32:05 +02:00
udf udf: Avoid overflow when session starts at large offset 2017-12-20 10:05:01 +01:00
ufs ufs_getfrag_block(): we only grab ->truncate_mutex on block creation path 2017-06-14 13:16:24 +02:00
xfs xfs: quota: check result of register_shrinker() 2018-03-03 10:19:44 +01:00
aio.c fs/aio: Use RCU accessors for kioctx_table->table[] 2018-03-22 09:23:31 +01:00
anon_inodes.c
attr.c vfs: move permission checking into notify_change() for utimes(NULL) 2016-10-22 12:26:56 +02:00
bad_inode.c
binfmt_aout.c
binfmt_elf.c binfmt_elf: use ELF_ET_DYN_BASE only for PIE 2017-07-21 07:44:57 +02:00
binfmt_elf_fdpic.c libnvdimm for 4.4: 2015-11-10 12:07:22 -08:00
binfmt_em86.c
binfmt_flat.c
binfmt_misc.c
binfmt_script.c
block_dev.c fs/block_dev: always invalidate cleancache in invalidate_bdev() 2017-05-20 14:27:01 +02:00
buffer.c fs: add i_blocksize() 2017-06-14 13:16:24 +02:00
char_dev.c
compat.c
compat_binfmt_elf.c binfmt_elf: compat: avoid unused function warning 2018-02-25 11:03:51 +01:00
compat_ioctl.c fs: compat: Remove warning from COMPATIBLE_IOCTL 2018-04-08 11:51:57 +02:00
coredump.c coredump: Ensure proper size of sparse core files 2017-07-05 14:37:20 +02:00
dax.c dax: disable pmd mappings 2015-11-16 23:54:45 -08:00
dcache.c lock_parent() needs to recheck if dentry got __dentry_kill'ed under it 2018-03-22 09:23:31 +01:00
dcookies.c
direct-io.c direct-io: Prevent NULL pointer access in submit_page_section 2017-10-18 09:20:42 +02:00
drop_caches.c
eventfd.c
eventpoll.c epoll: fix race between ep_poll_callback(POLLFREE) and ep_free()/ep_remove() 2017-09-07 08:34:10 +02:00
exec.c exec: Limit arg stack to at most 75% of _STK_LIM 2017-07-21 07:44:57 +02:00
fcntl.c fs/fcntl: f_setown, avoid undefined behaviour 2018-01-31 12:06:11 +01:00
fhandle.c fs/coredump: prevent fsuid=0 dumps into user-controlled directories 2016-04-12 09:08:58 -07:00
file.c vfs: clear remainder of 'full_fds_bits' in dup_fd() 2015-11-05 23:05:32 -08:00
file_table.c
filesystems.c
fs-writeback.c writeback: safer lock nesting 2018-04-24 09:32:12 +02:00
fs_pin.c
fs_struct.c
inode.c don't put symlink bodies in pagecache into highmem 2018-02-16 20:09:38 +01:00
internal.h
ioctl.c
Kconfig dax: disable pmd mappings 2015-11-16 23:54:45 -08:00
Kconfig.binfmt
libfs.c
locks.c locks: don't check for race with close when setting OFD lock 2018-01-17 09:35:27 +01:00
Makefile
mbcache.c
mount.h mnt: In propgate_umount handle visiting mounts in any order 2017-07-21 07:44:57 +02:00
mpage.c fs: add i_blocksize() 2017-06-14 13:16:24 +02:00
namei.c getname_kernel() needs to make sure that ->name != ->iname in long case 2018-04-24 09:32:04 +02:00
namespace.c Don't leak MNT_INTERNAL away from internal mounts 2018-04-24 09:32:11 +02:00
no-block.c
nsfs.c nsfs: mark dentry with DCACHE_RCUACCESS 2018-02-16 20:09:43 +01:00
open.c fs: completely ignore unknown open flags 2017-07-15 11:57:44 +02:00
pipe.c pipe: avoid round_pipe_size() nr_pages overflow on 32-bit 2018-01-23 19:50:15 +01:00
pnode.c mnt: Make propagate_umount less slow for overlapping mount propagation trees 2017-07-21 07:44:58 +02:00
pnode.h mnt: Add a per mount namespace limit on the number of mounts 2017-04-30 05:49:28 +02:00
posix_acl.c tmpfs: clear S_ISGID when setting posix ACLs 2017-01-26 08:23:47 +01:00
proc_namespace.c vfs: show_vfsstat: do not ignore errors from show_devname method 2016-04-12 09:08:55 -07:00
read_write.c vfs: Return -ENXIO for negative SEEK_HOLE / SEEK_DATA offsets 2017-10-05 09:41:45 +02:00
readdir.c
select.c fs/select: add vmalloc fallback for select(2) 2018-01-31 12:06:09 +01:00
seq_file.c Make file credentials available to the seqfile interfaces 2017-08-06 19:19:42 -07:00
signalfd.c
splice.c vfs: fix uninitialized flags in splice_to_pipe() 2017-02-23 17:43:09 +01:00
stack.c
stat.c ufs: restore maintaining ->i_blocks 2017-06-14 13:16:24 +02:00
statfs.c
super.c sget(): handle failures of register_shrinker() 2018-03-03 10:19:41 +01:00
sync.c fs/sync.c: make sync_file_range(2) use WB_SYNC_NONE writeback 2015-11-06 17:50:42 -08:00
timerfd.c timerfd: Protect the might cancel mechanism proper 2017-05-08 07:46:01 +02:00
userfaultfd.c userfaultfd: shmem: __do_fault requires VM_FAULT_NOPAGE 2017-12-20 10:04:53 +01:00
utimes.c vfs: move permission checking into notify_change() for utimes(NULL) 2016-10-22 12:26:56 +02:00
xattr.c lsm: fix smack_inode_removexattr and xattr_getsecurity memleak 2017-10-12 11:27:32 +02:00