Commit graph

43388 commits

Author SHA1 Message Date
Kees Cook
fa77731910 pstore: Use dynamic spinlock initializer
commit e9a330c4289f2ba1ca4bf98c2b430ab165a8931b upstream.

The per-prz spinlock should be using the dynamic initializer so that
lockdep can correctly track it. Without this, under lockdep, we get a
warning at boot that the lock is in non-static memory.

Fixes: 109704492ef6 ("pstore: Make spinlock per zone instead of global")
Fixes: 76d5692a5803 ("pstore: Correctly initialize spinlock and flags")
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-08-06 19:19:43 -07:00
Kees Cook
9ece74e100 pstore: Correctly initialize spinlock and flags
commit 76d5692a58031696e282384cbd893832bc92bd76 upstream.

The ram backend wasn't always initializing its spinlock correctly. Since
it was coming from kzalloc memory, though, it was harmless on
architectures that initialize unlocked spinlocks to 0 (at least x86 and
ARM). This also fixes a possibly ignored flag setting too.

When running under CONFIG_DEBUG_SPINLOCK, the following Oops was visible:

[    0.760836] persistent_ram: found existing buffer, size 29988, start 29988
[    0.765112] persistent_ram: found existing buffer, size 30105, start 30105
[    0.769435] persistent_ram: found existing buffer, size 118542, start 118542
[    0.785960] persistent_ram: found existing buffer, size 0, start 0
[    0.786098] persistent_ram: found existing buffer, size 0, start 0
[    0.786131] pstore: using zlib compression
[    0.790716] BUG: spinlock bad magic on CPU#0, swapper/0/1
[    0.790729]  lock: 0xffffffc0d1ca9bb0, .magic: 00000000, .owner: <none>/-1, .owner_cpu: 0
[    0.790742] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.10.0-rc2+ #913
[    0.790747] Hardware name: Google Kevin (DT)
[    0.790750] Call trace:
[    0.790768] [<ffffff900808ae88>] dump_backtrace+0x0/0x2bc
[    0.790780] [<ffffff900808b164>] show_stack+0x20/0x28
[    0.790794] [<ffffff9008460ee0>] dump_stack+0xa4/0xcc
[    0.790809] [<ffffff9008113cfc>] spin_dump+0xe0/0xf0
[    0.790821] [<ffffff9008113d3c>] spin_bug+0x30/0x3c
[    0.790834] [<ffffff9008113e28>] do_raw_spin_lock+0x50/0x1b8
[    0.790846] [<ffffff9008a2d2ec>] _raw_spin_lock_irqsave+0x54/0x6c
[    0.790862] [<ffffff90083ac3b4>] buffer_size_add+0x48/0xcc
[    0.790875] [<ffffff90083acb34>] persistent_ram_write+0x60/0x11c
[    0.790888] [<ffffff90083aab1c>] ramoops_pstore_write_buf+0xd4/0x2a4
[    0.790900] [<ffffff90083a9d3c>] pstore_console_write+0xf0/0x134
[    0.790912] [<ffffff900811c304>] console_unlock+0x48c/0x5e8
[    0.790923] [<ffffff900811da18>] register_console+0x3b0/0x4d4
[    0.790935] [<ffffff90083aa7d0>] pstore_register+0x1a8/0x234
[    0.790947] [<ffffff90083ac250>] ramoops_probe+0x6b8/0x7d4
[    0.790961] [<ffffff90085ca548>] platform_drv_probe+0x7c/0xd0
[    0.790972] [<ffffff90085c76ac>] driver_probe_device+0x1b4/0x3bc
[    0.790982] [<ffffff90085c7ac8>] __device_attach_driver+0xc8/0xf4
[    0.790996] [<ffffff90085c4bfc>] bus_for_each_drv+0xb4/0xe4
[    0.791006] [<ffffff90085c7414>] __device_attach+0xd0/0x158
[    0.791016] [<ffffff90085c7b18>] device_initial_probe+0x24/0x30
[    0.791026] [<ffffff90085c648c>] bus_probe_device+0x50/0xe4
[    0.791038] [<ffffff90085c35b8>] device_add+0x3a4/0x76c
[    0.791051] [<ffffff90087d0e84>] of_device_add+0x74/0x84
[    0.791062] [<ffffff90087d19b8>] of_platform_device_create_pdata+0xc0/0x100
[    0.791073] [<ffffff90087d1a2c>] of_platform_device_create+0x34/0x40
[    0.791086] [<ffffff900903c910>] of_platform_default_populate_init+0x58/0x78
[    0.791097] [<ffffff90080831fc>] do_one_initcall+0x88/0x160
[    0.791109] [<ffffff90090010ac>] kernel_init_freeable+0x264/0x31c
[    0.791123] [<ffffff9008a25bd0>] kernel_init+0x18/0x11c
[    0.791133] [<ffffff9008082ec0>] ret_from_fork+0x10/0x50
[    0.793717] console [pstore-1] enabled
[    0.797845] pstore: Registered ramoops as persistent store backend
[    0.804647] ramoops: attached 0x100000@0xf7edc000, ecc: 0/0

Fixes: 663deb47880f ("pstore: Allow prz to control need for locking")
Fixes: 109704492ef6 ("pstore: Make spinlock per zone instead of global")
Reported-by: Brian Norris <briannorris@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-08-06 19:19:43 -07:00
Joel Fernandes
aca5b1e3c5 pstore: Allow prz to control need for locking
commit 663deb47880f2283809669563c5a52ac7c6aef1a upstream.

In preparation of not locking at all for certain buffers depending on if
there's contention, make locking optional depending on the initialization
of the prz.

Signed-off-by: Joel Fernandes <joelaf@google.com>
[kees: moved locking flag into prz instead of via caller arguments]
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-08-06 19:19:43 -07:00
Linus Torvalds
14ae9c4b5a Make file credentials available to the seqfile interfaces
commit 34dbbcdbf63360661ff7bda6c5f52f99ac515f92 upstream.

A lot of seqfile users seem to be using things like %pK that uses the
credentials of the current process, but that is actually completely
wrong for filesystem interfaces.

The unix semantics for permission checking files is to check permissions
at _open_ time, not at read or write time, and that is not just a small
detail: passing off stdin/stdout/stderr to a suid application and making
the actual IO happen in privileged context is a classic exploit
technique.

So if we want to be able to look at permissions at read time, we need to
use the file open credentials, not the current ones.  Normal file
accesses can just use "f_cred" (or any of the helper functions that do
that, like file_ns_capable()), but the seqfile interfaces do not have
any such options.

It turns out that seq_file _does_ save away the user_ns information of
the file, though.  Since user_ns is just part of the full credential
information, replace that special case with saving off the cred pointer
instead, and suddenly seq_file has all the permission information it
needs.

[sumits: this is used in Ubuntu as a fix for CVE-2015-8944]

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sumit Semwal <sumit.semwal@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-08-06 19:19:42 -07:00
Al Viro
407669f2c9 dentry name snapshots
commit 49d31c2f389acfe83417083e1208422b4091cd9e upstream.

take_dentry_name_snapshot() takes a safe snapshot of dentry name;
if the name is a short one, it gets copied into caller-supplied
structure, otherwise an extra reference to external name is grabbed
(those are never modified).  In either case the pointer to stable
string is stored into the same structure.

dentry must be held by the caller of take_dentry_name_snapshot(),
but may be freely dropped afterwards - the snapshot will stay
until destroyed by release_dentry_name_snapshot().

Intended use:
	struct name_snapshot s;

	take_dentry_name_snapshot(&s, dentry);
	...
	access s.name
	...
	release_dentry_name_snapshot(&s);

Replaces fsnotify_oldname_...(), gets used in fsnotify to obtain the name
to pass down with event.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-08-06 19:19:42 -07:00
Brian Foster
56548b6f50 xfs: don't BUG() on mixed direct and mapped I/O
commit 04197b341f23b908193308b8d63d17ff23232598 upstream.

We've had reports of generic/095 causing XFS to BUG() in
__xfs_get_blocks() due to the existence of delalloc blocks on a
direct I/O read. generic/095 issues a mix of various types of I/O,
including direct and memory mapped I/O to a single file. This is
clearly not supported behavior and is known to lead to such
problems. E.g., the lack of exclusion between the direct I/O and
write fault paths means that a write fault can allocate delalloc
blocks in a region of a file that was previously a hole after the
direct read has attempted to flush/inval the file range, but before
it actually reads the block mapping. In turn, the direct read
discovers a delalloc extent and cannot proceed.

While the appropriate solution here is to not mix direct and memory
mapped I/O to the same regions of the same file, the current
BUG_ON() behavior is probably overkill as it can crash the entire
system.  Instead, localize the failure to the I/O in question by
returning an error for a direct I/O that cannot be handled safely
due to delalloc blocks. Be careful to allow the case of a direct
write to post-eof delalloc blocks. This can occur due to speculative
preallocation and is safe as post-eof blocks are not accompanied by
dirty pages in pagecache (conversely, preallocation within eof must
have been zeroed, and thus dirtied, before the inode size could have
been increased beyond said blocks).

Finally, provide an additional warning if a direct I/O write occurs
while the file is memory mapped. This may not catch all problematic
scenarios, but provides a hint that some known-to-be-problematic I/O
methods are in use.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Acked-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-08-06 19:19:40 -07:00
Joel Fernandes
08408f7ae5 pstore: Make spinlock per zone instead of global
commit 109704492ef637956265ec2eb72ae7b3b39eb6f4 upstream.

Currently pstore has a global spinlock for all zones. Since the zones
are independent and modify different areas of memory, there's no need
to have a global lock, so we should use a per-zone lock as introduced
here. Also, when ramoops's ftrace use-case has a FTRACE_PER_CPU flag
introduced later, which splits the ftrace memory area into a single zone
per CPU, it will eliminate the need for locking. In preparation for this,
make the locking optional.

Signed-off-by: Joel Fernandes <joelaf@google.com>
[kees: updated commit message]
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Leo Yan <leo.yan@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-08-06 19:19:38 -07:00
Yan, Zheng
ba790013b5 ceph: fix race in concurrent readdir
commit 84583cfb973c4313955c6231cc9cb3772d280b15 upstream.

For a large directory, program needs to issue multiple readdir
syscalls to get all dentries. When there are multiple programs
read the directory concurrently. Following sequence of events
can happen.

 - program calls readdir with pos = 2. ceph sends readdir request
   to mds. The reply contains N1 entries. ceph adds these N1 entries
   to readdir cache.
 - program calls readdir with pos = N1+2. The readdir is satisfied
   by the readdir cache, N2 entries are returned. (Other program
   calls readdir in the middle, which fills the cache)
 - program calls readdir with pos = N1+N2+2. ceph sends readdir
   request to mds. The reply contains N3 entries and it reaches
   directory end. ceph adds these N3 entries to the readdir cache
   and marks directory complete.

The second readdir call does not update fi->readdir_cache_idx.
ceph add the last N3 entries to wrong places.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-27 15:06:09 -07:00
Jan Kara
f57b4ae0b7 udf: Fix deadlock between writeback and udf_setsize()
commit f2e95355891153f66d4156bf3a142c6489cd78c6 upstream.

udf_setsize() called truncate_setsize() with i_data_sem held. Thus
truncate_pagecache() called from truncate_setsize() could lock a page
under i_data_sem which can deadlock as page lock ranks below
i_data_sem - e. g. writeback can hold page lock and try to acquire
i_data_sem to map a block.

Fix the problem by moving truncate_setsize() calls from under
i_data_sem. It is safe for us to change i_size without holding
i_data_sem as all the places that depend on i_size being stable already
hold inode_lock.

Fixes: 7e49b6f248
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-27 15:06:09 -07:00
NeilBrown
d2fa4057b1 NFS: only invalidate dentrys that are clearly invalid.
commit cc89684c9a265828ce061037f1f79f4a68ccd3f7 upstream.

Since commit bafc9b754f ("vfs: More precise tests in d_invalidate")
in v3.18, a return of '0' from ->d_revalidate() will cause the dentry
to be invalidated even if it has filesystems mounted on or it or on a
descendant.  The mounted filesystem is unmounted.

This means we need to be careful not to return 0 unless the directory
referred to truly is invalid.  So -ESTALE or -ENOENT should invalidate
the directory.  Other errors such a -EPERM or -ERESTARTSYS should be
returned from ->d_revalidate() so they are propagated to the caller.

A particular problem can be demonstrated by:

1/ mount an NFS filesystem using NFSv3 on /mnt
2/ mount any other filesystem on /mnt/foo
3/ ls /mnt/foo
4/ turn off network, or otherwise make the server unable to respond
5/ ls /mnt/foo &
6/ cat /proc/$!/stack # note that nfs_lookup_revalidate is in the call stack
7/ kill -9 $! # this results in -ERESTARTSYS being returned
8/ observe that /mnt/foo has been unmounted.

This patch changes nfs_lookup_revalidate() to only treat
  -ESTALE from nfs_lookup_verify_inode() and
  -ESTALE or -ENOENT from ->lookup()
as indicating an invalid inode.  Other errors are returned.

Also nfs_check_inode_attributes() is changed to return -ESTALE rather
than -EIO.  This is consistent with the error returned in similar
circumstances from nfs_update_inode().

As this bug allows any user to unmount a filesystem mounted on an NFS
filesystem, this fix is suitable for stable kernels.

Fixes: bafc9b754f ("vfs: More precise tests in d_invalidate")
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-27 15:06:08 -07:00
Jaegeuk Kim
fca8859982 f2fs: Don't clear SGID when inheriting ACLs
commit c925dc162f770578ff4a65ec9b08270382dba9e6 upstream.

This patch copies commit b7f8a09f80:
"btrfs: Don't clear SGID when inheriting ACLs" written by Jan.

Fixes: 073931017b49d9458aa351605b43a7e34598caef
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-27 15:06:07 -07:00
Eric W. Biederman
f07288cfb0 mnt: Make propagate_umount less slow for overlapping mount propagation trees
commit 296990deb389c7da21c78030376ba244dc1badf5 upstream.

Andrei Vagin pointed out that time to executue propagate_umount can go
non-linear (and take a ludicrious amount of time) when the mount
propogation trees of the mounts to be unmunted by a lazy unmount
overlap.

Make the walk of the mount propagation trees nearly linear by
remembering which mounts have already been visited, allowing
subsequent walks to detect when walking a mount propgation tree or a
subtree of a mount propgation tree would be duplicate work and to skip
them entirely.

Walk the list of mounts whose propgatation trees need to be traversed
from the mount highest in the mount tree to mounts lower in the mount
tree so that odds are higher that the code will walk the largest trees
first, allowing later tree walks to be skipped entirely.

Add cleanup_umount_visitation to remover the code's memory of which
mounts have been visited.

Add the functions last_slave and skip_propagation_subtree to allow
skipping appropriate parts of the mount propagation tree without
needing to change the logic of the rest of the code.

A script to generate overlapping mount propagation trees:

$ cat runs.h
set -e
mount -t tmpfs zdtm /mnt
mkdir -p /mnt/1 /mnt/2
mount -t tmpfs zdtm /mnt/1
mount --make-shared /mnt/1
mkdir /mnt/1/1

iteration=10
if [ -n "$1" ] ; then
	iteration=$1
fi

for i in $(seq $iteration); do
	mount --bind /mnt/1/1 /mnt/1/1
done

mount --rbind /mnt/1 /mnt/2

TIMEFORMAT='%Rs'
nr=$(( ( 2 ** ( $iteration + 1 ) ) + 1 ))
echo -n "umount -l /mnt/1 -> $nr        "
time umount -l /mnt/1

nr=$(cat /proc/self/mountinfo | grep zdtm | wc -l )
time umount -l /mnt/2

$ for i in $(seq 9 19); do echo $i; unshare -Urm bash ./run.sh $i; done

Here are the performance numbers with and without the patch:

     mhash |  8192   |  8192  | 1048576 | 1048576
    mounts | before  | after  |  before | after
    ------------------------------------------------
      1025 |  0.040s | 0.016s |  0.038s | 0.019s
      2049 |  0.094s | 0.017s |  0.080s | 0.018s
      4097 |  0.243s | 0.019s |  0.206s | 0.023s
      8193 |  1.202s | 0.028s |  1.562s | 0.032s
     16385 |  9.635s | 0.036s |  9.952s | 0.041s
     32769 | 60.928s | 0.063s | 44.321s | 0.064s
     65537 |         | 0.097s |         | 0.097s
    131073 |         | 0.233s |         | 0.176s
    262145 |         | 0.653s |         | 0.344s
    524289 |         | 2.305s |         | 0.735s
   1048577 |         | 7.107s |         | 2.603s

Andrei Vagin reports fixing the performance problem is part of the
work to fix CVE-2016-6213.

Fixes: a05964f391 ("[PATCH] shared mounts handling: umount")
Reported-by: Andrei Vagin <avagin@openvz.org>
Reviewed-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-21 07:44:58 +02:00
Eric W. Biederman
fdb8f10499 mnt: In propgate_umount handle visiting mounts in any order
commit 99b19d16471e9c3faa85cad38abc9cbbe04c6d55 upstream.

While investigating some poor umount performance I realized that in
the case of overlapping mount trees where some of the mounts are locked
the code has been failing to unmount all of the mounts it should
have been unmounting.

This failure to unmount all of the necessary
mounts can be reproduced with:

$ cat locked_mounts_test.sh

mount -t tmpfs test-base /mnt
mount --make-shared /mnt
mkdir -p /mnt/b

mount -t tmpfs test1 /mnt/b
mount --make-shared /mnt/b
mkdir -p /mnt/b/10

mount -t tmpfs test2 /mnt/b/10
mount --make-shared /mnt/b/10
mkdir -p /mnt/b/10/20

mount --rbind /mnt/b /mnt/b/10/20

unshare -Urm --propagation unchaged /bin/sh -c 'sleep 5; if [ $(grep test /proc/self/mountinfo | wc -l) -eq 1 ] ; then echo SUCCESS ; else echo FAILURE ; fi'
sleep 1
umount -l /mnt/b
wait %%

$ unshare -Urm ./locked_mounts_test.sh

This failure is corrected by removing the prepass that marks mounts
that may be umounted.

A first pass is added that umounts mounts if possible and if not sets
mount mark if they could be unmounted if they weren't locked and adds
them to a list to umount possibilities.  This first pass reconsiders
the mounts parent if it is on the list of umount possibilities, ensuring
that information of umoutability will pass from child to mount parent.

A second pass then walks through all mounts that are umounted and processes
their children unmounting them or marking them for reparenting.

A last pass cleans up the state on the mounts that could not be umounted
and if applicable reparents them to their first parent that remained
mounted.

While a bit longer than the old code this code is much more robust
as it allows information to flow up from the leaves and down
from the trunk making the order in which mounts are encountered
in the umount propgation tree irrelevant.

Fixes: 0c56fe3142 ("mnt: Don't propagate unmounts to locked mounts")
Reviewed-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-21 07:44:57 +02:00
Eric W. Biederman
7cbc3955ef mnt: In umount propagation reparent in a separate pass
commit 570487d3faf2a1d8a220e6ee10f472163123d7da upstream.

It was observed that in some pathlogical cases that the current code
does not unmount everything it should.  After investigation it
was determined that the issue is that mnt_change_mntpoint can
can change which mounts are available to be unmounted during mount
propagation which is wrong.

The trivial reproducer is:
$ cat ./pathological.sh

mount -t tmpfs test-base /mnt
cd /mnt
mkdir 1 2 1/1
mount --bind 1 1
mount --make-shared 1
mount --bind 1 2
mount --bind 1/1 1/1
mount --bind 1/1 1/1
echo
grep test-base /proc/self/mountinfo
umount 1/1
echo
grep test-base /proc/self/mountinfo

$ unshare -Urm ./pathological.sh

The expected output looks like:
46 31 0:25 / /mnt rw,relatime - tmpfs test-base rw,uid=1000,gid=1000
47 46 0:25 /1 /mnt/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
48 46 0:25 /1 /mnt/2 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
49 54 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
50 53 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
51 49 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
54 47 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
53 48 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
52 50 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000

46 31 0:25 / /mnt rw,relatime - tmpfs test-base rw,uid=1000,gid=1000
47 46 0:25 /1 /mnt/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
48 46 0:25 /1 /mnt/2 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000

The output without the fix looks like:
46 31 0:25 / /mnt rw,relatime - tmpfs test-base rw,uid=1000,gid=1000
47 46 0:25 /1 /mnt/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
48 46 0:25 /1 /mnt/2 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
49 54 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
50 53 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
51 49 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
54 47 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
53 48 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
52 50 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000

46 31 0:25 / /mnt rw,relatime - tmpfs test-base rw,uid=1000,gid=1000
47 46 0:25 /1 /mnt/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
48 46 0:25 /1 /mnt/2 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
52 48 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000

That last mount in the output was in the propgation tree to be unmounted but
was missed because the mnt_change_mountpoint changed it's parent before the walk
through the mount propagation tree observed it.

Fixes: 1064f874abc0 ("mnt: Tuck mounts under others instead of creating shadow/side mounts.")
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Reviewed-by: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-21 07:44:57 +02:00
Kees Cook
86949eb964 exec: Limit arg stack to at most 75% of _STK_LIM
commit da029c11e6b12f321f36dac8771e833b65cec962 upstream.

To avoid pathological stack usage or the need to special-case setuid
execs, just limit all arg stack usage to at most 75% of _STK_LIM (6MB).

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-21 07:44:57 +02:00
Kees Cook
7eb968cd04 binfmt_elf: use ELF_ET_DYN_BASE only for PIE
commit eab09532d40090698b05a07c1c87f39fdbc5fab5 upstream.

The ELF_ET_DYN_BASE position was originally intended to keep loaders
away from ET_EXEC binaries.  (For example, running "/lib/ld-linux.so.2
/bin/cat" might cause the subsequent load of /bin/cat into where the
loader had been loaded.)

With the advent of PIE (ET_DYN binaries with an INTERP Program Header),
ELF_ET_DYN_BASE continued to be used since the kernel was only looking
at ET_DYN.  However, since ELF_ET_DYN_BASE is traditionally set at the
top 1/3rd of the TASK_SIZE, a substantial portion of the address space
is unused.

For 32-bit tasks when RLIMIT_STACK is set to RLIM_INFINITY, programs are
loaded above the mmap region.  This means they can be made to collide
(CVE-2017-1000370) or nearly collide (CVE-2017-1000371) with
pathological stack regions.

Lowering ELF_ET_DYN_BASE solves both by moving programs below the mmap
region in all cases, and will now additionally avoid programs falling
back to the mmap region by enforcing MAP_FIXED for program loads (i.e.
if it would have collided with the stack, now it will fail to load
instead of falling back to the mmap region).

To allow for a lower ELF_ET_DYN_BASE, loaders (ET_DYN without INTERP)
are loaded into the mmap region, leaving space available for either an
ET_EXEC binary with a fixed location or PIE being loaded into mmap by
the loader.  Only PIE programs are loaded offset from ELF_ET_DYN_BASE,
which means architectures can now safely lower their values without risk
of loaders colliding with their subsequently loaded programs.

For 64-bit, ELF_ET_DYN_BASE is best set to 4GB to allow runtimes to use
the entire 32-bit address space for 32-bit pointers.

Thanks to PaX Team, Daniel Micay, and Rik van Riel for inspiration and
suggestions on how to implement this solution.

Fixes: d1fd836dcf ("mm: split ET_DYN ASLR from mmap ASLR")
Link: http://lkml.kernel.org/r/20170621173201.GA114489@beast
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Daniel Micay <danielmicay@gmail.com>
Cc: Qualys Security Advisory <qsa@qualys.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dmitry Safonov <dsafonov@virtuozzo.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Grzegorz Andrejczuk <grzegorz.andrejczuk@intel.com>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Pratyush Anand <panand@redhat.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-21 07:44:57 +02:00
Sahitya Tummala
68b0f5d85b fs/dcache.c: fix spin lockup issue on nlru->lock
commit b17c070fb624cf10162cf92ea5e1ec25cd8ac176 upstream.

__list_lru_walk_one() acquires nlru spin lock (nlru->lock) for longer
duration if there are more number of items in the lru list.  As per the
current code, it can hold the spin lock for upto maximum UINT_MAX
entries at a time.  So if there are more number of items in the lru
list, then "BUG: spinlock lockup suspected" is observed in the below
path:

  spin_bug+0x90
  do_raw_spin_lock+0xfc
  _raw_spin_lock+0x28
  list_lru_add+0x28
  dput+0x1c8
  path_put+0x20
  terminate_walk+0x3c
  path_lookupat+0x100
  filename_lookup+0x6c
  user_path_at_empty+0x54
  SyS_faccessat+0xd0
  el0_svc_naked+0x24

This nlru->lock is acquired by another CPU in this path -

  d_lru_shrink_move+0x34
  dentry_lru_isolate_shrink+0x48
  __list_lru_walk_one.isra.10+0x94
  list_lru_walk_node+0x40
  shrink_dcache_sb+0x60
  do_remount_sb+0xbc
  do_emergency_remount+0xb0
  process_one_work+0x228
  worker_thread+0x2e0
  kthread+0xf4
  ret_from_fork+0x10

Fix this lockup by reducing the number of entries to be shrinked from
the lru list to 1024 at once.  Also, add cond_resched() before
processing the lru list again.

Link: http://marc.info/?t=149722864900001&r=1&w=2
Link: http://lkml.kernel.org/r/1498707575-2472-1-git-send-email-stummala@codeaurora.org
Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Suggested-by: Jan Kara <jack@suse.cz>
Suggested-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Alexander Polakov <apolyakov@beget.ru>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-21 07:44:56 +02:00
Chao Yu
ad5a88c54c ext4: check return value of kstrtoull correctly in reserved_clusters_store
commit 1ea1516fbbab2b30bf98c534ecaacba579a35208 upstream.

kstrtoull returns 0 on success, however, in reserved_clusters_store we
will return -EINVAL if kstrtoull returns 0, it makes us fail to update
reserved_clusters value through sysfs.

Fixes: 76d33bca55
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Miao Xie <miaoxie@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-15 11:57:50 +02:00
Andreas Gruenbacher
e952c291df gfs2: Fix glock rhashtable rcu bug
commit 961ae1d83d055a4b9ebbfb4cc8ca62ec1a7a3b74 upstream.

Before commit 88ffbf3e03 "GFS2: Use resizable hash table for glocks",
glocks were freed via call_rcu to allow reading the glock hashtable
locklessly using rcu.  This was then changed to free glocks immediately,
which made reading the glock hashtable unsafe.  Bring back the original
code for freeing glocks via call_rcu.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-15 11:57:46 +02:00
Christoph Hellwig
4043d5bca5 fs: completely ignore unknown open flags
commit 629e014bb8349fcf7c1e4df19a842652ece1c945 upstream.

Currently we just stash anything we got into file->f_flags, and the
report it in fcntl(F_GETFD).  This patch just clears out all unknown
flags so that we don't pass them to the fs or report them.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-15 11:57:44 +02:00
Christoph Hellwig
ccb973e681 fs: add a VALID_OPEN_FLAGS
commit 80f18379a7c350c011d30332658aa15fe49a8fa5 upstream.

Add a central define for all valid open flags, and use it in the uniqueness
check.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-15 11:57:44 +02:00
Junxiao Bi
404ef3b4bf ocfs2: o2hb: revert hb threshold to keep compatible
commit 33496c3c3d7b88dcbe5e55aa01288b05646c6aca upstream.

Configfs is the interface for ocfs2-tools to set configure to kernel and
$configfs_dir/cluster/$clustername/heartbeat/dead_threshold is the one
used to configure heartbeat dead threshold.  Kernel has a default value
of it but user can set O2CB_HEARTBEAT_THRESHOLD in /etc/sysconfig/o2cb
to override it.

Commit 45b997737a ("ocfs2/cluster: use per-attribute show and store
methods") changed heartbeat dead threshold name while ocfs2-tools did
not, so ocfs2-tools won't set this configurable and the default value is
always used.  So revert it.

Fixes: 45b997737a ("ocfs2/cluster: use per-attribute show and store methods")
Link: http://lkml.kernel.org/r/1490665245-15374-1-git-send-email-junxiao.bi@oracle.com
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Acked-by: Joseph Qi <jiangqi903@gmail.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-05 14:37:22 +02:00
Dave Kleikamp
878f37efac coredump: Ensure proper size of sparse core files
[ Upstream commit 4d22c75d4c7b5c5f4bd31054f09103ee490878fd ]

If the last section of a core file ends with an unmapped or zero page,
the size of the file does not correspond with the last dump_skip() call.
gdb complains that the file is truncated and can be confusing to users.

After all of the vma sections are written, make sure that the file size
is no smaller than the current file position.

This problem can be demonstrated with gdb's bigcore testcase on the
sparc architecture.

Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-05 14:37:20 +02:00
Liu Bo
6e1116a0b3 Btrfs: fix truncate down when no_holes feature is enabled
[ Upstream commit 91298eec05cd8d4e828cf7ee5d4a6334f70cf69a ]

For such a file mapping,

[0-4k][hole][8k-12k]

In NO_HOLES mode, we don't have the [hole] extent any more.
Commit c1aa45759e ("Btrfs: fix shrinking truncate when the no_holes feature is enabled")
 fixed disk isize not being updated in NO_HOLES mode when data is not flushed.

However, even if data has been flushed, we can still have trouble
in updating disk isize since we updated disk isize to 'start' of
the last evicted extent.

Reviewed-by: Chris Mason <clm@fb.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-05 14:37:18 +02:00
Kinglong Mee
5424427100 NFSv4: fix a reference leak caused WARNING messages
commit 366a1569bff3fe14abfdf9285e31e05e091745f5 upstream.

Because nfs4_opendata_access() has close the state when access is denied,
so the state isn't leak.
Rather than revert the commit a974deee47, I'd like clean the strange state close.

[ 1615.094218] ------------[ cut here ]------------
[ 1615.094607] WARNING: CPU: 0 PID: 23702 at lib/list_debug.c:31 __list_add_valid+0x8e/0xa0
[ 1615.094913] list_add double add: new=ffff9d7901d9f608, prev=ffff9d7901d9f608, next=ffff9d7901ee8dd0.
[ 1615.095458] Modules linked in: nfsv4(E) nfs(E) nfsd(E) tun bridge stp llc fuse ip_set nfnetlink vmw_vsock_vmci_transport vsock f2fs snd_seq_midi snd_seq_midi_event fscrypto coretemp ppdev crct10dif_pclmul crc32_pclmul ghash_clmulni_intel intel_rapl_perf vmw_balloon snd_ens1371 joydev gameport snd_ac97_codec ac97_bus snd_seq snd_pcm snd_rawmidi snd_timer snd_seq_device snd soundcore nfit parport_pc parport acpi_cpufreq tpm_tis tpm_tis_core tpm i2c_piix4 vmw_vmci shpchp auth_rpcgss nfs_acl lockd(E) grace sunrpc(E) xfs libcrc32c vmwgfx drm_kms_helper ttm drm crc32c_intel mptspi e1000 serio_raw scsi_transport_spi mptscsih mptbase ata_generic pata_acpi fjes [last unloaded: nfs]
[ 1615.097663] CPU: 0 PID: 23702 Comm: fstest Tainted: G        W   E   4.11.0-rc1+ #517
[ 1615.098015] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[ 1615.098807] Call Trace:
[ 1615.099183]  dump_stack+0x63/0x86
[ 1615.099578]  __warn+0xcb/0xf0
[ 1615.099967]  warn_slowpath_fmt+0x5f/0x80
[ 1615.100370]  __list_add_valid+0x8e/0xa0
[ 1615.100760]  nfs4_put_state_owner+0x75/0xc0 [nfsv4]
[ 1615.101136]  __nfs4_close+0x109/0x140 [nfsv4]
[ 1615.101524]  nfs4_close_state+0x15/0x20 [nfsv4]
[ 1615.101949]  nfs4_close_context+0x21/0x30 [nfsv4]
[ 1615.102691]  __put_nfs_open_context+0xb8/0x110 [nfs]
[ 1615.103155]  put_nfs_open_context+0x10/0x20 [nfs]
[ 1615.103586]  nfs4_file_open+0x13b/0x260 [nfsv4]
[ 1615.103978]  do_dentry_open+0x20a/0x2f0
[ 1615.104369]  ? nfs4_copy_file_range+0x30/0x30 [nfsv4]
[ 1615.104739]  vfs_open+0x4c/0x70
[ 1615.105106]  ? may_open+0x5a/0x100
[ 1615.105469]  path_openat+0x623/0x1420
[ 1615.105823]  do_filp_open+0x91/0x100
[ 1615.106174]  ? __alloc_fd+0x3f/0x170
[ 1615.106568]  do_sys_open+0x130/0x220
[ 1615.106920]  ? __put_cred+0x3d/0x50
[ 1615.107256]  SyS_open+0x1e/0x20
[ 1615.107588]  entry_SYSCALL_64_fastpath+0x1a/0xa9
[ 1615.107922] RIP: 0033:0x7fab599069b0
[ 1615.108247] RSP: 002b:00007ffcf0600d78 EFLAGS: 00000246 ORIG_RAX: 0000000000000002
[ 1615.108575] RAX: ffffffffffffffda RBX: 00007fab59bcfae0 RCX: 00007fab599069b0
[ 1615.108896] RDX: 0000000000000200 RSI: 0000000000000200 RDI: 00007ffcf060255e
[ 1615.109211] RBP: 0000000000040010 R08: 0000000000000000 R09: 0000000000000016
[ 1615.109515] R10: 00000000000006a1 R11: 0000000000000246 R12: 0000000000041000
[ 1615.109806] R13: 0000000000040010 R14: 0000000000001000 R15: 0000000000002710
[ 1615.110152] ---[ end trace 96ed63b1306bf2f3 ]---

Fixes: a974deee47 ("NFSv4: Fix memory and state leak in...")
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Cc: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-05 14:37:15 +02:00
Pavel Shilovsky
63ba840a53 CIFS: Improve readdir verbosity
commit dcd87838c06f05ab7650b249ebf0d5b57ae63e1e upstream.

Downgrade the loglevel for SMB2 to prevent filling the log
with messages if e.g. readdir was interrupted. Also make SMB2
and SMB1 codepaths do the same logging during readdir.

Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-29 12:48:51 +02:00
NeilBrown
b95aa98e77 autofs: sanity check status reported with AUTOFS_DEV_IOCTL_FAIL
commit 9fa4eb8e490a28de40964b1b0e583d8db4c7e57c upstream.

If a positive status is passed with the AUTOFS_DEV_IOCTL_FAIL ioctl,
autofs4_d_automount() will return

   ERR_PTR(status)

with that status to follow_automount(), which will then dereference an
invalid pointer.

So treat a positive status the same as zero, and map to ENOENT.

See comment in systemd src/core/automount.c::automount_send_ready().

Link: http://lkml.kernel.org/r/871sqwczx5.fsf@notabene.neil.brown.name
Signed-off-by: NeilBrown <neilb@suse.com>
Cc: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-29 12:48:50 +02:00
Kees Cook
1d3d0f8b7c fs/exec.c: account for argv/envp pointers
commit 98da7d08850fb8bdeb395d6368ed15753304aa0c upstream.

When limiting the argv/envp strings during exec to 1/4 of the stack limit,
the storage of the pointers to the strings was not included.  This means
that an exec with huge numbers of tiny strings could eat 1/4 of the stack
limit in strings and then additional space would be later used by the
pointers to the strings.

For example, on 32-bit with a 8MB stack rlimit, an exec with 1677721
single-byte strings would consume less than 2MB of stack, the max (8MB /
4) amount allowed, but the pointers to the strings would consume the
remaining additional stack space (1677721 * 4 == 6710884).

The result (1677721 + 6710884 == 8388605) would exhaust stack space
entirely.  Controlling this stack exhaustion could result in
pathological behavior in setuid binaries (CVE-2017-1000365).

[akpm@linux-foundation.org: additional commenting from Kees]
Fixes: b6a2fea393 ("mm: variable length argument support")
Link: http://lkml.kernel.org/r/20170622001720.GA32173@beast
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Qualys Security Advisory <qsa@qualys.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-29 12:48:50 +02:00
Hugh Dickins
4b35943067 mm: larger stack guard gap, between vmas
commit 1be7107fbe18eed3e319a6c3e83c78254b693acb upstream.

Stack guard page is a useful feature to reduce a risk of stack smashing
into a different mapping. We have been using a single page gap which
is sufficient to prevent having stack adjacent to a different mapping.
But this seems to be insufficient in the light of the stack usage in
userspace. E.g. glibc uses as large as 64kB alloca() in many commonly
used functions. Others use constructs liks gid_t buffer[NGROUPS_MAX]
which is 256kB or stack strings with MAX_ARG_STRLEN.

This will become especially dangerous for suid binaries and the default
no limit for the stack size limit because those applications can be
tricked to consume a large portion of the stack and a single glibc call
could jump over the guard page. These attacks are not theoretical,
unfortunatelly.

Make those attacks less probable by increasing the stack guard gap
to 1MB (on systems with 4k pages; but make it depend on the page size
because systems with larger base pages might cap stack allocations in
the PAGE_SIZE units) which should cover larger alloca() and VLA stack
allocations. It is obviously not a full fix because the problem is
somehow inherent, but it should reduce attack space a lot.

One could argue that the gap size should be configurable from userspace,
but that can be done later when somebody finds that the new 1MB is wrong
for some special case applications.  For now, add a kernel command line
option (stack_guard_gap) to specify the stack gap size (in page units).

Implementation wise, first delete all the old code for stack guard page:
because although we could get away with accounting one extra page in a
stack vma, accounting a larger gap can break userspace - case in point,
a program run with "ulimit -S -v 20000" failed when the 1MB gap was
counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK
and strict non-overcommit mode.

Instead of keeping gap inside the stack vma, maintain the stack guard
gap as a gap between vmas: using vm_start_gap() in place of vm_start
(or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few
places which need to respect the gap - mainly arch_get_unmapped_area(),
and and the vma tree's subtree_gap support for that.

Original-patch-by: Oleg Nesterov <oleg@redhat.com>
Original-patch-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[wt: backport to 4.11: adjust context]
[wt: backport to 4.9: adjust context ; kernel doc was not in admin-guide]
[wt: backport to 4.4: adjust context ; drop ppc hugetlb_radix changes]
Signed-off-by: Willy Tarreau <w@1wt.eu>
[gkh: minor build fixes for 4.4]
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-26 07:13:11 +02:00
Nicholas Bellinger
0ad134d81c configfs: Fix race between create_link and configfs_rmdir
commit ba80aa909c99802c428682c352b0ee0baac0acd3 upstream.

This patch closes a long standing race in configfs between
the creation of a new symlink in create_link(), while the
symlink target's config_item is being concurrently removed
via configfs_rmdir().

This can happen because the symlink target's reference
is obtained by config_item_get() in create_link() before
the CONFIGFS_USET_DROPPING bit set by configfs_detach_prep()
during configfs_rmdir() shutdown is actually checked..

This originally manifested itself on ppc64 on v4.8.y under
heavy load using ibmvscsi target ports with Novalink API:

[ 7877.289863] rpadlpar_io: slot U8247.22L.212A91A-V1-C8 added
[ 7879.893760] ------------[ cut here ]------------
[ 7879.893768] WARNING: CPU: 15 PID: 17585 at ./include/linux/kref.h:46 config_item_get+0x7c/0x90 [configfs]
[ 7879.893811] CPU: 15 PID: 17585 Comm: targetcli Tainted: G           O 4.8.17-customv2.22 #12
[ 7879.893812] task: c00000018a0d3400 task.stack: c0000001f3b40000
[ 7879.893813] NIP: d000000002c664ec LR: d000000002c60980 CTR: c000000000b70870
[ 7879.893814] REGS: c0000001f3b43810 TRAP: 0700   Tainted: G O     (4.8.17-customv2.22)
[ 7879.893815] MSR: 8000000000029033 <SF,EE,ME,IR,DR,RI,LE>  CR: 28222242  XER: 00000000
[ 7879.893820] CFAR: d000000002c664bc SOFTE: 1
                GPR00: d000000002c60980 c0000001f3b43a90 d000000002c70908 c0000000fbc06820
                GPR04: c0000001ef1bd900 0000000000000004 0000000000000001 0000000000000000
                GPR08: 0000000000000000 0000000000000001 d000000002c69560 d000000002c66d80
                GPR12: c000000000b70870 c00000000e798700 c0000001f3b43ca0 c0000001d4949d40
                GPR16: c00000014637e1c0 0000000000000000 0000000000000000 c0000000f2392940
                GPR20: c0000001f3b43b98 0000000000000041 0000000000600000 0000000000000000
                GPR24: fffffffffffff000 0000000000000000 d000000002c60be0 c0000001f1dac490
                GPR28: 0000000000000004 0000000000000000 c0000001ef1bd900 c0000000f2392940
[ 7879.893839] NIP [d000000002c664ec] config_item_get+0x7c/0x90 [configfs]
[ 7879.893841] LR [d000000002c60980] check_perm+0x80/0x2e0 [configfs]
[ 7879.893842] Call Trace:
[ 7879.893844] [c0000001f3b43ac0] [d000000002c60980] check_perm+0x80/0x2e0 [configfs]
[ 7879.893847] [c0000001f3b43b10] [c000000000329770] do_dentry_open+0x2c0/0x460
[ 7879.893849] [c0000001f3b43b70] [c000000000344480] path_openat+0x210/0x1490
[ 7879.893851] [c0000001f3b43c80] [c00000000034708c] do_filp_open+0xfc/0x170
[ 7879.893853] [c0000001f3b43db0] [c00000000032b5bc] do_sys_open+0x1cc/0x390
[ 7879.893856] [c0000001f3b43e30] [c000000000009584] system_call+0x38/0xec
[ 7879.893856] Instruction dump:
[ 7879.893858] 409d0014 38210030 e8010010 7c0803a6 4e800020 3d220000 e94981e0 892a0000
[ 7879.893861] 2f890000 409effe0 39200001 992a0000 <0fe00000> 4bffffd0 60000000 60000000
[ 7879.893866] ---[ end trace 14078f0b3b5ad0aa ]---

To close this race, go ahead and obtain the symlink's target
config_item reference only after the existing CONFIGFS_USET_DROPPING
check succeeds.

This way, if configfs_rmdir() wins create_link() will return -ENONET,
and if create_link() wins configfs_rmdir() will return -EBUSY.

Reported-by: Bryant G. Ly <bryantly@linux.vnet.ibm.com>
Tested-by: Bryant G. Ly <bryantly@linux.vnet.ibm.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-26 07:13:08 +02:00
Eric Dumazet
77d2b8dc95 proc: add a schedule point in proc_pid_readdir()
[ Upstream commit 3ba4bceef23206349d4130ddf140819b365de7c8 ]

We have seen proc_pid_readdir() invocations holding cpu for more than 50
ms.  Add a cond_resched() to be gentle with other tasks.

[akpm@linux-foundation.org: coding style fix]
Link: http://lkml.kernel.org/r/1484238380.15816.42.camel@edumazet-glaptop3.roam.corp.google.com
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-17 06:39:38 +02:00
Coly Li
202776694c romfs: use different way to generate fsid for BLOCK or MTD
[ Upstream commit f598f82e204ec0b17797caaf1b0311c52d43fb9a ]

Commit 8a59f5d252 ("fs/romfs: return f_fsid for statfs(2)") generates
a 64bit id from sb->s_bdev->bd_dev.  This is only correct when romfs is
defined with CONFIG_ROMFS_ON_BLOCK.  If romfs is only defined with
CONFIG_ROMFS_ON_MTD, sb->s_bdev is NULL, referencing sb->s_bdev->bd_dev
will triger an oops.

Richard Weinberger points out that when CONFIG_ROMFS_BACKED_BY_BOTH=y,
both CONFIG_ROMFS_ON_BLOCK and CONFIG_ROMFS_ON_MTD are defined.
Therefore when calling huge_encode_dev() to generate a 64bit id, I use
the follow order to choose parameter,

- CONFIG_ROMFS_ON_BLOCK defined
  use sb->s_bdev->bd_dev
- CONFIG_ROMFS_ON_BLOCK undefined and CONFIG_ROMFS_ON_MTD defined
  use sb->s_dev when,
- both CONFIG_ROMFS_ON_BLOCK and CONFIG_ROMFS_ON_MTD undefined
  leave id as 0

When CONFIG_ROMFS_ON_MTD is defined and sb->s_mtd is not NULL, sb->s_dev
is set to a device ID generated by MTD_BLOCK_MAJOR and mtd index,
otherwise sb->s_dev is 0.

This is a try-best effort to generate a uniq file system ID, if all the
above conditions are not meet, f_fsid of this romfs instance will be 0.
Generally only one romfs can be built on single MTD block device, this
method is enough to identify multiple romfs instances in a computer.

Link: http://lkml.kernel.org/r/1482928596-115155-1-git-send-email-colyli@suse.de
Signed-off-by: Coly Li <colyli@suse.de>
Reported-by: Nong Li <nongli1031@gmail.com>
Tested-by: Nong Li <nongli1031@gmail.com>
Cc: Richard Weinberger <richard.weinberger@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-17 06:39:38 +02:00
Chuck Lever
10bfb4c76c nfs: Fix "Don't increment lock sequence ID after NFS4ERR_MOVED"
[ Upstream commit 406dab8450ec76eca88a1af2fc15d18a2b36ca49 ]

Lock sequence IDs are bumped in decode_lock by calling
nfs_increment_seqid(). nfs_increment_sequid() does not use the
seqid_mutating_err() function fixed in commit 059aa7348241 ("Don't
increment lock sequence ID after NFS4ERR_MOVED").

Fixes: 059aa7348241 ("Don't increment lock sequence ID after ...")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Xuan Qi <xuan.qi@oracle.com>
Cc: stable@vger.kernel.org # v3.7+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-17 06:39:38 +02:00
David Howells
95a4659ee8 FS-Cache: Initialise stores_lock in netfs cookie
[ Upstream commit 62deb8187d116581c88c69a2dd9b5c16588545d4 ]

Initialise the stores_lock in fscache netfs cookies.  Technically, it
shouldn't be necessary, since the netfs cookie is an index and stores no
data, but initialising it anyway adds insignificant overhead.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Steve Dickson <steved@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-17 06:39:37 +02:00
David Howells
38481d7d43 fscache: Clear outstanding writes when disabling a cookie
[ Upstream commit 6bdded59c8933940ac7e5b416448276ac89d1144 ]

fscache_disable_cookie() needs to clear the outstanding writes on the
cookie it's disabling because they cannot be completed after.

Without this, fscache_nfs_open_file() gets stuck because it disables the
cookie when the file is opened for writing but can't uncache the pages till
afterwards - otherwise there's a race between the open routine and anyone
who already has it open R/O and is still reading from it.

Looking in /proc/pid/stack of the offending process shows:

[<ffffffffa0142883>] __fscache_wait_on_page_write+0x82/0x9b [fscache]
[<ffffffffa014336e>] __fscache_uncache_all_inode_pages+0x91/0xe1 [fscache]
[<ffffffffa01740fa>] nfs_fscache_open_file+0x59/0x9e [nfs]
[<ffffffffa01ccf41>] nfs4_file_open+0x17f/0x1b8 [nfsv4]
[<ffffffff8117350e>] do_dentry_open+0x16d/0x2b7
[<ffffffff811743ac>] vfs_open+0x5c/0x65
[<ffffffff81184185>] path_openat+0x785/0x8fb
[<ffffffff81184343>] do_filp_open+0x48/0x9e
[<ffffffff81174710>] do_sys_open+0x13b/0x1cb
[<ffffffff811747b9>] SyS_open+0x19/0x1b
[<ffffffff81001c44>] do_syscall_64+0x80/0x17a
[<ffffffff8165c2da>] return_from_SYSCALL_64+0x0/0x7a
[<ffffffffffffffff>] 0xffffffffffffffff

Reported-by: Jianhong Yin <jiyin@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Steve Dickson <steved@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-17 06:39:37 +02:00
David Howells
b421d230df fscache: Fix dead object requeue
[ Upstream commit e26bfebdfc0d212d366de9990a096665d5c0209a ]

Under some circumstances, an fscache object can become queued such that it
fscache_object_work_func() can be called once the object is in the
OBJECT_DEAD state.  This results in the kernel oopsing when it tries to
invoke the handler for the state (which is hard coded to 0x2).

The way this comes about is something like the following:

 (1) The object dispatcher is processing a work state for an object.  This
     is done in workqueue context.

 (2) An out-of-band event comes in that isn't masked, causing the object to
     be queued, say EV_KILL.

 (3) The object dispatcher finishes processing the current work state on
     that object and then sees there's another event to process, so,
     without returning to the workqueue core, it processes that event too.
     It then follows the chain of events that initiates until we reach
     OBJECT_DEAD without going through a wait state (such as
     WAIT_FOR_CLEARANCE).

     At this point, object->events may be 0, object->event_mask will be 0
     and oob_event_mask will be 0.

 (4) The object dispatcher returns to the workqueue processor, and in due
     course, this sees that the object's work item is still queued and
     invokes it again.

 (5) The current state is a work state (OBJECT_DEAD), so the dispatcher
     jumps to it - resulting in an OOPS.

When I'm seeing this, the work state in (1) appears to have been either
LOOK_UP_OBJECT or CREATE_OBJECT (object->oob_table is
fscache_osm_lookup_oob).

The window for (2) is very small:

 (A) object->event_mask is cleared whilst the event dispatch process is
     underway - though there's no memory barrier to force this to the top
     of the function.

     The window, therefore is from the time the object was selected by the
     workqueue processor and made requeueable to the time the mask was
     cleared.

 (B) fscache_raise_event() will only queue the object if it manages to set
     the event bit and the corresponding event_mask bit was set.

     The enqueuement is then deferred slightly whilst we get a ref on the
     object and get the per-CPU variable for workqueue congestion.  This
     slight deferral slightly increases the probability by allowing extra
     time for the workqueue to make the item requeueable.

Handle this by giving the dead state a processor function and checking the
for the dead state address rather than seeing if the processor function is
address 0x2.  The dead state processor function can then set a flag to
indicate that it's occurred and give a warning if it occurs more than once
per object.

If this race occurs, an oops similar to the following is seen (note the RIP
value):

BUG: unable to handle kernel NULL pointer dereference at 0000000000000002
IP: [<0000000000000002>] 0x1
PGD 0
Oops: 0010 [#1] SMP
Modules linked in: ...
CPU: 17 PID: 16077 Comm: kworker/u48:9 Not tainted 3.10.0-327.18.2.el7.x86_64 #1
Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 12/27/2015
Workqueue: fscache_object fscache_object_work_func [fscache]
task: ffff880302b63980 ti: ffff880717544000 task.ti: ffff880717544000
RIP: 0010:[<0000000000000002>]  [<0000000000000002>] 0x1
RSP: 0018:ffff880717547df8  EFLAGS: 00010202
RAX: ffffffffa0368640 RBX: ffff880edf7a4480 RCX: dead000000200200
RDX: 0000000000000002 RSI: 00000000ffffffff RDI: ffff880edf7a4480
RBP: ffff880717547e18 R08: 0000000000000000 R09: dfc40a25cb3a4510
R10: dfc40a25cb3a4510 R11: 0000000000000400 R12: 0000000000000000
R13: ffff880edf7a4510 R14: ffff8817f6153400 R15: 0000000000000600
FS:  0000000000000000(0000) GS:ffff88181f420000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000002 CR3: 000000000194a000 CR4: 00000000001407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Stack:
 ffffffffa0363695 ffff880edf7a4510 ffff88093f16f900 ffff8817faa4ec00
 ffff880717547e60 ffffffff8109d5db 00000000faa4ec18 0000000000000000
 ffff8817faa4ec18 ffff88093f16f930 ffff880302b63980 ffff88093f16f900
Call Trace:
 [<ffffffffa0363695>] ? fscache_object_work_func+0xa5/0x200 [fscache]
 [<ffffffff8109d5db>] process_one_work+0x17b/0x470
 [<ffffffff8109e4ac>] worker_thread+0x21c/0x400
 [<ffffffff8109e290>] ? rescuer_thread+0x400/0x400
 [<ffffffff810a5acf>] kthread+0xcf/0xe0
 [<ffffffff810a5a00>] ? kthread_create_on_node+0x140/0x140
 [<ffffffff816460d8>] ret_from_fork+0x58/0x90
 [<ffffffff810a5a00>] ? kthread_create_on_node+0x140/0x140

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Jeremy McNicoll <jeremymc@redhat.com>
Tested-by: Frank Sorenson <sorenson@redhat.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-17 06:39:36 +02:00
Sachin Prabhu
2ba464a4b7 Call echo service immediately after socket reconnect
commit b8c600120fc87d53642476f48c8055b38d6e14c7 upstream.

Commit 4fcd1813e640 ("Fix reconnect to not defer smb3 session reconnect
long after socket reconnect") changes the behaviour of the SMB2 echo
service and causes it to renegotiate after a socket reconnect. However
under default settings, the echo service could take up to 120 seconds to
be scheduled.

The patch forces the echo service to be called immediately resulting a
negotiate call being made immediately on reconnect.

Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Acked-by: Sachin Prabhu <sprabhu@redhat.com>
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-17 06:39:35 +02:00
Artem Savkov
bc5f31d34e Make __xfs_xattr_put_listen preperly report errors.
commit 791cc43b36eb1f88166c8505900cad1b43c7fe1a upstream.

Commit 2a6fba6 "xfs: only return -errno or success from attr ->put_listent"
changes the returnvalue of __xfs_xattr_put_listen to 0 in case when there is
insufficient space in the buffer assuming that setting context->count to -1
would be enough, but all of the ->put_listent callers only check seen_enough.
This results in a failed assertion:
XFS: Assertion failed: context->count >= 0, file: fs/xfs/xfs_xattr.c, line: 175
in insufficient buffer size case.

This is only reproducible with at least 2 xattrs and only when the buffer
gets depleted before the last one.

Furthermore if buffersize is such that it is enough to hold the last xattr's
name, but not enough to hold the sum of preceeding xattr names listxattr won't
fail with ERANGE, but will suceed returning last xattr's name without the
first character. The first character end's up overwriting data stored at
(context->alist - 1).

Signed-off-by: Artem Savkov <asavkov@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Cc: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-14 13:16:27 +02:00
Trond Myklebust
e8a1086ae1 NFSv4: Don't perform cached access checks before we've OPENed the file
commit 762674f86d0328d5dc923c966e209e1ee59663f2 upstream.

Donald Buczek reports that a nfs4 client incorrectly denies
execute access based on outdated file mode (missing 'x' bit).
After the mode on the server is 'fixed' (chmod +x) further execution
attempts continue to fail, because the nfs ACCESS call updates
the access parameter but not the mode parameter or the mode in
the inode.

The root cause is ultimately that the VFS is calling may_open()
before the NFS client has a chance to OPEN the file and hence revalidate
the access and attribute caches.

Al Viro suggests:
>>> Make nfs_permission() relax the checks when it sees MAY_OPEN, if you know
>>> that things will be caught by server anyway?
>>
>> That can work as long as we're guaranteed that everything that calls
>> inode_permission() with MAY_OPEN on a regular file will also follow up
>> with a vfs_open() or dentry_open() on success. Is this always the
>> case?
>
> 1) in do_tmpfile(), followed by do_dentry_open() (not reachable by NFS since
> it doesn't have ->tmpfile() instance anyway)
>
> 2) in atomic_open(), after the call of ->atomic_open() has succeeded.
>
> 3) in do_last(), followed on success by vfs_open()
>
> That's all.  All calls of inode_permission() that get MAY_OPEN come from
> may_open(), and there's no other callers of that puppy.

Reported-by: Donald Buczek <buczek@molgen.mpg.de>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=109771
Link: http://lkml.kernel.org/r/1451046656-26319-1-git-send-email-buczek@molgen.mpg.de
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Paul Menzel <pmenzel@molgen.mpg.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-14 13:16:26 +02:00
Trond Myklebust
5330208283 NFS: Ensure we revalidate attributes before using execute_ok()
commit 5c5fc09a1157a11dbe84e6421c3e0b37d05238cb upstream.

Donald Buczek reports that NFS clients can also report incorrect
results for access() due to lack of revalidation of attributes
before calling execute_ok().
Looking closely, it seems chdir() is afflicted with the same problem.

Fix is to ensure we call nfs_revalidate_inode_rcu() or
nfs_revalidate_inode() as appropriate before deciding to trust
execute_ok().

Reported-by: Donald Buczek <buczek@molgen.mpg.de>
Link: http://lkml.kernel.org/r/1451331530-3748-1-git-send-email-buczek@molgen.mpg.de
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Paul Menzel <pmenzel@molgen.mpg.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-14 13:16:26 +02:00
Jeff Mahoney
5c7955c872 btrfs: fix memory leak in update_space_info failure path
commit 896533a7da929136d0432713f02a3edffece2826 upstream.

If we fail to add the space_info kobject, we'll leak the memory
for the percpu counter.

Fixes: 6ab0a2029c (btrfs: publish allocation data in sysfs)
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-14 13:16:24 +02:00
David Sterba
cc8c67cadc btrfs: use correct types for page indices in btrfs_page_exists_in_range
commit cc2b702c52094b637a351d7491ac5200331d0445 upstream.

Variables start_idx and end_idx are supposed to hold a page index
derived from the file offsets. The int type is not the right one though,
offsets larger than 1 << 44 will get silently trimmed off the high bits.
(1 << 44 is 16TiB)

What can go wrong, if start is below the boundary and end gets trimmed:
- if there's a page after start, we'll find it (radix_tree_gang_lookup_slot)
- the final check "if (page->index <= end_idx)" will unexpectedly fail

The function will return false, ie. "there's no page in the range",
although there is at least one.

btrfs_page_exists_in_range is used to prevent races in:

* in hole punching, where we make sure there are not pages in the
  truncated range, otherwise we'll wait for them to finish and redo
  truncation, but we're going to replace the pages with holes anyway so
  the only problem is the intermediate state

* lock_extent_direct: we want to make sure there are no pages before we
  lock and start DIO, to prevent stale data reads

For practical occurence of the bug, there are several constaints.  The
file must be quite large, the affected range must cross the 16TiB
boundary and the internal state of the file pages and pending operations
must match.  Also, we must not have started any ordered data in the
range, otherwise we don't even reach the buggy function check.

DIO locking tries hard in several places to avoid deadlocks with
buffered IO and avoids waiting for ranges. The worst consequence seems
to be stale data read.

CC: Liu Bo <bo.li.liu@oracle.com>
Fixes: fc4adbff82 ("btrfs: Drop EXTENT_UPTODATE check in hole punching and direct locking")
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-14 13:16:24 +02:00
Al Viro
f0d2e15314 ufs_getfrag_block(): we only grab ->truncate_mutex on block creation path
commit 006351ac8ead0d4a67dd3845e3ceffe650a23212 upstream.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-14 13:16:24 +02:00
Al Viro
34aa71cbd4 ufs_extend_tail(): fix the braino in calling conventions of ufs_new_fragments()
commit 940ef1a0ed939c2ca029fca715e25e7778ce1e34 upstream.

... and it really needs splitting into "new" and "extend" cases, but that's for
later

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-14 13:16:24 +02:00
Al Viro
d6bd1e7ec7 ufs: set correct ->s_maxsize
commit 6b0d144fa758869bdd652c50aa41aaf601232550 upstream.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-14 13:16:24 +02:00
Al Viro
4c516dff07 ufs: restore maintaining ->i_blocks
commit eb315d2ae614493fd1ebb026c75a80573d84f7ad upstream.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-14 13:16:24 +02:00
Al Viro
1df45bb643 fix ufs_isblockset()
commit 414cf7186dbec29bd946c138d6b5c09da5955a08 upstream.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-14 13:16:24 +02:00
Al Viro
db9aafaf90 ufs: restore proper tail allocation
commit 8785d84d002c2ce0f68fbcd6c2c86be859802c7e upstream.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-14 13:16:24 +02:00
Fabian Frederick
044470266a fs: add i_blocksize()
commit 93407472a21b82f39c955ea7787e5bc7da100642 upstream.

Replace all 1 << inode->i_blkbits and (1 << inode->i_blkbits) in fs
branch.

This patch also fixes multiple checkpatch warnings: WARNING: Prefer
'unsigned int' to bare use of 'unsigned'

Thanks to Andrew Morton for suggesting more appropriate function instead
of macro.

[geliangtang@gmail.com: truncate: use i_blocksize()]
  Link: http://lkml.kernel.org/r/9c8b2cd83c8f5653805d43debde9fa8817e02fc4.1484895804.git.geliangtang@gmail.com
Link: http://lkml.kernel.org/r/1481319905-10126-1-git-send-email-fabf@skynet.be
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-14 13:16:24 +02:00
Jan Kara
daa1357ff3 ext4: fix fdatasync(2) after extent manipulation operations
commit 67a7d5f561f469ad2fa5154d2888258ab8e6df7c upstream.

Currently, extent manipulation operations such as hole punch, range
zeroing, or extent shifting do not record the fact that file data has
changed and thus fdatasync(2) has a work to do. As a result if we crash
e.g. after a punch hole and fdatasync, user can still possibly see the
punched out data after journal replay. Test generic/392 fails due to
these problems.

Fix the problem by properly marking that file data has changed in these
operations.

Fixes: a4bb6b64e3
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-14 13:16:22 +02:00