Commit graph

44860 commits

Author SHA1 Message Date
Chao Yu
076a6f32fe f2fs: support hot file extension
This patch supports to recognize hot file extension in f2fs, so that we
can allocate proper hot segment location for its data, which can lead to
better hot/cold seperation in filesystem.

In addition, we changes a bit on query/add/del operation method for
extension_list sysfs entry as below:

- Query: cat /sys/fs/f2fs/<disk>/extension_list
- Add: echo 'extension' > /sys/fs/f2fs/<disk>/extension_list
- Del: echo '!extension' > /sys/fs/f2fs/<disk>/extension_list
- Add: echo '[h/c]extension' > /sys/fs/f2fs/<disk>/extension_list
- Del: echo '[h/c]!extension' > /sys/fs/f2fs/<disk>/extension_list
- [h] means add/del hot file extension
- [c] means add/del cold file extension

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-04-08 03:51:08 -07:00
Chao Yu
58edcdbca6 f2fs: fix to avoid race in between atomic write and background GC
Sqlite user			Background GC
				- move_data_block
				  : move page #1
				 - f2fs_is_atomic_file
- f2fs_ioc_start_atomic_write
- f2fs_ioc_commit_atomic_write
 - commit_inmem_pages
   : commit page #1 & set node #2 dirty
				 - f2fs_submit_page_write
				  - f2fs_update_data_blkaddr
				   - set_page_dirty
				     : set node #2 dirty
 - f2fs_do_sync_file
  - fsync_node_pages
   : commit node #1 & node #2, then sudden power-cut

In a race case, we may check FI_ATOMIC_FILE flag before starting atomic
write flow, then we will commit meta data before data with reversed
order, after a sudden pow-cut, database transaction will be inconsistent.

So we'd better to exclude gc/atomic_write to each other by using lock
instead of flag checking.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-04-08 03:51:07 -07:00
Jaegeuk Kim
1e0aeb0af9 f2fs: do gc in greedy mode for whole range if gc_urgent mode is set
Otherwise, f2fs conducts GC on 8GB range only based on slow cost-benefit.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-04-08 03:51:06 -07:00
Jaegeuk Kim
10b2d001d6 f2fs: issue discard aggressively in the gc_urgent mode
This patch avoids to skip discard commands when user sets gc_urgent mode.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-04-08 03:51:06 -07:00
Jaegeuk Kim
a5052f32b9 f2fs: set readdir_ra by default
It gives general readdir improvement.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-04-08 03:51:05 -07:00
Jaegeuk Kim
1aa536a624 f2fs: add auto tuning for small devices
If f2fs is running on top of very small devices, it's worth to avoid abusing
free LBAs. In order to achieve that, this patch introduces some parameter
tuning.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-04-08 03:51:04 -07:00
Jaegeuk Kim
0ffdffc8f1 f2fs: add mount option for segment allocation policy
This patch adds an mount option, "alloc_mode=%s" having two options, "default"
and "reuse".

In "alloc_mode=reuse" case, f2fs starts to allocate segments from 0'th segment
all the time to reassign segments. It'd be useful for small-sized eMMC parts.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-04-08 03:51:03 -07:00
Jaegeuk Kim
b798298912 f2fs: don't stop GC if GC is contended
Let's do GC as much as possible, while gc_urgent is set.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-04-08 03:51:02 -07:00
Chao Yu
766d232169 f2fs: expose extension_list sysfs entry
This patch adds a sysfs entry 'extension_list' to support
query/add/del item in extension list.

Query:
cat /sys/fs/f2fs/<device>/extension_list

Add:
echo 'extension' > /sys/fs/f2fs/<device>/extension_list

Del:
echo '!extension' > /sys/fs/f2fs/<device>/extension_list

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-04-08 03:51:02 -07:00
Chao Yu
98b329de50 f2fs: fix to set KEEP_SIZE bit in f2fs_zero_range
As Jayashree Mohan reported:

A simple workload to reproduce this would be :
1. create foo
2. Write (8K - 16K)  // foo size = 16K now
3. fsync()
4. falloc zero_range , keep_size (4202496 - 4210688) // foo size must be 16K
5. fdatasync()
Crash now

On recovery, we see that the file size is 4210688 and not 16K, which
violates the semantics of keep_size flag. We have a test case to
reproduce this using CrashMonkey on 4.15 kernel. Try this out by
simply running :
 ./c_harness -f /dev/sda -d /dev/cow_ram0 -t f2fs -e 102400  -P -v
 tests/generic_468_zero.so

The root cause is that we miss to set KEEP_SIZE bit correctly in zero_range
when zeroing block cross EOF with FALLOC_FL_KEEP_SIZE, let's fix this
missing case.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-04-08 03:51:01 -07:00
Chao Yu
4d409fa334 f2fs: introduce sb_lock to make encrypt pwsalt update exclusive
f2fs_super_block.encrypt_pw_salt can be udpated and persisted
concurrently, result in getting different pwsalt in separated
threads, so let's introduce sb_lock to exclude concurrent
accessers.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-04-08 03:51:00 -07:00
Colin Ian King
1f6bac14c1 f2fs: remove redundant initialization of pointer 'p'
Pointer p is initialized with a value that is never read and is later
re-assigned a new value, hence the initialization is redundant and can
be removed.

Cleans up clang warning:
fs/f2fs/extent_cache.c:463:19: warning: Value stored to 'p' during
its initialization is never read

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-04-08 03:50:59 -07:00
Gao Xiang
946aefc754 f2fs: flush cp pack except cp pack 2 page at first
Previously, we attempt to flush the whole cp pack in a single bio,
however, when suddenly powering off at this time, we could get into
an extreme scenario that cp pack 1 page and cp pack 2 page are updated
and latest, but payload or current summaries are still partially
outdated. (see reliable write in the UFS specification)

This patch submits the whole cp pack except cp pack 2 page at first,
and then writes the cp pack 2 page with an extra independent
bio with pre-io barrier.

Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-04-08 03:50:58 -07:00
Sheng Yong
e5081a52ac f2fs: clean up f2fs_sb_has_xxx functions
This patch introduces F2FS_FEATURE_FUNCS to clean up the definitions of
different f2fs_sb_has_xxx functions.

Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-04-08 03:50:57 -07:00
Tiezhu Yang
a292477154 f2fs: remove redundant check of page type when submit bio
This patch removes redundant check of page type when submit bio to
make the logic more clear.

Signed-off-by: Tiezhu Yang <kernelpatch@126.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-04-08 03:50:57 -07:00
Chao Yu
190e64a819 f2fs: fix to handle looped node chain during recovery
There is no checksum in node block now, so bit-transition from hardware
can make node_footer.next_blkaddr being corrupted w/o any detection,
result in node chain becoming looped one.

For this condition, during recovery, in order to avoid running into dead
loop, let's detect it and just skip out.

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-04-08 03:50:56 -07:00
Jaegeuk Kim
889d980876 f2fs: handle quota for orphan inodes
This is to detect dquot_initialize errors early from evict_inode
for orphan inodes.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-04-08 03:50:55 -07:00
Hyunchul Lee
92b12bb1a2 f2fs: support passing down write hints to block layer with F2FS policy
Add 'whint_mode=fs-based' mount option. In this mode, F2FS passes
down write hints with its policy.

* whint_mode=fs-based. F2FS passes down hints with its policy.

User                  F2FS                     Block
----                  ----                     -----
                      META                     WRITE_LIFE_MEDIUM;
                      HOT_NODE                 WRITE_LIFE_NOT_SET
                      WARM_NODE                "
                      COLD_NODE                WRITE_LIFE_NONE
ioctl(COLD)           COLD_DATA                WRITE_LIFE_EXTREME
extension list        "                        "

-- buffered io
WRITE_LIFE_EXTREME    COLD_DATA                WRITE_LIFE_EXTREME
WRITE_LIFE_SHORT      HOT_DATA                 WRITE_LIFE_SHORT
WRITE_LIFE_NOT_SET    WARM_DATA                WRITE_LIFE_LONG
WRITE_LIFE_NONE       "                        "
WRITE_LIFE_MEDIUM     "                        "
WRITE_LIFE_LONG       "                        "

-- direct io
WRITE_LIFE_EXTREME    COLD_DATA                WRITE_LIFE_EXTREME
WRITE_LIFE_SHORT      HOT_DATA                 WRITE_LIFE_SHORT
WRITE_LIFE_NOT_SET    WARM_DATA                WRITE_LIFE_NOT_SET
WRITE_LIFE_NONE       "                        WRITE_LIFE_NONE
WRITE_LIFE_MEDIUM     "                        WRITE_LIFE_MEDIUM
WRITE_LIFE_LONG       "                        WRITE_LIFE_LONG

Many thanks to Chao Yu and Jaegeuk Kim for comments to
implement this patch.

Signed-off-by: Hyunchul Lee <cheol.lee@lge.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-04-08 03:50:54 -07:00
Hyunchul Lee
22fa74c2b0 f2fs: support passing down write hints given by users to block layer
Add the 'whint_mode' mount option that controls which write
hints are passed down to block layer. There are "off" and
"user-based" mode. The default mode is "off".

1) whint_mode=off. F2FS only passes down WRITE_LIFE_NOT_SET.

2) whint_mode=user-based. F2FS tries to pass down hints given
by users.

User                  F2FS                     Block
----                  ----                     -----
                      META                     WRITE_LIFE_NOT_SET
                      HOT_NODE                 "
                      WARM_NODE                "
                      COLD_NODE                "
ioctl(COLD)           COLD_DATA                WRITE_LIFE_EXTREME
extension list        "                        "

-- buffered io
WRITE_LIFE_EXTREME    COLD_DATA                WRITE_LIFE_EXTREME
WRITE_LIFE_SHORT      HOT_DATA                 WRITE_LIFE_SHORT
WRITE_LIFE_NOT_SET    WARM_DATA                WRITE_LIFE_NOT_SET
WRITE_LIFE_NONE       "                        "
WRITE_LIFE_MEDIUM     "                        "
WRITE_LIFE_LONG       "                        "

-- direct io
WRITE_LIFE_EXTREME    COLD_DATA                WRITE_LIFE_EXTREME
WRITE_LIFE_SHORT      HOT_DATA                 WRITE_LIFE_SHORT
WRITE_LIFE_NOT_SET    WARM_DATA                WRITE_LIFE_NOT_SET
WRITE_LIFE_NONE       "                        WRITE_LIFE_NONE
WRITE_LIFE_MEDIUM     "                        WRITE_LIFE_MEDIUM
WRITE_LIFE_LONG       "                        WRITE_LIFE_LONG

Many thanks to Chao Yu and Jaegeuk Kim for comments to
implement this patch.

Signed-off-by: Hyunchul Lee <cheol.lee@lge.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
[Jaegeuk Kim: avoid build warning]
[Chao Yu: fix to restore whint_mode in ->remount_fs]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-04-08 03:50:48 -07:00
Chao Yu
180900373e f2fs: fix to clear CP_TRIMMED_FLAG
Once CP_TRIMMED_FLAG is set, after a reboot, we will never issue discard
before LBA becomes invalid again, fix it by clearing the flag in
checkpoint without CP_TRIMMED reason.

Fixes: 1f43e2ad7bff ("f2fs: introduce CP_TRIMMED_FLAG to avoid unneeded discard")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-04-08 03:15:44 -07:00
Chao Yu
0671fae134 f2fs: support large nat bitmap
Previously, we will store all nat version bitmap in checkpoint pack block,
so our total node entry number has a limitation which caused total node
number can not exceed (3900 * 8) block * 455 node/block = 14196000. So
that once user wants to create more nodes in large size image, it becomes
a bottleneck, that's unreasonable.

This patch detects the new layout of nat/sit version bitmap in image in
order to enable supporting large nat bitmap.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-04-08 03:15:42 -07:00
Chao Yu
eceb943d5d f2fs: fix to check extent cache in f2fs_drop_extent_tree
If noextent_cache mount option is on, we will never initialize extent tree
in inode, but still we're going to access it in f2fs_drop_extent_tree,
result in kernel panic as below:

 BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
 IP: _raw_write_lock+0xc/0x30
 Call Trace:
  ? f2fs_drop_extent_tree+0x41/0x70 [f2fs]
  f2fs_fallocate+0x5a0/0xdd0 [f2fs]
  ? common_file_perm+0x47/0xc0
  ? apparmor_file_permission+0x1a/0x20
  vfs_fallocate+0x15b/0x290
  SyS_fallocate+0x44/0x70
  do_syscall_64+0x6e/0x160
  entry_SYSCALL64_slow_path+0x25/0x25

This patch fixes to check extent cache status before using in
f2fs_drop_extent_tree.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-04-08 03:15:40 -07:00
Chao Yu
2e2a339c98 f2fs: restrict inline_xattr_size configuration
This patch limits to enable inline_xattr_size mount option only if
both extra_attr and flexible_inline_xattr feature is on in current
image.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-04-08 03:15:38 -07:00
Yunlong Song
41dda11641 f2fs: fix heap mode to reset it back
Commit 7a20b8a61eff81bdb7097a578752a74860e9d142 ("f2fs: allocate node
and hot data in the beginning of partition") introduces another mount
option, heap, to reset it back. But it does not do anything for heap
mode, so fix it.

Cc: stable@vger.kernel.org
Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-04-08 03:15:36 -07:00
Sheng Yong
39575737bb f2fs: fix potential corruption in area before F2FS_SUPER_OFFSET
sb_getblk does not guarantee the buffer head is uptodate. If bh is not
uptodate, the data (may be used as boot code) in area before
F2FS_SUPER_OFFSET may get corrupted when super block is committed.

Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-04-08 03:15:33 -07:00
Eric Biggers
7e0e7995ee fscrypt: fix build with pre-4.6 gcc versions
gcc versions prior to 4.6 require an extra level of braces when using a
designated initializer for a member in an anonymous struct or union.
This caused a compile error with the 'struct qstr' initialization in
__fscrypt_encrypt_symlink().

Fix it by using QSTR_INIT().

Reported-by: Andrew Morton <akpm@linux-foundation.org>
Fixes: 76e81d6d5048 ("fscrypt: new helper functions for ->symlink()")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-04-08 03:05:28 -07:00
Andy Lutomirski
96450e0ffd fs/proc: Stop trying to report thread stacks
commit b18cb64ead400c01bf1580eeba330ace51f8087d upstream.

This reverts more of:

  b76437579d ("procfs: mark thread stack correctly in proc/<pid>/maps")

... which was partially reverted by:

  65376df58217 ("proc: revert /proc/<pid>/maps [stack:TID] annotation")

Originally, /proc/PID/task/TID/maps was the same as /proc/TID/maps.

In current kernels, /proc/PID/maps (or /proc/TID/maps even for
threads) shows "[stack]" for VMAs in the mm's stack address range.

In contrast, /proc/PID/task/TID/maps uses KSTK_ESP to guess the
target thread's stack's VMA.  This is racy, probably returns garbage
and, on arches with CONFIG_TASK_INFO_IN_THREAD=y, is also crash-prone:
KSTK_ESP is not safe to use on tasks that aren't known to be running
ordinary process-context kernel code.

This patch removes the difference and just shows "[stack]" for VMAs
in the mm's stack range.  This is IMO much more sensible -- the
actual "stack" address really is treated specially by the VM code,
and the current thread stack isn't even well-defined for programs
that frequently switch stacks on their own.

Reported-by: Jann Horn <jann@thejh.net>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Linux API <linux-api@vger.kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tycho Andersen <tycho.andersen@canonical.com>
Link: http://lkml.kernel.org/r/3e678474ec14e0a0ec34c611016753eea2e1b8ba.1475257877.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-08 11:52:00 +02:00
Mark Charlebois
0073296323 fs: compat: Remove warning from COMPATIBLE_IOCTL
commit 9280cdd6fe5b8287a726d24cc1d558b96c8491d7 upstream.

cmd in COMPATIBLE_IOCTL is always a u32, so cast it so there isn't a
warning about an overflow in XFORM.

From: Mark Charlebois <charlebm@gmail.com>
Signed-off-by: Mark Charlebois <charlebm@gmail.com>
Signed-off-by: Behan Webster <behanw@converseincode.com>
Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-08 11:51:57 +02:00
Eric Biggers
31d3279a4f fscrypt: fix up fscrypt_fname_encrypted_size() for internal use
Filesystems don't need fscrypt_fname_encrypted_size() anymore, so
unexport it and move it to fscrypt_private.h.

We also never calculate the encrypted size of a filename without having
the fscrypt_info present since it is needed to know the amount of
NUL-padding which is determined by the encryption policy, and also we
will always truncate the NUL-padding to the maximum filename length.
Therefore, also make fscrypt_fname_encrypted_size() assume that the
fscrypt_info is present, and make it truncate the returned length to the
specified max_len.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-04-08 02:27:21 -07:00
Eric Biggers
82bec88856 fscrypt: define fscrypt_fname_alloc_buffer() to be for presented names
Previously fscrypt_fname_alloc_buffer() was used to allocate buffers for
both presented (decrypted or encoded) and encrypted filenames.  That was
confusing, because it had to allocate the worst-case size for either,
e.g. including NUL-padding even when it was meaningless.

But now that fscrypt_setup_filename() no longer calls it, it is only
used in the ->get_link() and ->readdir() paths, which specifically want
a buffer for presented filenames.  Therefore, switch the behavior over
to allocating the buffer for presented filenames only.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-04-08 02:27:20 -07:00
Eric Biggers
168a907828 fscrypt: calculate NUL-padding length in one place only
Currently, when encrypting a filename (either a real filename or a
symlink target) we calculate the amount of NUL-padding twice: once
before encryption and once during encryption in fname_encrypt().  It is
needed before encryption to allocate the needed buffer size as well as
calculate the size the symlink target will take up on-disk before
creating the symlink inode.  Calculating the size during encryption as
well is redundant.

Remove this redundancy by always calculating the exact size beforehand,
and making fname_encrypt() just add as much NUL padding as is needed to
fill the output buffer.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-04-08 02:27:19 -07:00
Eric Biggers
042ae9f4cf fscrypt: move fscrypt_symlink_data to fscrypt_private.h
Now that all filesystems have been converted to use the symlink helper
functions, they no longer need the declaration of 'struct
fscrypt_symlink_data'.  Move it from fscrypt.h to fscrypt_private.h.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-04-08 02:27:18 -07:00
Eric Biggers
f9550c24c2 fscrypt: remove fscrypt_fname_usr_to_disk()
fscrypt_fname_usr_to_disk() sounded very generic but was actually only
used to encrypt symlinks.  Remove it now that all filesystems have been
switched over to fscrypt_encrypt_symlink().

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-04-08 02:27:18 -07:00
Eric Biggers
7ac4756a24 f2fs: switch to fscrypt_get_symlink()
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-04-08 02:27:16 -07:00
Eric Biggers
6b76f58e24 f2fs: switch to fscrypt ->symlink() helper functions
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-04-08 02:26:49 -07:00
Eric Biggers
fd457d2c4e fscrypt: new helper function - fscrypt_get_symlink()
Filesystems also have duplicate code to support ->get_link() on
encrypted symlinks.  Factor it out into a new function
fscrypt_get_symlink().  It takes in the contents of the encrypted
symlink on-disk and provides the target (decrypted or encoded) that
should be returned from ->get_link().

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-04-08 02:26:47 -07:00
Eric Biggers
a1cdacb7ae fscrypt: new helper functions for ->symlink()
Currently, filesystems supporting fscrypt need to implement some tricky
logic when creating encrypted symlinks, including handling a peculiar
on-disk format (struct fscrypt_symlink_data) and correctly calculating
the size of the encrypted symlink.  Introduce helper functions to make
things a bit easier:

- fscrypt_prepare_symlink() computes and validates the size the symlink
  target will require on-disk.
- fscrypt_encrypt_symlink() creates the encrypted target if needed.

The new helpers actually fix some subtle bugs.  First, when checking
whether the symlink target was too long, filesystems didn't account for
the fact that the NUL padding is meant to be truncated if it would cause
the maximum length to be exceeded, as is done for filenames in
directories.  Consequently users would receive ENAMETOOLONG when
creating symlinks close to what is supposed to be the maximum length.
For example, with EXT4 with a 4K block size, the maximum symlink target
length in an encrypted directory is supposed to be 4093 bytes (in
comparison to 4095 in an unencrypted directory), but in
FS_POLICY_FLAGS_PAD_32-mode only up to 4064 bytes were accepted.

Second, symlink targets of "." and ".." were not being encrypted, even
though they should be, as these names are special in *directory entries*
but not in symlink targets.  Fortunately, we can fix this simply by
starting to encrypt them, as old kernels already accept them in
encrypted form.

Third, the output string length the filesystems were providing when
doing the actual encryption was incorrect, as it was forgotten to
exclude 'sizeof(struct fscrypt_symlink_data)'.  Fortunately though, this
bug didn't make a difference.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-04-08 02:11:03 -07:00
Eric Biggers
7f43602f4d fscrypt: trim down fscrypt.h includes
fscrypt.h included way too many other headers, given that it is included
by filesystems both with and without encryption support.  Trim down the
includes list by moving the needed includes into more appropriate
places, and removing the unneeded ones.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-04-08 02:11:00 -07:00
Eric Biggers
d9cadc11bd fscrypt: move fscrypt_is_dot_dotdot() to fs/crypto/fname.c
Only fs/crypto/fname.c cares about treating the "." and ".." filenames
specially with regards to encryption, so move fscrypt_is_dot_dotdot()
from fscrypt.h to there.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-04-08 02:10:50 -07:00
Eric Biggers
e6fe930580 fscrypt: move fscrypt_valid_enc_modes() to fscrypt_private.h
The encryption modes are validated by fs/crypto/, not by individual
filesystems.  Therefore, move fscrypt_valid_enc_modes() from fscrypt.h
to fscrypt_private.h.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-04-08 02:10:15 -07:00
Eric Biggers
8216a0b51a fscrypt: move fscrypt_info_cachep declaration to fscrypt_private.h
The fscrypt_info kmem_cache is internal to fscrypt; filesystems don't
need to access it.  So move its declaration into fscrypt_private.h.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-04-08 02:08:02 -07:00
Ritesh Harjani
86f9f957c5 ANDROID: fuse: Add null terminator to path in canonical path to avoid issue
page allocated in fuse_dentry_canonical_path to be handled in
fuse_dev_do_write is allocated using __get_free_pages(GFP_KERNEL).
This may not return a page with data filled with 0. Now this
page may not have a null terminator at all.
If this happens and userspace fuse daemon screws up by passing a string
to kernel which is not NULL terminated (or did not fill anything),
then inside fuse driver in kernel when we try to do
strlen(fuse_dev_write->kern_path->getname_kernel)
on that page data -> it may give us issue with kernel paging request.

Unable to handle kernel paging request at virtual address
------------[ cut here ]------------
<..>
PC is at strlen+0x10/0x90
LR is at getname_kernel+0x2c/0xf4
<..>
strlen+0x10/0x90
kern_path+0x28/0x4c
fuse_dev_do_write+0x5b8/0x694
fuse_dev_write+0x74/0x94
do_iter_readv_writev+0x80/0xb8
do_readv_writev+0xec/0x1cc
vfs_writev+0x54/0x64
SyS_writev+0x64/0xe4
el0_svc_naked+0x24/0x28

To avoid this we should ensure in case of FUSE_CANONICAL_PATH,
the page is null terminated.

Change-Id: I33ca7cc76b4472eaa982c67bb20685df451121f5
Signed-off-by: Ritesh Harjani <riteshh@codeaurora.org>
Bug: 75984715
[Daniel - small edit, using args size ]
Signed-off-by: Daniel Rosenberg <drosen@google.com>
2018-04-06 18:38:43 -07:00
Ritesh Harjani
13172f49e6 ANDROID: sdcardfs: Fix sdcardfs to stop creating cases-sensitive duplicate entries.
sdcardfs_name_match gets a 'name' argument from the underlying FS.
This need not be null terminated string.
So in sdcardfs_name_match -> qstr_case_eq -> we should use
str_n_case_eq.

This happens because few of the entries in lower level FS may not be
NULL terminated and may have some garbage characters passed while
doing sdcardfs_name_match.

For e.g.
 # dmesg |grep Download
 [  103.646386] sdcardfs_name_match: q1->name=.nomedia, q1->len=8,
 q2->name=Download\x17\x80\x03, q2->len=8
 [  104.021340] sdcardfs_name_match: q1->name=.nomedia, q1->len=8,
 q2->name=Download\x17\x80\x03, q2->len=8
 [  105.196864] sdcardfs_name_match: q1->name=.nomedia, q1->len=8,
 q2->name=Download\x17\x80\x03, q2->len=8
 [  109.113521] sdcardfs_name_match: q1->name=logs, q1->len=4,
 q2->name=Download\x17\x80\x03, q2->len=8

Now when we try to create a directory with different case for a such
files. SDCARDFS creates a entry if it could not find the underlying
entry in it's dcache.

To reproduce:-
1. bootup the device wait for some time after sdcardfs mounting to
   complete.
2. cd /storage/emulated/0
3. echo 3 > /proc/sys/vm/drop_caches
4. mkdir download

We now start seeing two entries with name.
Download & download.

Change-Id: I976d92a220a607dd8cdb96c01c2041c5c2bc3326
Signed-off-by: Ritesh Harjani <riteshh@codeaurora.org>
bug: 75987238
2018-04-06 15:31:18 -07:00
Greg Kroah-Hartman
38f41ec1cb This is the 4.4.125 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlq7xXcACgkQONu9yGCS
 aT5Syw/+K3OYEdVvooPrlPKHlCey8PUP8BuX7dYJ9eiRD+oxbyPKewNEIJyx0eUo
 3OkdCArHt3I2+XdN8wEmhpOWVVElE6lg81Gyn+gWxprKac60A5O924lsXx8uULrQ
 VSDE2+BQ6V45vzGUoqzSocA3Au8tqiP9Py9V7BSj0ADDI5LQLL+J9vp7+8sRlJWq
 F2nQxV3vKc036Da7XbPpnAxds/p9B+CoglThzJMhhqMYxfsvvMHTnV28hZqGQvHa
 /yuUBAAx/9Q0lXFimyu9Gj1pFl5KA0mLHZo0UZWg1wR/3edaB9Dor/NfY9zENWtw
 W1+vyK5zq3zku25SjVgpz8W19rcB9T4jL7capNHB7lJTpM9LncX3UdsyuC1WY+gW
 lg/sS0ESRWMJrgO+4JmXWehTYBLCNn9pqC7jxZSo/Fq3zMZWrkBx/53Mrku4LBQa
 f0sUzwvGQ9vmgWVgdbIA123A2b444tLvG2BY3d2Pq6naBE3/Xgkr58XJhNpZ8E6K
 WSIuxi60Vh9xhaXseYZBQpEGriWtwCljPaCP93k2DmksbwIP7r1UhkQ8lk3iTh2V
 1+VYX9iv2+ceL3LPc6zmwbSYXP8czkB8F5okRH7lb9ykREtHXqleeHSX5fwdLik1
 Y5rCrdFVw9avKjfNIg/gPQRFYN3KQbgUt/E4bum3w5Oo9VTdeTk=
 =AHKD
 -----END PGP SIGNATURE-----

Merge 4.4.125 into android-4.4

Changes in 4.4.125
	MIPS: ralink: Remove ralink_halt()
	iio: st_pressure: st_accel: pass correct platform data to init
	ALSA: usb-audio: Fix parsing descriptor of UAC2 processing unit
	ALSA: aloop: Sync stale timer before release
	ALSA: aloop: Fix access to not-yet-ready substream via cable
	ALSA: hda/realtek - Always immediately update mute LED with pin VREF
	mmc: dw_mmc: fix falling from idmac to PIO mode when dw_mci_reset occurs
	PCI: Add function 1 DMA alias quirk for Highpoint RocketRAID 644L
	ahci: Add PCI-id for the Highpoint Rocketraid 644L card
	clk: bcm2835: Protect sections updating shared registers
	Bluetooth: btusb: Fix quirk for Atheros 1525/QCA6174
	libata: fix length validation of ATAPI-relayed SCSI commands
	libata: remove WARN() for DMA or PIO command without data
	libata: Apply NOLPM quirk to Crucial MX100 512GB SSDs
	libata: disable LPM for Crucial BX100 SSD 500GB drive
	libata: Enable queued TRIM for Samsung SSD 860
	libata: Apply NOLPM quirk to Crucial M500 480 and 960GB SSDs
	libata: Make Crucial BX100 500GB LPM quirk apply to all firmware versions
	libata: Modify quirks for MX100 to limit NCQ_TRIM quirk to MU01 version
	mm/vmalloc: add interfaces to free unmapped page table
	x86/mm: implement free pmd/pte page interfaces
	drm/vmwgfx: Fix a destoy-while-held mutex problem.
	drm/radeon: Don't turn off DP sink when disconnected
	drm: udl: Properly check framebuffer mmap offsets
	acpi, numa: fix pxm to online numa node associations
	brcmfmac: fix P2P_DEVICE ethernet address generation
	rtlwifi: rtl8723be: Fix loss of signal
	tracing: probeevent: Fix to support minus offset from symbol
	mtd: nand: fsl_ifc: Fix nand waitfunc return value
	staging: ncpfs: memory corruption in ncp_read_kernel()
	can: cc770: Fix stalls on rt-linux, remove redundant IRQ ack
	can: cc770: Fix queue stall & dropped RTR reply
	can: cc770: Fix use after free in cc770_tx_interrupt()
	tty: vt: fix up tabstops properly
	kvm/x86: fix icebp instruction handling
	x86/build/64: Force the linker to use 2MB page size
	x86/boot/64: Verify alignment of the LOAD segment
	x86/entry/64: Don't use IST entry for #BP stack
	perf/x86/intel: Don't accidentally clear high bits in bdw_limit_period()
	staging: lustre: ptlrpc: kfree used instead of kvfree
	kbuild: disable clang's default use of -fmerge-all-constants
	bpf: skip unnecessary capability check
	bpf, x64: increase number of passes
	Linux 4.4.125

Change-Id: I14b307cd27ff088800174c74819a3ff1790b41ce
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2018-03-29 11:37:54 +02:00
Dan Carpenter
badf74b65f staging: ncpfs: memory corruption in ncp_read_kernel()
commit 4c41aa24baa4ed338241d05494f2c595c885af8f upstream.

If the server is malicious then *bytes_read could be larger than the
size of the "target" buffer.  It would lead to memory corruption when we
do the memcpy().

Reported-by: Dr Silvio Cesare of InfoSect <Silvio Cesare <silvio.cesare@gmail.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: stable <stable@vger.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-28 18:40:15 +02:00
Greg Kroah-Hartman
851fb4da32 This is the 4.4.124 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlq2IVkACgkQONu9yGCS
 aT5SGQ//eDH59qGQJFy8GwDQULnYV8JEeBZT2tIzx2LDVJvKKz/8BkJmcGZrQ0gH
 9EnMCiPkbxRH6HV6Cu1SHcyLFK+iVXGEw28Uk/sEcesQSZvrRxUKIgRLvyWhANgP
 E1FiPbI6V1tXTROmUk/lrGZYgHMVUMAa+kYRndpbijtB++7qO25mVngP0Yg6OveO
 oyw0MRx+v9mwQS/SID2EtlXnZOVGM+7kd3gg5fLY5w0qJT1jHjhrtZNWyKH2SMrS
 QnzkFDMGDIWk5TrHJY0LEvpZI/toKNtPrG/Mt5gOvtSgjmQ4+EEVndrlutPAOa3K
 xF6ERjtyBIRRG2ftk122fm6brjpxyCoqycD8JgCm1BxalNN7Bg+1ogQCGwE0EWNp
 yYEsxLmSbLllX/4rkZtx+WsgiG4rwNWG1/IsPsEo80La3M53WzA3My8E0aVLpbIj
 aqkzR62lZ1TSRgtbFjm6Bl4DtpJH4f9uELbO0VBi7b8LRTUI99Qcz4bOoKH+WKOC
 umvHsuyMwDm8wc0KhQ4hdyGb1le0nS4Y1xydp1p0pcASLfeA2VOIg20ZvxQVBTZA
 rlHo7zEQVkVYvphoPONUodOUVZ8/AGxoVvim3HlI6s5VvtdF6k/hQUWaOuWDh59J
 bjw2vnMWOLWSqmmkpK9HmU6ohIV310TJewPu8BShzAuZRSNDxl8=
 =OCjx
 -----END PGP SIGNATURE-----

Merge 4.4.124 into android-4.4

Changes in 4.4.124
	tpm: fix potential buffer overruns caused by bit glitches on the bus
	tpm_tis: fix potential buffer overruns caused by bit glitches on the bus
	SMB3: Validate negotiate request must always be signed
	CIFS: Enable encryption during session setup phase
	staging: android: ashmem: Fix possible deadlock in ashmem_ioctl
	platform/x86: asus-nb-wmi: Add wapf4 quirk for the X302UA
	regulator: anatop: set default voltage selector for pcie
	x86: i8259: export legacy_pic symbol
	rtc: cmos: Do not assume irq 8 for rtc when there are no legacy irqs
	Input: ar1021_i2c - fix too long name in driver's device table
	time: Change posix clocks ops interfaces to use timespec64
	ACPI/processor: Fix error handling in __acpi_processor_start()
	ACPI/processor: Replace racy task affinity logic
	cpufreq/sh: Replace racy task affinity logic
	genirq: Use irqd_get_trigger_type to compare the trigger type for shared IRQs
	i2c: i2c-scmi: add a MS HID
	net: ipv6: send unsolicited NA on admin up
	media/dvb-core: Race condition when writing to CAM
	spi: dw: Disable clock after unregistering the host
	ath: Fix updating radar flags for coutry code India
	clk: ns2: Correct SDIO bits
	scsi: virtio_scsi: Always try to read VPD pages
	KVM: PPC: Book3S PR: Exit KVM on failed mapping
	ARM: 8668/1: ftrace: Fix dynamic ftrace with DEBUG_RODATA and !FRAME_POINTER
	iommu/omap: Register driver before setting IOMMU ops
	md/raid10: wait up frozen array in handle_write_completed
	NFS: Fix missing pg_cleanup after nfs_pageio_cond_complete()
	tcp: remove poll() flakes with FastOpen
	e1000e: fix timing for 82579 Gigabit Ethernet controller
	ALSA: hda - Fix headset microphone detection for ASUS N551 and N751
	IB/ipoib: Fix deadlock between ipoib_stop and mcast join flow
	IB/ipoib: Update broadcast object if PKey value was changed in index 0
	HSI: ssi_protocol: double free in ssip_pn_xmit()
	IB/mlx4: Take write semaphore when changing the vma struct
	IB/mlx4: Change vma from shared to private
	ASoC: Intel: Skylake: Uninitialized variable in probe_codec()
	Fix driver usage of 128B WQEs when WQ_CREATE is V1.
	netfilter: xt_CT: fix refcnt leak on error path
	openvswitch: Delete conntrack entry clashing with an expectation.
	mmc: host: omap_hsmmc: checking for NULL instead of IS_ERR()
	wan: pc300too: abort path on failure
	qlcnic: fix unchecked return value
	scsi: mac_esp: Replace bogus memory barrier with spinlock
	infiniband/uverbs: Fix integer overflows
	NFS: don't try to cross a mountpount when there isn't one there.
	iio: st_pressure: st_accel: Initialise sensor platform data properly
	mt7601u: check return value of alloc_skb
	rndis_wlan: add return value validation
	Btrfs: send, fix file hole not being preserved due to inline extent
	mac80211: don't parse encrypted management frames in ieee80211_frame_acked
	mfd: palmas: Reset the POWERHOLD mux during power off
	mtip32xx: use runtime tag to initialize command header
	staging: unisys: visorhba: fix s-Par to boot with option CONFIG_VMAP_STACK set to y
	staging: wilc1000: fix unchecked return value
	mmc: sdhci-of-esdhc: limit SD clock for ls1012a/ls1046a
	ARM: DRA7: clockdomain: Change the CLKTRCTRL of CM_PCIE_CLKSTCTRL to SW_WKUP
	ipmi/watchdog: fix wdog hang on panic waiting for ipmi response
	ACPI / PMIC: xpower: Fix power_table addresses
	drm/nouveau/kms: Increase max retries in scanout position queries.
	bnx2x: Align RX buffers
	power: supply: pda_power: move from timer to delayed_work
	Input: twl4030-pwrbutton - use correct device for irq request
	md/raid10: skip spare disk as 'first' disk
	ia64: fix module loading for gcc-5.4
	tcm_fileio: Prevent information leak for short reads
	video: fbdev: udlfb: Fix buffer on stack
	sm501fb: don't return zero on failure path in sm501fb_start()
	net: hns: fix ethtool_get_strings overflow in hns driver
	cifs: small underflow in cnvrtDosUnixTm()
	rtc: ds1374: wdt: Fix issue with timeout scaling from secs to wdt ticks
	rtc: ds1374: wdt: Fix stop/start ioctl always returning -EINVAL
	perf tests kmod-path: Don't fail if compressed modules aren't supported
	Bluetooth: hci_qca: Avoid setup failure on missing rampatch
	media: c8sectpfe: fix potential NULL pointer dereference in c8sectpfe_timer_interrupt
	drm/msm: fix leak in failed get_pages
	RDMA/iwpm: Fix uninitialized error code in iwpm_send_mapinfo()
	rtlwifi: rtl_pci: Fix the bug when inactiveps is enabled.
	media: bt8xx: Fix err 'bt878_probe()'
	media: [RESEND] media: dvb-frontends: Add delay to Si2168 restart
	cros_ec: fix nul-termination for firmware build info
	platform/chrome: Use proper protocol transfer function
	mmc: avoid removing non-removable hosts during suspend
	IB/ipoib: Avoid memory leak if the SA returns a different DGID
	RDMA/cma: Use correct size when writing netlink stats
	IB/umem: Fix use of npages/nmap fields
	vgacon: Set VGA struct resource types
	drm/omap: DMM: Check for DMM readiness after successful transaction commit
	pty: cancel pty slave port buf's work in tty_release
	coresight: Fix disabling of CoreSight TPIU
	pinctrl: Really force states during suspend/resume
	iommu/vt-d: clean up pr_irq if request_threaded_irq fails
	ip6_vti: adjust vti mtu according to mtu of lower device
	RDMA/ocrdma: Fix permissions for OCRDMA_RESET_STATS
	nfsd4: permit layoutget of executable-only files
	clk: si5351: Rename internal plls to avoid name collisions
	dmaengine: ti-dma-crossbar: Fix event mapping for TPCC_EVT_MUX_60_63
	RDMA/ucma: Fix access to non-initialized CM_ID object
	Linux 4.4.124

Change-Id: Iac6f5bda7941f032c5b1f58750e084140b0e3f23
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2018-03-25 10:51:55 +02:00
Benjamin Coddington
5c503ff494 nfsd4: permit layoutget of executable-only files
[ Upstream commit 66282ec1cf004c09083c29cb5e49019037937bbd ]

Clients must be able to read a file in order to execute it, and for pNFS
that means the client needs to be able to perform a LAYOUTGET on the file.

This behavior for executable-only files was added for OPEN in commit
a043226bc1 "nfsd4: permit read opens of executable-only files".

This fixes up xfstests generic/126 on block/scsi layouts.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-24 10:58:48 +01:00
Dan Carpenter
8a58463396 cifs: small underflow in cnvrtDosUnixTm()
[ Upstream commit 564277eceeca01e02b1ef3e141cfb939184601b4 ]

January is month 1.  There is no zero-th month.  If someone passes a
zero month then it means we read from one space before the start of the
total_days_of_prev_months[] array.

We may as well also be strict about days as well.

Fixes: 1bd5bbcb65 ("[CIFS] Legacy time handling for Win9x and OS/2 part 1")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-24 10:58:46 +01:00
Filipe Manana
266bbc907c Btrfs: send, fix file hole not being preserved due to inline extent
[ Upstream commit e1cbfd7bf6dabdac561c75d08357571f44040a45 ]

Normally we don't have inline extents followed by regular extents, but
there's currently at least one harmless case where this happens. For
example, when the page size is 4Kb and compression is enabled:

  $ mkfs.btrfs -f /dev/sdb
  $ mount -o compress /dev/sdb /mnt
  $ xfs_io -f -c "pwrite -S 0xaa 0 4K" -c "fsync" /mnt/foobar
  $ xfs_io -c "pwrite -S 0xbb 8K 4K" -c "fsync" /mnt/foobar

In this case we get a compressed inline extent, representing 4Kb of
data, followed by a hole extent and then a regular data extent. The
inline extent was not expanded/converted to a regular extent exactly
because it represents 4Kb of data. This does not cause any apparent
problem (such as the issue solved by commit e1699d2d7bf6
("btrfs: add missing memset while reading compressed inline extents"))
except trigger an unexpected case in the incremental send code path
that makes us issue an operation to write a hole when it's not needed,
resulting in more writes at the receiver and wasting space at the
receiver.

So teach the incremental send code to deal with this particular case.

The issue can be currently triggered by running fstests btrfs/137 with
compression enabled (MOUNT_OPTIONS="-o compress" ./check btrfs/137).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-24 10:58:44 +01:00
NeilBrown
99478468e8 NFS: don't try to cross a mountpount when there isn't one there.
[ Upstream commit 99bbf6ecc694dfe0b026e15359c5aa2a60b97a93 ]

consider the sequence of commands:
 mkdir -p /import/nfs /import/bind /import/etc
 mount --bind / /import/bind
 mount --make-private /import/bind
 mount --bind /import/etc /import/bind/etc

 exportfs -o rw,no_root_squash,crossmnt,async,no_subtree_check localhost:/
 mount -o vers=4 localhost:/ /import/nfs
 ls -l /import/nfs/etc

You would not expect this to report a stale file handle.
Yet it does.

The manipulations under /import/bind cause the dentry for
/etc to get the DCACHE_MOUNTED flag set, even though nothing
is mounted on /etc.  This causes nfsd to call
nfsd_cross_mnt() even though there is no mountpoint.  So an
upcall to mountd for "/etc" is performed.

The 'crossmnt' flag on the export of / causes mountd to
report that /etc is exported as it is a descendant of /.  It
assumes the kernel wouldn't ask about something that wasn't
a mountpoint.  The filehandle returned identifies the
filesystem and the inode number of /etc.

When this filehandle is presented to rpc.mountd, via
"nfsd.fh", the inode cannot be found associated with any
name in /etc/exports, or with any mountpoint listed by
getmntent().  So rpc.mountd says the filehandle doesn't
exist. Hence ESTALE.

This is fixed by teaching nfsd not to trust DCACHE_MOUNTED
too much.  It is just a hint, not a guarantee.
Change nfsd_mountpoint() to return '1' for a certain mountpoint,
'2' for a possible mountpoint, and 0 otherwise.

Then change nfsd_crossmnt() to check if follow_down()
actually found a mountpount and, if not, to avoid performing
a lookup if the location is not known to certainly require
an export-point.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-24 10:58:44 +01:00