We don't need req in either of those. We don't need nr_segs in caller.
We don't really need len in caller either - iov_iter_count(&iter) will do.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
the only non-trivial detail is that we do it before rw_verify_area(),
so we'd better cap the length ourselves in aio_setup_single_rw()
case (for vectored case rw_copy_check_uvector() will do that for us).
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
We have observed a BUG() crash in fs/attr.c:notify_change(). The crash
occurs during an rsync into a filesystem that is exported via NFS.
1.) fs/attr.c:notify_change() modifies the caller's version of attr.
2.) 6de0ec00ba ("VFS: make notify_change pass ATTR_KILL_S*ID to
setattr operations") introduced a BUG() restriction such that "no
function will ever call notify_change() with both ATTR_MODE and
ATTR_KILL_S*ID set". Under some circumstances though, it will have
assisted in setting the caller's version of attr to this very
combination.
3.) 27ac0ffeac ("locks: break delegations on any attribute
modification") introduced code to handle breaking
delegations. This can result in notify_change() being re-called. attr
_must_ be explicitly reset to avoid triggering the BUG() established
in #2.
4.) The path that that triggers this is via fs/open.c:chmod_common().
The combination of attr flags set here and in the first call to
notify_change() along with a later failed break_deleg_wait()
results in notify_change() being called again via retry_deleg
without resetting attr.
Solution is to move retry_deleg in chmod_common() a bit further up to
ensure attr is completely reset.
There are other places where this seemingly could occur, such as
fs/utimes.c:utimes_common(), but the attr flags are not initially
set in such a way to trigger this.
Fixes: 27ac0ffeac ("locks: break delegations on any attribute modification")
Reported-by: Eric Meddaugh <etmsys@rit.edu>
Tested-by: Eric Meddaugh <etmsys@rit.edu>
Signed-off-by: Andrew Elble <aweits@rit.edu>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
On a distributed filesystem it's possible for lookup to discover that a
directory it just found is already cached elsewhere in the directory
heirarchy. The dcache won't let us keep the directory in both places,
so we have to move the dentry to the new location from the place we
previously had it cached.
If the parent has changed, then this requires all the same locks as we'd
need to do a cross-directory rename. But we're already in lookup
holding one parent's i_mutex, so it's too late to acquire those locks in
the right order.
The (unreliable) solution in __d_unalias is to trylock() the required
locks and return -EBUSY if it fails.
I see no particular reason for returning -EBUSY, and -ESTALE is already
the result of some other lookup races on NFS. I think -ESTALE is the
more helpful error return. It also allows us to take advantage of the
logic Jeff Layton added in c6a9428401 "vfs: fix renameat to retry on
ESTALE errors" and ancestors, which hopefully resolves some of these
errors before they're returned to userspace.
I can reproduce these cases using NFS with:
ssh root@$client '
mount -olookupcache=pos '$server':'$export' /mnt/
mkdir /mnt/TO
mkdir /mnt/DIR
touch /mnt/DIR/test.txt
while true; do
strace -e open cat /mnt/DIR/test.txt 2>&1 | grep EBUSY
done
'
ssh root@$server '
while true; do
mv $export/DIR $export/TO/DIR
mv $export/TO/DIR $export/DIR
done
'
It also helps to add some other concurrent use of the directory on the
client (e.g., "ls /mnt/TO"). And you can replace the server-side mv's
by client-side mv's that are repeatedly killed. (If the client is
interrupted while waiting for the RENAME response then it's left with a
dentry that has to go under one parent or the other, but it doesn't yet
know which.)
Acked-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
For one thing, LOOKUP_DIRECTORY will be dealt with in do_last().
For another, name can be an empty string, but not NULL - no callers
pass that and it would oops immediately if they would.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
just make const char iname[] the last member and compare name->name with
name->iname instead of checking name->separate
We need to make sure that out-of-line name doesn't end up allocated adjacent
to struct filename refering to it; fortunately, it's easy to achieve - just
allocate that struct filename with one byte in ->iname[], so that ->iname[0]
will be inside the same object and thus have an address different from that
of out-of-line name [spotted by Boqun Feng <boqun.feng@gmail.com>]
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
During the roll-forward recovery, if we found a new data index written fsync
lastly, we need to recover new block address.
But, if that address was corrupted, we should not recover that.
Otherwise, f2fs gets kernel panic from:
In check_index_in_prev_nodes(),
sentry = get_seg_entry(sbi, segno);
--------------------------> out-of-range segno.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If there are multiple fsynced dnodes having a dent flag, roll-forward routine
sets FI_INC_LINK for their inode, and recovery_dentry increases its link count
accordingly.
That results in normal file having a link count as 2, so we can't unlink those
files.
This was added to handle several inode blocks having same inode number with
different directory paths.
But, current f2fs doesn't replay all of path changes and only recover its dentry
for the last fsynced inode block.
So, there is no reason to do this.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If f2fs was corrupted with missing dot dentries, it needs to recover them after
fsck.f2fs detection.
The underlying precedure is:
1. The fsck.f2fs remains F2FS_INLINE_DOTS flag in directory inode, if it detects
missing dot dentries.
2. When f2fs looks up the corrupted directory, it triggers f2fs_add_link with
proper inode numbers and their dot and dotdot names.
3. Once f2fs recovers the directory without errors, it removes F2FS_INLINE_DOTS
finally.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch fixes the below warning.
sparse warnings: (new ones prefixed by >>)
>> fs/f2fs/inode.c:56:23: sparse: restricted __le32 degrades to integer
>> fs/f2fs/inode.c:56:52: sparse: restricted __le32 degrades to integer
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Map bh over max size which caller defined is not needed, limit it in
f2fs_map_bh.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch fixes to dirty inode for persisting i_advise of f2fs inode info into
on-disk inode if user sets system.advise through setxattr. Otherwise the new
value will be lost.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Normally, due to DIO_SKIP_HOLES flag is set by default, blockdev_direct_IO in
f2fs_direct_IO tries to skip DIO in holes when writing inside i_size, this
makes us falling back to buffered IO which shows lower performance.
So in commit 59b802e5a4 ("f2fs: allocate data blocks in advance for
f2fs_direct_IO"), we improve perfromance by allocating data blocks in advance
if we meet holes no matter in i_size or not, since with it we can avoid falling
back to buffered IO.
But we forget to consider for unwritten fallocated block in this commit.
This patch tries to fix it for fallocate case, this helps to improve
performance.
Test result:
Storage info: sandisk ultra 64G micro sd card.
touch /mnt/f2fs/file
truncate -s 67108864 /mnt/f2fs/file
fallocate -o 0 -l 67108864 /mnt/f2fs/file
time dd if=/dev/zero of=/mnt/f2fs/file bs=1M count=64 conv=notrunc oflag=direct
Time before applying the patch:
67108864 bytes (67 MB) copied, 36.16 s, 1.9 MB/s
real 0m36.162s
user 0m0.000s
sys 0m0.180s
Time after applying the patch:
67108864 bytes (67 MB) copied, 27.7776 s, 2.4 MB/s
real 0m27.780s
user 0m0.000s
sys 0m0.036s
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Enable inline_data feature by default since it brings us better
performance and space utilization and now has already stable.
Add another option noinline_data to disable it during mount.
Suggested-by: Jaegeuk Kim <jaegeuk@kernel.org>
Suggested-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch tries to preserve last extent info in extent tree cache into on-disk
inode, so this can help us to reuse the last extent info next time for
performance.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
With normal extent info cache, we records largest extent mapping between logical
block and physical block into extent info, and we persist extent info in on-disk
inode.
When we enable extent tree cache, if extent info of on-disk inode is exist, and
the extent is not a small fragmented mapping extent. We'd better to load the
extent info into extent tree cache when inode is loaded. By this way we can have
more chance to hit extent tree cache rather than taking more time to read dnode
page for block address.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch introduces __{find,grab}_extent_tree for reusing by following
patches.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Split __set_data_blkaddr from f2fs_update_extent_cache for readability.
Additionally rename __set_data_blkaddr to set_data_blkaddr for exporting.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Fast symlink can utilize inline data flow to avoid using any
i_addr region, since we need to handle many cases such as
truncation, roll-forward recovery, and fsck/dump tools.
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch is to avoid some punch_hole overhead when releasing volatile data.
If volatile data was not written yet, we just can make the first page as zero.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch removes wrong f2fs_bug_on in truncate_inline_inode.
When there is no space, it can happen a corner case where i_isze is over
MAX_INLINE_SIZE while its inode is still inline_data.
The scenario is
1. write small data into file #A.
2. fill the whole partition to 100%.
3. truncate 4096 on file #A.
4. write data at 8192 offset.
--> f2fs_write_begin
-> -ENOSPC = f2fs_convert_inline_page
-> f2fs_write_failed
-> truncate_blocks
-> truncate_inline_inode
BUG_ON, since i_size is 4096.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Previously, f2fs_write_data_pages has a mutex, sbi->writepages, to serialize
data writes to maximize write bandwidth, while sacrificing multi-threads
performance.
Practically, however, multi-threads environment is much more important for
users. So this patch tries to remove the mutex.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch modifies to call set_buffer_new, if new blocks are allocated.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch tries to set SBI_NEED_FSCK flag into sbi only when we fail to recover
in fill_super, so we could skip fscking image when we fail to fill super for
other reason.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In the following call stack, f2fs changes the bitmap for dirty segments and # of
dirty sentries without grabbing sit_i->sentry_lock.
This can result in mismatch on bitmap and # of dirty sentries, since if there
are some direct_io operations.
In allocate_data_block,
- __allocate_new_segments
- mutex_lock(&curseg->curseg_mutex);
- s_ops->allocate_segment
- new_curseg/change_curseg
- reset_curseg
- __set_sit_entry_type
- __mark_sit_entry_dirty
- set_bit(dirty_sentries_bitmap)
- dirty_sentries++;
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In __allocate_data_blocks, we should check current blkaddr which is located at
ofs_in_node of dnode page instead of checking first blkaddr all the time.
Otherwise we can only allocate one blkaddr in each dnode page. Fix it.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Previously if inode is with inline data, we will try to invalid partial inline
data in page #0 when we truncate size of inode in truncate_partial_data_page().
And then we set page #0 to dirty, after this we can synchronize inode page with
page #0 at ->writepage().
But sometimes we will fail to operate page #0 in truncate_partial_data_page()
due to below reason:
a) if offset is zero, we will skip setting page #0 to dirty.
b) if page #0 is not uptodate, we will fail to update it as it has no mapping
data.
So with following operations, we will meet recent data which should be
truncated.
1.write inline data to file
2.sync first data page to inode page
3.truncate file size to 0
4.truncate file size to max_inline_size
5.echo 1 > /proc/sys/vm/drop_caches
6.read file --> meet original inline data which is remained in inode page.
This patch renames truncate_inline_data() to truncate_inline_inode() for code
readability, then use truncate_inline_inode() to truncate inline data in inode
page in truncate_blocks() and truncate page #0 in truncate_partial_data_page()
for fixing.
v2:
o truncate partially #0 page in truncate_partial_data_page to avoid keeping
old data in #0 page.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Our f2fs_acl_create is copied and modified from posix_acl_create to avoid
deadlock bug when inline_dentry feature is enabled.
Now, we got reference leaks in posix_acl_create, and this has been fixed in
commit fed0b588be ("posix_acl: fix reference leaks in posix_acl_create")
by Omar Sandoval.
https://lkml.org/lkml/2015/2/9/5
Let's fix this issue in f2fs_acl_create too.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Reviewed-by: Changman Lee <cm224.lee@ssamsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>