Commit graph

42638 commits

Author SHA1 Message Date
Jaegeuk Kim
a56c7c6fb3 f2fs: set GFP_NOFS for grab_cache_page
For normal inodes, their pages are allocated with __GFP_FS, which can cause
filesystem calls when reclaiming memory.
This can incur a dead lock condition accordingly.

So, this patch addresses this problem by introducing
f2fs_grab_cache_page(.., bool for_write), which calls
grab_cache_page_write_begin() with AOP_FLAG_NOFS.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-12 13:38:03 -07:00
Jaegeuk Kim
6e2c64ad7c f2fs: fix SSA updates resulting in corruption
The f2fs_collapse_range and f2fs_insert_range changes the block addresses
directly. But that can cause uncovered SSA updates.
In that case, we need to give up to change the block addresses and do buffered
writes to keep filesystem consistency.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-12 13:38:02 -07:00
Jaegeuk Kim
a125702326 Revert "f2fs: do not skip dentry block writes"
The periodic checkpoint can resolve the previous issue.
So, now we can use this again to improve the reported performance regression:

https://lkml.org/lkml/2015/10/8/20

This reverts commit 15bec0ff5a9ba6d203178fa8772259df6207942a.
2015-10-12 13:38:02 -07:00
Jaegeuk Kim
c912a8298c f2fs: add F2FS_GOING_DOWN_METAFLUSH to test power-failure
This patch introduces F2FS_GOING_DOWN_METAFLUSH which flushes meta pages like
SSA blocks and then blocks all the writes.
This can be used by power-failure tests.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-12 13:37:54 -07:00
Tejun Heo
b817525a4a writeback: bdi_writeback iteration must not skip dying ones
bdi_for_each_wb() is used in several places to wake up or issue
writeback work items to all wb's (bdi_writeback's) on a given bdi.
The iteration is performed by walking bdi->cgwb_tree; however, the
tree only indexes wb's which are currently active.

For example, when a memcg gets associated with a different blkcg, the
old wb is removed from the tree so that the new one can be indexed.
The old wb starts dying from then on but will linger till all its
inodes are drained.  As these dying wb's may still host dirty inodes,
writeback operations which affect all wb's must include them.
bdi_for_each_wb() skipping dying wb's led to sync(2) missing and
failing to sync the inodes belonging to those wb's.

This patch adds a RCU protected @bdi->wb_list which lists all wb's
beloinging to that bdi.  wb's are added on creation and removed on
release rather than on the start of destruction.  bdi_for_each_wb()
usages are replaced with list_for_each[_continue]_rcu() iterations
over @bdi->wb_list and bdi_for_each_wb() and its helpers are removed.

v2: Updated as per Jan.  last_wb ref leak in bdi_split_work_to_wbs()
    fixed and unnecessary list head severing in cgwb_bdi_destroy()
    removed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Artem Bityutskiy <dedekind1@gmail.com>
Fixes: ebe41ab0c7 ("writeback: implement bdi_for_each_wb()")
Link: http://lkml.kernel.org/g/1443012552.19983.209.camel@gmail.com
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-10-12 10:31:12 -06:00
Tejun Heo
6fdf860f15 writeback: fix bdi_writeback iteration in wakeup_dirtytime_writeback()
wakeup_dirtytime_writeback() walks and wakes up all wb's of all bdi's;
unfortunately, it was always waking up bdi->wb instead of the wb being
walked.  Fix it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 001fe6f617 ("writeback: make wakeup_dirtytime_writeback() handle multiple bdi_writeback's")
Reviewed-by: Jan Kara <jack@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-10-12 10:31:11 -06:00
Konstantin Khlebnikov
5ffdbe8bf1 ovl: free lower_mnt array in ovl_put_super
This fixes memory leak after umount.

Kmemleak report:

unreferenced object 0xffff8800ba791010 (size 8):
  comm "mount", pid 2394, jiffies 4294996294 (age 53.920s)
  hex dump (first 8 bytes):
    20 1c 13 02 00 88 ff ff                           .......
  backtrace:
    [<ffffffff811f8cd4>] create_object+0x124/0x2c0
    [<ffffffff817a059b>] kmemleak_alloc+0x7b/0xc0
    [<ffffffff811dffe6>] __kmalloc+0x106/0x340
    [<ffffffffa0152bfc>] ovl_fill_super+0x55c/0x9b0 [overlay]
    [<ffffffff81200ac4>] mount_nodev+0x54/0xa0
    [<ffffffffa0152118>] ovl_mount+0x18/0x20 [overlay]
    [<ffffffff81201ab3>] mount_fs+0x43/0x170
    [<ffffffff81220d34>] vfs_kern_mount+0x74/0x170
    [<ffffffff812233ad>] do_mount+0x22d/0xdf0
    [<ffffffff812242cb>] SyS_mount+0x7b/0xc0
    [<ffffffff817b6bee>] entry_SYSCALL_64_fastpath+0x12/0x76
    [<ffffffffffffffff>] 0xffffffffffffffff

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Fixes: dd662667e6 ("ovl: add mutli-layer infrastructure")
Cc: <stable@vger.kernel.org> # v4.0+
2015-10-12 17:11:44 +02:00
Konstantin Khlebnikov
0f95502ad8 ovl: free stack of paths in ovl_fill_super
This fixes small memory leak after mount.

Kmemleak report:

unreferenced object 0xffff88003683fe00 (size 16):
  comm "mount", pid 2029, jiffies 4294909563 (age 33.380s)
  hex dump (first 16 bytes):
    20 27 1f bb 00 88 ff ff 40 4b 0f 36 02 88 ff ff   '......@K.6....
  backtrace:
    [<ffffffff811f8cd4>] create_object+0x124/0x2c0
    [<ffffffff817a059b>] kmemleak_alloc+0x7b/0xc0
    [<ffffffff811dffe6>] __kmalloc+0x106/0x340
    [<ffffffffa01b7a29>] ovl_fill_super+0x389/0x9a0 [overlay]
    [<ffffffff81200ac4>] mount_nodev+0x54/0xa0
    [<ffffffffa01b7118>] ovl_mount+0x18/0x20 [overlay]
    [<ffffffff81201ab3>] mount_fs+0x43/0x170
    [<ffffffff81220d34>] vfs_kern_mount+0x74/0x170
    [<ffffffff812233ad>] do_mount+0x22d/0xdf0
    [<ffffffff812242cb>] SyS_mount+0x7b/0xc0
    [<ffffffff817b6bee>] entry_SYSCALL_64_fastpath+0x12/0x76
    [<ffffffffffffffff>] 0xffffffffffffffff

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Fixes: a78d9f0d5d ("ovl: support multiple lower layers")
Cc: <stable@vger.kernel.org> # v4.0+
2015-10-12 17:11:43 +02:00
Miklos Szeredi
1c8a47df36 ovl: fix open in stacked overlay
If two overlayfs filesystems are stacked on top of each other, then we need
recursion in ovl_d_select_inode().

I guess d_backing_inode() is supposed to do that.  But currently it doesn't
and that functionality is open coded in vfs_open().  This is now copied
into ovl_d_select_inode() to fix this regression.

Reported-by: Alban Crequy <alban.crequy@gmail.com>
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Fixes: 4bacc9c923 ("overlayfs: Make f_path always point to the overlay...")
Cc: David Howells <dhowells@redhat.com>
Cc: <stable@vger.kernel.org> # v4.2+
2015-10-12 15:56:20 +02:00
David Howells
ab79efab0a ovl: fix dentry reference leak
In ovl_copy_up_locked(), newdentry is leaked if the function exits through
out_cleanup as this just to out after calling ovl_cleanup() - which doesn't
actually release the ref on newdentry.

The out_cleanup segment should instead exit through out2 as certainly
newdentry leaks - and possibly upper does also, though this isn't caught
given the catch of newdentry.

Without this fix, something like the following is seen:

	BUG: Dentry ffff880023e9eb20{i=f861,n=#ffff880023e82d90} still in use (1) [unmount of tmpfs tmpfs]
	BUG: Dentry ffff880023ece640{i=0,n=bigfile}  still in use (1) [unmount of tmpfs tmpfs]

when unmounting the upper layer after an error occurred in copyup.

An error can be induced by creating a big file in a lower layer with
something like:

	dd if=/dev/zero of=/lower/a/bigfile bs=65536 count=1 seek=$((0xf000))

to create a large file (4.1G).  Overlay an upper layer that is too small
(on tmpfs might do) and then induce a copy up by opening it writably.

Reported-by: Ulrich Obergfell <uobergfe@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: <stable@vger.kernel.org> # v3.18+
2015-10-12 15:56:20 +02:00
David Howells
0480334fa6 ovl: use O_LARGEFILE in ovl_copy_up()
Open the lower file with O_LARGEFILE in ovl_copy_up().

Pass O_LARGEFILE unconditionally in ovl_copy_up_data() as it's purely for
catching 32-bit userspace dealing with a file large enough that it'll be
mishandled if the application isn't aware that there might be an integer
overflow.  Inside the kernel, there shouldn't be any problems.

Reported-by: Ulrich Obergfell <uobergfe@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: <stable@vger.kernel.org> # v3.18+
2015-10-12 15:56:20 +02:00
Dave Chinner
1e2103cbf4 Merge branch 'xfs-misc-fixes-for-4.4-1' into for-next 2015-10-12 18:38:25 +11:00
Dave Chinner
8a56d7c305 Merge branch 'xfs-io-fixes' into for-next 2015-10-12 18:38:11 +11:00
Dave Chinner
316433beda Merge branch 'xfs-logging-fixes' into for-next 2015-10-12 18:37:58 +11:00
Eric Sandeen
9e92054e8e xfs: simplify /proc teardown & error handling
remove_proc_subtree() was added in 3.9, and can be
used to simplify our procfile creation error handling
and cleanup, removing the nested gotos.  It simply
removes fs/xfs and everything created under it.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-10-12 18:21:22 +11:00
Bill O'Donnell
ff6d6af235 xfs: per-filesystem stats counter implementation
This patch modifies the stats counting macros and the callers
to those macros to properly increment, decrement, and add-to
the xfs stats counts. The counts for global and per-fs stats
are correctly advanced, and cleared by writing a "1" to the
corresponding clear file.

global counts: /sys/fs/xfs/stats/stats
per-fs counts: /sys/fs/xfs/sda*/stats/stats

global clear:  /sys/fs/xfs/stats/stats_clear
per-fs clear:  /sys/fs/xfs/sda*/stats/stats_clear

[dchinner: cleaned up macro variables, removed CONFIG_FS_PROC around
 stats structures and macros. ]

Signed-off-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-10-12 18:21:22 +11:00
Bill O'Donnell
225e463558 xfs: per-filesystem stats in sysfs
This patch implements per-filesystem stats objects in sysfs. It
depends on the application of the previous patch series that
develops the infrastructure to support both xfs global stats and
xfs per-fs stats in sysfs.

Stats objects are instantiated when an xfs filesystem is mounted
and deleted on unmount. With this patch, the stats directory is
created and populated with the familiar stats and stats_clear files.
Example:
        /sys/fs/xfs/sda9/stats/stats
        /sys/fs/xfs/sda9/stats/stats_clear

With this patch, the individual counts within the new per-fs
stats file(s) remain at zero. Functions that use the the macros
to increment, decrement, and add-to the per-fs stats counts will
be covered in a separate new patch to follow this one. Note that
the counts within the global stats file (/sys/fs/xfs/stats/stats)
advance normally and can be cleared as it was prior to this patch.

[dchinner: move setup/teardown to xfs_fs_{fill|put}_super() so
it is down before/after any path that uses the per-mount stats. ]

Signed-off-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-10-12 18:21:19 +11:00
Eric Sandeen
847f9f6875 xfs: more info from kmem deadlocks and high-level error msgs
In an effort to get more useful out of "possible memory
allocation deadlock" messages, print the size of the
requested allocation, and dump the stack if the xfs error
level is tuned high.

The stack dump is implemented in define_xfs_printk_level()
for error levels >= LOGLEVEL_ERR, partly because it
seems generically useful, and also because kmem.c has
no knowledge of xfs error level tunables or other such bits,
it's very kmem-specific.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-10-12 16:04:45 +11:00
Eric Sandeen
91f9f5fe1e xfs: avoid null *src in memcpy call in xlog_write
The gcc undefined behavior sanitizer caught this; surely
any sane memcpy implementation will no-op if size == 0,
but behavior with a *src of NULL is technically undefined
(declared nonnull), so avoid it here.

We are actually in this situation frequently via
xlog_commit_record(), because:

        struct xfs_log_iovec reg = {
                .i_addr = NULL,
                .i_len = 0,
                .i_type = XLOG_REG_TYPE_COMMIT,
        };

Reported-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-10-12 16:04:15 +11:00
Brian Foster
dbd5c8c9a2 xfs: pass total block res. as total xfs_bmapi_write() parameter
The total field from struct xfs_alloc_arg is a bit of an unknown
commodity. It is documented as the total block requirement for the
transaction and is used in this manner from most call sites by virtue of
passing the total block reservation of the transaction associated with
an allocation. Several xfs_bmapi_write() callers pass hardcoded values
of 0 or 1 for the total block requirement, which is a historical oddity
without any clear reasoning.

The xfs_iomap_write_direct() caller, for example, passes 0 for the total
block requirement. This has been determined to cause problems in the
form of ABBA deadlocks of AGF buffers due to incorrect AG selection in
the block allocator. Specifically, the xfs_alloc_space_available()
function incorrectly selects an AG that doesn't actually have sufficient
space for the allocation. This occurs because the args.total field is 0
and thus the remaining free space check on the AG doesn't actually
consider the size of the allocation request. This locks the AGF buffer,
the allocation attempt proceeds and ultimately fails (in
xfs_alloc_fix_minleft()), and xfs_alloc_vexent() moves on to the next
AG. In turn, this can lead to incorrect AG locking order (if the
allocator wraps around, attempting to lock AG 0 after acquiring AG N)
and thus deadlock if racing with another operation. This problem has
been reproduced via generic/299 on smallish (1GB) ramdisk test devices.

To avoid this problem, replace the undocumented hardcoded total
parameters from the iomap and utility callers to pass the block
reservation used for the associated transaction. This is consistent with
other xfs_bmapi_write() callers throughout XFS. The assumption is that
the total field allows the selection of an AG that can handle the entire
operation rather than simply the allocation/range being requested (e.g.,
resulting btree splits, etc.). This addresses the aforementioned
generic/299 hang by ensuring AG selection only occurs when the
allocation can be satisfied by the AG.

Reported-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-10-12 16:04:13 +11:00
Jan Tulak
51fcbfe709 xfs: avoid dependency on Linux XATTR_SIZE_MAX
Currently, we depends on Linux XATTR value for on disk
definition. Which causes trouble on other platforms and
maybe also if this value was to change.

Fix it by creating a custom definition independent from
those in Linux (although with the same values), so it is OK
with the be16 fields used for holding these attributes.

This patch reflects a change in xfsprogs.

Signed-off-by: Jan Tulak <jtulak@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-10-12 16:03:59 +11:00
Jan Tulak
4e247614a9 xfs: prefix XATTR_LIST_MAX with XFS_
Remove a hard dependency of Linux XATTR_LIST_MAX value by using
a prefixed version. This patch reflects the same change in xfsprogs.

Signed-off-by: Jan Tulak <jtulak@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-10-12 16:02:56 +11:00
Geliang Tang
fef4ded8cb libxfs: fix two comment typos
Just fix two typos in code comments.

Signed-off-by: Geliang Tang <geliangtang@163.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-10-12 16:02:32 +11:00
Brian Foster
0a50f162af xfs: add an xfs_zero_eof() tracepoint
Add a tracepoint in xfs_zero_eof() to facilitate tracking and debugging
EOF zeroing events. This has proven useful in the context of other
direct I/O tracepoints to ensure EOF zeroing occurs within appropriate
file ranges.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-10-12 16:02:08 +11:00
Brian Foster
3136e8bb30 xfs: always drain dio before extending aio write submission
XFS supports and typically allows concurrent asynchronous direct I/O
submission to a single file. One exception to the rule is that file
extending dio writes that start beyond the current EOF (e.g.,
potentially create a hole at EOF) require exclusive I/O access to the
file. This is because such writes must zero any pre-existing blocks
beyond EOF that are exposed by virtue of now residing within EOF as a
result of the write about to be submitted.

Before EOF zeroing can occur, the current file i_size must be stabilized
to avoid data corruption. In this scenario, XFS upgrades the iolock to
exclude any further I/O submission, waits on in-flight I/O to complete
to ensure i_size is up to date (i_size is updated on dio write
completion) and restarts the various checks against the state of the
file. The problem is that this protection sequence is triggered only
when the iolock is currently held shared. While this is true for async
dio in most cases, the caller may upgrade the lock in advance based on
arbitrary circumstances with respect to EOF zeroing. For example, the
iolock is always acquired exclusively if the start offset is not block
aligned. This means that even though the iolock is already held
exclusive for such I/Os, pending I/O is not drained and thus EOF zeroing
can occur based on an unstable i_size.

This problem has been reproduced as guest data corruption in virtual
machines with file-backed qcow2 virtual disks hosted on an XFS
filesystem. The virtual disks must be configured with aio=native mode
and the must not be truncated out to the maximum file size (as some virt
managers will do).

Update xfs_file_aio_write_checks() to unconditionally drain in-flight
dio before EOF zeroing can occur. Rather than trigger the wait based on
iolock state, use a new flag and upgrade the iolock when necessary. Note
that this results in a full restart of the inode checks even when the
iolock was already held exclusive when technically it is only required
to recheck i_size. This should be a rare enough occurrence that it is
preferable to keep the code simple rather than create an alternate
restart jump target.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-10-12 16:02:05 +11:00
Brian Foster
a45086e27d xfs: validate metadata LSNs against log on v5 superblocks
Since the onset of v5 superblocks, the LSN of the last modification has
been included in a variety of on-disk data structures. This LSN is used
to provide log recovery ordering guarantees (e.g., to ensure an older
log recovery item is not replayed over a newer target data structure).

While this works correctly from the point a filesystem is formatted and
mounted, userspace tools have some problematic behaviors that defeat
this mechanism. For example, xfs_repair historically zeroes out the log
unconditionally (regardless of whether corruption is detected). If this
occurs, the LSN of the filesystem is reset and the log is now in a
problematic state with respect to on-disk metadata structures that might
have a larger LSN. Until either the log catches up to the highest
previously used metadata LSN or each affected data structure is modified
and written out without incident (which resets the metadata LSN), log
recovery is susceptible to filesystem corruption.

This problem is ultimately addressed and repaired in the associated
userspace tools. The kernel is still responsible to detect the problem
and notify the user that something is wrong. Check the superblock LSN at
mount time and fail the mount if it is invalid. From that point on,
trigger verifier failure on any metadata I/O where an invalid LSN is
detected. This results in a filesystem shutdown and guarantees that we
do not log metadata changes with invalid LSNs on disk. Since this is a
known issue with a known recovery path, present a warning to instruct
the user how to recover.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-10-12 15:59:25 +11:00
Tetsuo Handa
5bf97b1cb4 xfs: Print name and pid when memory allocation loops
This patch adds comm name and pid to warning messages printed by
kmem_alloc(), kmem_zone_alloc() and xfs_buf_allocate_memory().
This will help telling which memory allocations (e.g. kernel worker
threads, OOM victim tasks, neither) are stalling because these functions
are passing __GFP_NOWARN which suppresses not only backtrace but comm name
and pid.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-10-12 15:41:29 +11:00
Brian Foster
b7cdc66be5 xfs: log local to remote symlink conversions correctly on v5 supers
A local format symlink inode is converted to extent format when an
extended attribute is set on an inode as part of the attribute fork
creation. This means a block is allocated, the local symlink target name
is copied to the block and the block is logged. Currently,
xfs_bmap_local_to_extents() handles logging the remote block data based
on the size of the data fork prior to the conversion. This is not
correct on v5 superblock filesystems, which add an additional header to
remote symlink blocks that is nonexistent in local format inodes.

As a result, the full length of the remote symlink block content is not
logged. This can lead to corruption should a crash occur and log
recovery replay this transaction.

Since a callout is already used to initialize the new remote symlink
block, update the local-to-extents conversion mechanism to make the
callout also responsible for logging the block. It is already required
to set the log buffer type and format the block appropriately based on
the superblock version. This ensures the remote symlink is always logged
correctly. Note that xfs_bmap_local_to_extents() is only called for
symlinks so there are no other callouts that require modification.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-10-12 15:40:24 +11:00
Brian Foster
009c6e871e xfs: add missing ilock around dio write last extent alignment
The iomap codepath (via get_blocks()) acquires and release the inode
lock in the case of a direct write that requires block allocation. This
is because xfs_iomap_write_direct() allocates a transaction, which means
the ilock must be dropped and reacquired after the transaction is
allocated and reserved.

xfs_iomap_write_direct() invokes xfs_iomap_eof_align_last_fsb() before
the transaction is created and thus before the ilock is reacquired. This
can lead to calls to xfs_iread_extents() and reads of the in-core extent
list without any synchronization (via xfs_bmap_eof() and
xfs_bmap_last_extent()). xfs_iread_extents() assert fails if the ilock
is not held, but this is not currently seen in practice as the current
callers had already invoked xfs_bmapi_read().

What has been seen in practice are reports of crashes down in the
xfs_bmap_eof() codepath on direct writes due to seemingly bogus pointer
references from xfs_iext_get_ext(). While an explicit reproducer is not
currently available to confirm the cause of the problem, crash analysis
and code inspection from David Jeffrey had identified the insufficient
locking.

xfs_iomap_eof_align_last_fsb() is called from other contexts with the
inode lock already held, so we cannot acquire it therein.
__xfs_get_blocks() acquires and drops the ilock with variable flags to
cover the event that the extent list must be read in. The common case is
that __xfs_get_blocks() acquires the shared ilock. To provide locking
around the last extent alignment call without adding more lock cycles to
the dio path, update xfs_iomap_write_direct() to expect the shared ilock
held on entry and do the extent alignment under its protection. Demote
the lock, if necessary, from __xfs_get_blocks() and push the
xfs_qm_dqattach() call outside of the shared lock critical section.
Also, add an assert to document that the extent list is always expected
to be present in this path. Otherwise, we risk a call to
xfs_iread_extents() while under the shared ilock. This is safe as all
current callers have executed an xfs_bmapi_read() call under the current
iolock context.

Reported-by: David Jeffery <djeffery@redhat.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-10-12 15:34:20 +11:00
Zhaohongjiang
5cb13dcd0f cancel the setfilesize transation when io error happen
When I ran xfstest/073 case, the remount process was blocked to wait
transactions to be zero. I found there was a io error happened, and
the setfilesize transaction was not released properly. We should add
the changes to cancel the io error in this case.

Reproduction steps:
1. dd if=/dev/zero of=xfs1.img bs=1M count=2048
2. mkfs.xfs xfs1.img
3. losetup -f ./xfs1.img /dev/loop0
4. mount -t xfs /dev/loop0 /home/test_dir/
5. mkdir /home/test_dir/test
6. mkfs.xfs -dfile,name=image,size=2g
7. mount -t xfs -o loop image /home/test_dir/test
8. cp a file bigger than 2g to /home/test_dir/test
9. mount -t xfs -o remount,ro /home/test_dir/test

[ dchinner: moved io error detection to xfs_setfilesize_ioend() after
  transaction context restoration. ]

Signed-off-by: Zhao Hongjiang <zhaohongjiang@huawei.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-10-12 15:28:39 +11:00
Bill O'Donnell
80529c45ab xfs: pass xfsstats structures to handlers and macros
This patch is the next step toward per-fs xfs stats. The patch makes
the show and clear routines able to handle any stats structure
associated with a kobject.

Instead of a single global xfsstats structure, add kobject and a pointer
to a per-cpu struct xfsstats. Modify the macros that manipulate the stats
accordingly: XFS_STATS_INC, XFS_STATS_DEC, and XFS_STATS_ADD now access
xfsstats->xs_stats.

The sysfs functions need to get from the kobject back to the xfsstats
structure which contains it, and pass the pointer to the ->xs_stats
percpu structure into the show & clear routines.

Signed-off-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-10-12 05:19:45 +11:00
Bill O'Donnell
a27c264009 xfs: consolidate sysfs ops
As a part of the series to move xfs global stats from procfs to sysfs,
this patch consolidates the sysfs ops functions and removes redundancy.

Signed-off-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-10-12 05:18:45 +11:00
Bill O'Donnell
50cf5b7402 xfs: remove unused procfs code
As a part of the work to move xfs global stats from procfs to sysfs,
this patch removes the now unused procfs code that was xfs stat specific.

Signed-off-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-10-12 05:17:45 +11:00
Bill O'Donnell
32f0ea0521 xfs: create symlink proc/fs/xfs/stat to sys/fs/xfs/stats
As a part of the work to move xfs global stats from procfs to sysfs,
this patch creates the symlink from proc/fs/xfs/stat to sys/fs/xfs/stats.

Signed-off-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-10-12 05:16:45 +11:00
Bill O'Donnell
bb230c1247 xfs: create global stats and stats_clear in sysfs
Currently, xfs global stats are in procfs. This patch introduces
(replicates) the global stats in sysfs. Additionally a stats_clear file
is introduced in sysfs.

Signed-off-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-10-12 05:15:45 +11:00
Trond Myklebust
daf3761c9f namei: results of d_is_negative() should be checked after dentry revalidation
Leandro Awa writes:
 "After switching to version 4.1.6, our parallelized and distributed
  workflows now fail consistently with errors of the form:

  T34: ./regex.c:39:22: error: config.h: No such file or directory

  From our 'git bisect' testing, the following commit appears to be the
  possible cause of the behavior we've been seeing: commit 766c4cbfacd8"

Al Viro says:
 "What happens is that 766c4cbfac got the things subtly wrong.

  We used to treat d_is_negative() after lookup_fast() as "fall with
  ENOENT".  That was wrong - checking ->d_flags outside of ->d_seq
  protection is unreliable and failing with hard error on what should've
  fallen back to non-RCU pathname resolution is a bug.

  Unfortunately, we'd pulled the test too far up and ran afoul of
  another kind of staleness.  The dentry might have been absolutely
  stable from the RCU point of view (and we might be on UP, etc), but
  stale from the remote fs point of view.  If ->d_revalidate() returns
  "it's actually stale", dentry gets thrown away and the original code
  wouldn't even have looked at its ->d_flags.

  What we need is to check ->d_flags where 766c4cbfac does (prior to
  ->d_seq validation) but only use the result in cases where we do not
  discard this dentry outright"

Reported-by: Leandro Awa <lawa@nvidia.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=104911
Fixes: 766c4cbfac ("namei: d_is_negative() should be checked...")
Tested-by: Leandro Awa <lawa@nvidia.com>
Cc: stable@vger.kernel.org # v4.1+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-10-10 10:17:27 -07:00
David Sterba
ee86395458 btrfs: comment the rest of implicit barriers before waitqueue_active
There are atomic operations that imply the barrier for waitqueue_active
mixed in an if-condition.

Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-10 18:42:00 +02:00
David Sterba
779adf0f64 btrfs: remove extra barrier before waitqueue_active
Removing barriers is scary, but a call to atomic_dec_and_test implies
a barrier, so we don't need to issue another one.

Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-10 18:40:33 +02:00
David Sterba
a83342aa0c btrfs: add comments to barriers before waitqueue_active
Reduce number of undocumented barriers out there.

Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-10 18:40:04 +02:00
David Sterba
33a9eca7e4 btrfs: comment waitqueue_active implied by locks
Suggested-by: Chris Mason <clm@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-10 18:35:10 +02:00
David Sterba
b666a9cd99 btrfs: add barrier for waitqueue_active in clear_btree_io_tree
waitqueue_active should be preceded by a barrier, in this function we
don't need to call it all the time.

Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-10 18:24:48 +02:00
David Sterba
730d9ec36b btrfs: remove waitqueue_active check from btrfs_rm_dev_replace_unblocked
Normally the waitqueue_active would need a barrier, but this is not
necessary here because it's not a performance sensitive context and we
can call wake_up directly.

Suggested-by: Chris Mason <clm@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-10 18:16:38 +02:00
Linus Torvalds
175d58cfed Merge branch 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "These are small and assorted.  Neil's is the oldest, I dropped the
  ball thinking he was going to send it in"

* 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: support NFSv2 export
  Btrfs: open_ctree: Fix possible memory leak
  Btrfs: fix deadlock when finalizing block group creation
  Btrfs: update fix for read corruption of compressed and shared extents
  Btrfs: send, fix corner case for reference overwrite detection
2015-10-09 16:39:35 -07:00
Jaegeuk Kim
6066d8cdb6 f2fs: merge meta writes as many possible
This patch tries to merge IOs as many as possible when background flusher
conducts flushing the dirty meta pages.

[Before]

...
2fs_submit_write_bio: dev = (8,18), WRITE_SYNC(MP), META, sector = 124320, size = 4096
f2fs_submit_write_bio: dev = (8,18), WRITE_SYNC(MP), META, sector = 124560, size = 32768
f2fs_submit_write_bio: dev = (8,18), WRITE_SYNC(MP), META, sector = 95720, size = 987136
f2fs_submit_write_bio: dev = (8,18), WRITE_SYNC(MP), META, sector = 123928, size = 4096
f2fs_submit_write_bio: dev = (8,18), WRITE_SYNC(MP), META, sector = 123944, size = 8192
f2fs_submit_write_bio: dev = (8,18), WRITE_SYNC(MP), META, sector = 123968, size = 45056
f2fs_submit_write_bio: dev = (8,18), WRITE_SYNC(MP), META, sector = 124064, size = 4096
f2fs_submit_write_bio: dev = (8,18), WRITE_SYNC(MP), META, sector = 97648, size = 1007616
f2fs_submit_write_bio: dev = (8,18), WRITE_SYNC(MP), META, sector = 123776, size = 8192
f2fs_submit_write_bio: dev = (8,18), WRITE_SYNC(MP), META, sector = 123800, size = 32768
f2fs_submit_write_bio: dev = (8,18), WRITE_SYNC(MP), META, sector = 124624, size = 4096
f2fs_submit_write_bio: dev = (8,18), WRITE_SYNC(MP), META, sector = 99616, size = 921600
f2fs_submit_write_bio: dev = (8,18), WRITE_SYNC(MP), META, sector = 123608, size = 4096
f2fs_submit_write_bio: dev = (8,18), WRITE_SYNC(MP), META, sector = 123624, size = 77824
f2fs_submit_write_bio: dev = (8,18), WRITE_SYNC(MP), META, sector = 123792, size = 4096
f2fs_submit_write_bio: dev = (8,18), WRITE_SYNC(MP), META, sector = 123864, size = 32768
...

[After]

...
f2fs_submit_write_bio: dev = (8,18), WRITE_SYNC(MP), META, sector = 92168, size = 892928
f2fs_submit_write_bio: dev = (8,18), WRITE_SYNC(MP), META, sector = 93912, size = 753664
f2fs_submit_write_bio: dev = (8,18), WRITE_SYNC(MP), META, sector = 95384, size = 716800
f2fs_submit_write_bio: dev = (8,18), WRITE_SYNC(MP), META, sector = 96784, size = 712704
f2fs_submit_write_bio: dev = (8,18), WRITE_SYNC(MP), META, sector = 104160, size = 364544
f2fs_submit_write_bio: dev = (8,18), WRITE_SYNC(MP), META, sector = 104872, size = 356352
f2fs_submit_write_bio: dev = (8,18), WRITE_SYNC(MP), META, sector = 105568, size = 278528
f2fs_submit_write_bio: dev = (8,18), WRITE_SYNC(MP), META, sector = 106112, size = 319488
f2fs_submit_write_bio: dev = (8,18), WRITE_SYNC(MP), META, sector = 106736, size = 258048
f2fs_submit_write_bio: dev = (8,18), WRITE_SYNC(MP), META, sector = 107240, size = 270336
f2fs_submit_write_bio: dev = (8,18), WRITE_SYNC(MP), META, sector = 107768, size = 180224
...

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-09 16:20:57 -07:00
Jaegeuk Kim
60b99b486b f2fs: introduce a periodic checkpoint flow
This patch introduces a periodic checkpoint feature.
Note that, this is not enforcing to conduct checkpoints very strictly in terms
of trigger timing, instead just hope to help user experiences.
The default value is 60 seconds.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-09 16:20:57 -07:00
Jaegeuk Kim
5c26743474 f2fs: add a tracepoint for background gc
This patch introduces a tracepoint to monitor background gc behaviors.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-09 16:20:57 -07:00
Jaegeuk Kim
6aefd93b01 f2fs: introduce background_gc=sync mount option
This patch introduce background_gc=sync enabling synchronous cleaning in
background.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-09 16:20:57 -07:00
Chao Yu
456b88e4d1 f2fs: introduce a new ioctl F2FS_IOC_WRITE_CHECKPOINT
This patch introduce a new ioctl for those users who want to trigger
checkpoint from userspace through ioctl.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-09 16:20:56 -07:00
Chao Yu
d530d4d8e2 f2fs: support synchronous gc in ioctl
This patch drops in batches gc triggered through ioctl, since user
can easily control the gc by designing the loop around the ->ioctl.

We support synchronous gc by forcing using FG_GC in f2fs_gc, so with
it, user can make sure that in this round all blocks gced were
persistent in the device until ioctl returned.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-09 16:20:56 -07:00
Chao Yu
3342bb303b f2fs: skip searching dirty map if dirty segment is not exist
When searching victim during gc, if there are no dirty segments in
filesystem, we will still take the time to search the whole dirty segment
map, it's not needed, it's better to skip in this condition.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-09 16:20:56 -07:00