Commit graph

38263 commits

Author SHA1 Message Date
Jaegeuk Kim
c52e1b10b1 f2fs: remove redundant operation during roll-forward recovery
If same data is updated multiple times, we don't need to redo whole the
operations.
Let's just update the lastest one.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-23 11:10:17 -07:00
Jaegeuk Kim
19c9c466e5 f2fs: do not skip latest inode information
In f2fs_sync_file, if there is no written appended writes, it skips
to write its node blocks.
But, if there is up-to-date inode page, we should write it to update
its metadata during the roll-forward recovery.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-23 11:10:16 -07:00
Jaegeuk Kim
441ac5cb32 f2fs: fix roll-forward missing scenarios
We can summarize the roll forward recovery scenarios as follows.

[Term] F: fsync_mark, D: dentry_mark

1. inode(x) | CP | inode(x) | dnode(F)
-> Update the latest inode(x).

2. inode(x) | CP | inode(F) | dnode(F)
-> No problem.

3. inode(x) | CP | dnode(F) | inode(x)
-> Recover to the latest dnode(F), and drop the last inode(x)

4. inode(x) | CP | dnode(F) | inode(F)
-> No problem.

5. CP | inode(x) | dnode(F)
-> The inode(DF) was missing. Should drop this dnode(F).

6. CP | inode(DF) | dnode(F)
-> No problem.

7. CP | dnode(F) | inode(DF)
-> If f2fs_iget fails, then goto next to find inode(DF).

8. CP | dnode(F) | inode(x)
-> If f2fs_iget fails, then goto next to find inode(DF).
   But it will fail due to no inode(DF).

So, this patch adds some missing points such as #1, #5, #7, and #8.

Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-23 11:10:16 -07:00
Jaegeuk Kim
88bd02c947 f2fs: fix conditions to remain recovery information in f2fs_sync_file
This patch revisited whole the recovery information during the f2fs_sync_file.

In this patch, there are three information to make a decision.

a) IS_CHECKPOINTED,	/* is it checkpointed before? */
b) HAS_FSYNCED_INODE,	/* is the inode fsynced before? */
c) HAS_LAST_FSYNC,	/* has the latest node fsync mark? */

And, the scenarios for our rule are based on:

[Term] F: fsync_mark, D: dentry_mark

1. inode(x) | CP | inode(x) | dnode(F)
2. inode(x) | CP | inode(F) | dnode(F)
3. inode(x) | CP | dnode(F) | inode(x) | inode(F)
4. inode(x) | CP | dnode(F) | inode(F)
5. CP | inode(x) | dnode(F) | inode(DF)
6. CP | inode(DF) | dnode(F)
7. CP | dnode(F) | inode(DF)
8. CP | dnode(F) | inode(x) | inode(DF)

For example, #3, the three conditions should be changed as follows.

   inode(x) | CP | dnode(F) | inode(x) | inode(F)
a)    x       o      o          o          o
b)    x       x      x          x          o
c)    x       o      o          x          o

If f2fs_sync_file stops   ------^,
 it should write inode(F)    --------------^

So, the need_inode_block_update should return true, since
 c) get_nat_flag(e, HAS_LAST_FSYNC), is false.

For example, #8,
      CP | alloc | dnode(F) | inode(x) | inode(DF)
a)    o      x        x          x          x
b)    x               x          x          o
c)    o               o          x          o

If f2fs_sync_file stops   -------^,
 it should write inode(DF)    --------------^

Note that, the roll-forward policy should follow this rule, which means,
if there are any missing blocks, we doesn't need to recover that inode.

Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-23 11:10:15 -07:00
Jaegeuk Kim
7ef35e3b9e f2fs: introduce a flag to represent each nat entry information
This patch introduces a flag in the nat entry structure to merge various
information such as checkpointed and fsync_done marks.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-23 11:10:14 -07:00
Jaegeuk Kim
4c521f493b f2fs: use meta_inode cache to improve roll-forward speed
Previously, all the dnode pages should be read during the roll-forward recovery.
Even worsely, whole the chain was traversed twice.
This patch removes that redundant and costly read operations by using page cache
of meta_inode and readahead function as well.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-23 11:10:12 -07:00
Dave Chinner
33044dc408 Merge branch 'xfs-misc-fixes-for-3.18-2' into for-next 2014-09-23 22:55:51 +10:00
Dave Chinner
2ebff7bbd7 xfs: flush entire last page of old EOF on truncate up
On a sub-page sized filesystem, truncating a mapped region down
leaves us in a world of hurt. We truncate the pagecache, zeroing the
newly unused tail, then punch blocks out from under the page. If we
then truncate the file back up immediately, we expose that unmapped
hole to a dirty page mapped into the user application, and that's
where it all goes wrong.

In truncating the page cache, we avoid unmapping the tail page of
the cache because it still contains valid data. The problem is that
it also contains a hole after the truncate, but nobody told the mm
subsystem that. Therefore, if the page is dirty before the truncate,
we'll never get a .page_mkwrite callout after we extend the file and
the application writes data into the hole on the page.  Hence when
we come to writing that region of the page, it has no blocks and no
delayed allocation reservation and hence we toss the data away.

This patch adds code to the truncate up case to solve it, by
ensuring the partial page at the old EOF is always cleaned after we
do any zeroing and move the EOF upwards. We can't actually serialise
the page writeback and truncate against page faults (yes, that
problem AGAIN) so this is really just a best effort and assumes it
is extremely unlikely that someone is concurrently writing to the
page at the EOF while extending the file.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-23 22:55:00 +10:00
Dave Chinner
7abbb8f928 xfs: xfs_swap_extent_flush can be static
Fix sparse warning introduced by commit 4ef897a ("xfs: flush both
inodes in xfs_swap_extents").

Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-23 16:20:11 +10:00
Dave Chinner
02cc18764c xfs: xfs_buf_write_fail_rl_state can be static
Fix sparse warning introduced by commit ac8809f9 ("xfs: abort
metadata writeback on permanent errors").

Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-23 16:15:45 +10:00
Fengguang Wu
ea95961df7 xfs: xfs_rtget_summary can be static
Fix sparse warning introduced by commit afabfd3 ("xfs: combine
xfs_rtmodify_summary and xfs_rtget_summary").

Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-23 16:11:43 +10:00
Fabian Frederick
e3cf17962a xfs: remove second xfs_quota.h inclusion in xfs_icache.c
xfs_quota.h was included twice.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-23 16:05:55 +10:00
Eric Sandeen
fb04013156 xfs: don't ASSERT on corrupt ftype
xfs_dir3_data_get_ftype() gets the file type off disk, but ASSERTs
if it's invalid:

     ASSERT(type < XFS_DIR3_FT_MAX);

We shouldn't ASSERT on bad values read from disk.  V3 dirs are
CRC-protected, but V2 dirs + ftype are not.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-23 16:05:32 +10:00
Dave Chinner
8af3dcd3c8 xfs: xlog_cil_force_lsn doesn't always wait correctly
When running a tight mount/unmount loop on an older kernel, RedHat
QE found that unmount would occasionally hang in
xfs_buf_unpin_wait() on the superblock buffer. Tracing and other
debug work by Eric Sandeen indicated that it was hanging on the
writing of the superblock during unmount immediately after logging
the superblock counters in a synchronous transaction. Further debug
indicated that the synchronous transaction was not waiting for
completion correctly, and we narrowed it down to
xlog_cil_force_lsn() returning NULLCOMMITLSN and hence not pushing
the transaction in the iclog buffer to disk correctly.

While this unmount superblock write code is now very different in
mainline kernels, the xlog_cil_force_lsn() code is identical, and it
was bisected to the backport of commit f876e44 ("xfs: always do log
forces via the workqueue"). This commit made the CIL push
asynchronous for log forces and hence exposed a race condition that
couldn't occur on a synchronous push.

Essentially, the xlog_cil_force_lsn() relied implicitly on the fact
that the sequence push would be complete by the time
xlog_cil_push_now() returned, resulting in the context being pushed
being in the committing list. When it was made asynchronous, it was
recognised that there was a race condition in detecting whether an
asynchronous push has started or not and code was added to handle
it.

Unfortunately, the fix was not quite right and left a race condition
where it it would detect an empty CIL while a push was in progress
before the context had been added to the committing list. This was
incorrectly seen as a "nothing to do" condition and so would tell
xfs_log_force_lsn() that there is nothing to wait for, and hence it
would push the iclogbufs in memory.

The fix is simple, but explaining the logic and the race condition
is a lot more complex. The fix is to add the context to the
committing list before we start emptying the CIL. This allows us to
detect the difference between an empty "do nothing" push and a push
that has not started by adding a discrete "emptying the CIL" state
to avoid the transient, incorrect "empty" condition that the
(unchanged) waiting code was seeing.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-23 15:57:59 +10:00
Dave Chinner
f6d31f4b04 Merge branch 'xfs-shift-extents-rework' into for-next 2014-09-23 15:51:14 +10:00
Brian Foster
8b5279e33f xfs: only writeback and truncate pages for the freed range
xfs_free_file_space() only affects the range of the file for which space
is being freed. It currently writes and truncates the page cache from
the start offset of the free to EOF.

Modify xfs_free_file_space() to write back and truncate page cache of
just the range being freed.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-23 15:39:05 +10:00
Brian Foster
f71721d061 xfs: writeback and inval. file range to be shifted by collapse
The collapse range operation currently writes the entire file before
starting the collapse to avoid changes in the in-core extent list due to
writeback causing the extent count to change. Now that collapse range is
fsb based rather than extent index based it can sustain changes in the
extent list during the shift sequence without disruption.

Modify xfs_collapse_file_space() to writeback and invalidate pages
associated with the range of the file to be shifted.
xfs_free_file_space() currently has similar behavior, but the space free
need only affect the region of the file that is freed and this could
change in the future.

Also update the comments to reflect the current implementation. We
retain the eofblocks trim permanently as a best option for dealing with
delalloc extents. We don't shift delalloc extents because this scenario
only occurs with post-eof preallocation (since data must be flushed such
that the cache can be invalidated and data can be shifted). That means
said space must also be initialized before being shifted into the
accessible region of the file only to be immediately truncated off as
the last part of the collapse. In other words, the eofblocks trim will
happen anyways, we just run it first to ensure the file remains in a
consistent state throughout the collapse.

Finally, detect and fail explicitly in the event of a delalloc extent
during the extent shift. The implementation does not support delalloc
extents and the caller is expected to prevent this scenario in advance
as is done by collapse.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-23 15:39:05 +10:00
Brian Foster
a979bdfea1 xfs: refactor single extent shift into xfs_bmse_shift_one() helper
xfs_bmap_shift_extents() has a variety of conditions and error checks
that make the logic difficult to follow and indent heavy. Refactor the
loop body of this function into a new xfs_bmse_shift_one() helper. This
simplifies the error checks, eliminates index decrement on merge hack by
pushing the index increment down into the helper, and makes the code
more readable by reducing multiple levels of indentation.

This is a code refactor only. The behavior of extent shift and collapse
range is not modified.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-23 15:39:04 +10:00
Brian Foster
ddb19e3180 xfs: refactor shift-by-merge into xfs_bmse_merge() helper
The extent shift mechanism in xfs_bmap_shift_extents() is complicated
and handles several different, non-deterministic scenarios. These
include extent shifts, extent merges and potential btree updates in
either of the former scenarios.

Refactor the code to be more linear and readable. The loop logic in
xfs_bmap_shift_extents() and some initial error checking is adjusted
slightly. The associated btree lookup and update/delete operations are
condensed into single blocks of code. This reduces the number of
btree-specific blocks and facilitates the separation of the merge
operation into a new xfs_bmse_merge() and xfs_bmse_can_merge() helpers.

This is a code refactor only. The behavior of extent shift and collapse
range is not modified.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-23 15:38:09 +10:00
Brian Foster
2c845f5a5f xfs: track collapse via file offset rather than extent index
The collapse range implementation uses a transaction per extent shift.
The progress of the overall operation is tracked via the current extent
index of the in-core extent list. This is racy because the ilock must be
dropped and reacquired for each transaction according to locking and log
reservation rules. Therefore, writeback to prior regions of the file is
possible and can change the extent count. This changes the extent to
which the current index refers and causes the collapse to fail mid
operation. To avoid this problem, the entire file is currently written
back before the collapse operation starts.

To eliminate the need to flush the entire file, use the file offset
(fsb) to track the progress of the overall extent shift operation rather
than the extent index. Modify xfs_bmap_shift_extents() to
unconditionally convert the start_fsb parameter to an extent index and
return the file offset of the extent where the shift left off, if
further extents exist. The bulk of ths function can remain based on
extent index as ilock is held by the caller. xfs_collapse_file_space()
now uses the fsb output as the starting point for the subsequent shift.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-23 15:37:09 +10:00
Dave Chinner
0d085a529b xfs: ensure WB_SYNC_ALL writeback handles partial pages correctly
XFS has been having trouble with stray delayed allocation extents
beyond EOF for a long time. Recent changes to the collapse range
code has triggered erroneous EBUSY errors on page invalidtion for
block size smaller than page size filesystems. These
have been caused by dirty buffers beyond EOF on a partial page which
do not get written to disk during a sync.

The issue is that write-ahead in xfs_cluster_write() finds such a
partial page and handles it by leaving the page dirty but pushing it
into a writeback state. This used to work just fine, as the
write_cache_pages() code would then find the dirty partial page in
the next mapping tree lookup as the dirty tag is still set.

Unfortunately, when we moved to a mark and sweep approach to
writeback to fix other writeback sync issues, we broken this. THe
act of marking the page as under writeback now clears the TOWRITE
tag in the radix tree, even though the page is still dirty. This
causes the TOWRITE tag to be cleared, and hence the next lookup on
the mapping tree does not find the dirty partial page and so doesn't
try to write it again.

This same writeback bug was found recently in ext4 and fixed in
commit 1c8349a ("ext4: fix data integrity sync in ordered mode")
without communication to the wider filesystem community. We can use
exactly the same fix here so the TOWRITE flag is not cleared on
partial page writes.

cc: stable@vger.kernel.org # dependent on 1c8349a171
Root-cause-found-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-23 15:36:27 +10:00
Ingo Molnar
6273143359 Merge branch 'rcu/next' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull the v3.18 RCU changes from Paul E. McKenney:

"
  * Update RCU documentation.  These were posted to LKML at
    https://lkml.org/lkml/2014/8/28/378.

  * Miscellaneous fixes.  These were posted to LKML at
    https://lkml.org/lkml/2014/8/28/386.  An additional fix that
    eliminates a documented (but now inconvenient) deadlock between
    RCU hotplug and expedited grace periods was posted at
    https://lkml.org/lkml/2014/8/28/573.

  * Changes related to No-CBs CPUs and NO_HZ_FULL.  These were posted
    to LKML at https://lkml.org/lkml/2014/8/28/412.

  * Torture-test updates.  These were posted to LKML at
    https://lkml.org/lkml/2014/8/28/546 and at
    https://lkml.org/lkml/2014/9/11/1114.

  * RCU-tasks implementation.  These were posted to LKML at
    https://lkml.org/lkml/2014/8/28/540.
"

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-23 07:21:42 +02:00
Linus Torvalds
e2519c2c85 FS-Cache fixes
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIVAwUAVB/u8ROxKuMESys7AQJGxA/9Ep4IwIAclMYcVQtk4zzvz+A9jVw0pu+X
 aItU70GcGoxehGlOxyqKXp/dg3CZ7ZPcxmQqzl7nzCZY28or5z/Nu6Y/YII+V0NB
 f6vgHPhfGO7CSSdwlSzFY/6DdU/EXhjA3X4suowRPIO3B/2ydLxDe3sahIWY/AhM
 4h3ioDR+A4ZZvU8fmw62l9Iae46mkk7KT8Am/2bUJQpENu4caVEtNtyJ2IsiPtZ8
 pddgvPL/f+piz4Ufdmc/XH4KB8kiLwU+7t3xfoZnSzYz9Vzcq4sQVvSxqNSaYzSC
 REEnELuLEweIh57PP67giZGPE6MA50kISH/soc43yl/oWj9i3cqkNb2JW+nxpvfS
 pIHfgRmzqUPe9MrmVUtle9c1HhjAyts3y27JB4onpLghN96JgIcTWJ1vZxnxbXzp
 ADVCgc3z8nhttLQMU+NsfD4fHUvj+bxULW1bSPdB5DfSrkpqmQJZBRexeFZWGyA9
 p1p9Yaty2Sdq/RPmtIfMPA3cu1D9LG4wkvGPmmEAjnIlKebRBjriKDmCdwf9okWA
 m+fAbuBnTjc+R0ERE6Lh5OI72hbwcgyAyDx5RdWkXVIPMBuBK67nR6nJK7vPW4x0
 qK0B5Pk2iwobKQMicVSxI+GImeeiQLi798KHJwN4uFBy74rIISWilIsylBrxlxK8
 s2hdI83XJJI=
 =ojl4
 -----END PGP SIGNATURE-----

Merge tag 'fscache-fixes-20140917' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

Pull fs-cache fixes from David Howells:

 - Put a timeout in releasepage() to deal with a recursive hang between
   the memory allocator, writeback, ext4 and fscache under memory
   pressure.

 - Fix a pair of refcount bugs in the fscache error handling.

 - Remove a couple of unused pagevecs.

 - The cachefiles requirement that the base directory support rename
   should permit rename2 as an alternative - otherwise certain
   filesystems cannot now be used as backing stores (such as ext4).

* tag 'fscache-fixes-20140917' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
  CacheFiles: Handle rename2
  cachefiles: remove two unused pagevecs.
  FS-Cache: refcount becomes corrupt under vma pressure.
  FS-Cache: Reduce cookie ref count if submit fails.
  FS-Cache: Timeout for releasepage()
2014-09-22 17:52:16 -07:00
Josef Bacik
1d52c78afb Btrfs: try not to ENOSPC on log replay
When doing log replay we may have to update inodes, which traditionally goes
through our delayed inode stuff.  This will try to move space over from the
trans handle, but we don't reserve space in our trans handle on replay since we
don't know how much we will need, so instead we try to flush.  But because we
have a trans handle open we won't flush anything, so if we are out of reserve
space we will simply return ENOSPC.  Since we know that if an operation made it
into the log then we definitely had space before the box bought the farm then we
don't need to worry about doing this space reservation.  Use the
fs_info->log_root_recovering flag to skip the delayed inode stuff and update the
item directly.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-22 17:13:36 -07:00
Josef Bacik
f6acfd5011 Btrfs: don't do async reclaim during log replay
Trying to reproduce a log enospc bug I hit a panic in the async reclaim code
during log replay.  This is because we use fs_info->fs_root as our root for
shrinking and such.  Technically we can use whatever root we want, but let's
just not allow async reclaim while we're doing log replay.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-22 17:13:31 -07:00
Josef Bacik
47ab2a6c68 Btrfs: remove empty block groups automatically
One problem that has plagued us is that a user will use up all of his space with
data, remove a bunch of that data, and then try to create a bunch of small files
and run out of space.  This happens because all the chunks were allocated for
data since the metadata requirements were so low.  But now there's a bunch of
empty data block groups and not enough metadata space to do anything.  This
patch solves this problem by automatically deleting empty block groups.  If we
notice the used count go down to 0 when deleting or on mount notice that a block
group has a used count of 0 then we will queue it to be deleted.

When the cleaner thread runs we will double check to make sure the block group
is still empty and then we will delete it.  This patch has the side effect of no
longer having a bunch of BUG_ON()'s in the chunk delete code, which will be
helpful for both this and relocate.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-22 17:13:21 -07:00
Jens Axboe
6d11fb454b Merge branch 'for-linus' into for-3.18/core
Moving patches from for-linus to 3.18 instead, pull in this changes
that will go to Linus today.
2014-09-22 11:57:32 -06:00
Anton Altaparmakov
f2d5a94436 Fix nasty 32-bit overflow bug in buffer i/o code.
On 32-bit architectures, the legacy buffer_head functions are not always
handling the sector number with the proper 64-bit types, and will thus
fail on 4TB+ disks.

Any code that uses __getblk() (and thus bread(), breadahead(),
sb_bread(), sb_breadahead(), sb_getblk()), and calls it using a 64-bit
block on a 32-bit arch (where "long" is 32-bit) causes an inifinite loop
in __getblk_slow() with an infinite stream of errors logged to dmesg
like this:

  __find_get_block_slow() failed. block=6740375944, b_blocknr=2445408648
  b_state=0x00000020, b_size=512
  device sda1 blocksize: 512

Note how in hex block is 0x191C1F988 and b_blocknr is 0x91C1F988 i.e. the
top 32-bits are missing (in this case the 0x1 at the top).

This is because grow_dev_page() is broken and has a 32-bit overflow due
to shifting the page index value (a pgoff_t - which is just 32 bits on
32-bit architectures) left-shifted as the block number.  But the top
bits to get lost as the pgoff_t is not type cast to sector_t / 64-bit
before the shift.

This patch fixes this issue by type casting "index" to sector_t before
doing the left shift.

Note this is not a theoretical bug but has been seen in the field on a
4TiB hard drive with logical sector size 512 bytes.

This patch has been verified to fix the infinite loop problem on 3.17-rc5
kernel using a 4TB disk image mounted using "-o loop".  Without this patch
doing a "find /nt" where /nt is an NTFS volume causes the inifinite loop
100% reproducibly whilst with the patch it works fine as expected.

Signed-off-by: Anton Altaparmakov <aia21@cantab.net>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-09-22 08:41:16 -07:00
Trond Myklebust
5466112f09 pnfs/blocklayout: Fix a 64-bit division/remainder issue in bl_map_stripe
kbuild test robot reports:

   fs/built-in.o: In function `bl_map_stripe':
   >> :(.text+0x965b4): undefined reference to `__aeabi_uldivmod'
   >> :(.text+0x965cc): undefined reference to `__aeabi_uldivmod'
   >> :(.text+0x96604): undefined reference to `__aeabi_uldivmod'

Fixes: 5c83746a0c (pnfs/blocklayout: in-kernel GETDEVICEINFO XDR parsing)
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-21 14:20:20 -04:00
Linus Torvalds
46be7b73e8 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "I've got a revert to fix a regression with btrfs device registration,
  and Filipe has part two of his fsync fix from last week"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Revert "Btrfs: device_list_add() should not update list when mounted"
  Btrfs: set inode's logged_trans/last_log_commit after ranged fsync
2014-09-19 13:10:53 -07:00
Linus Torvalds
81770f4144 NFS client fixes for 3.17
Highligts:
 - Fix an Oops in nfs4_open_and_get_state
 - Fix an Oops in the nfs4_state_manager
 - Fix another bug in the close/open_downgrade code
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJUHIRLAAoJEGcL54qWCgDykZkP/jHDs/0HcK3x8jW+zbxKP6tf
 xfyhJGySwnTo2v0UPD1pETtQke9bWnm38RVl04wf2H4Gb7jR/BoDZ5J1C7956vuN
 FFXt9lcnTj2Cijn/8wz2S9GneY/mjsWf9OP7NUM3O6DgxORhdoviOnYOAzqzEXjG
 ylqTP/3FVglDbawKaLy3ubI0dteNxOu9U4gLveP617Ysd8h4s5XsYHPYKOOltybS
 HhVNf/3EdoD3lms67Zj7yPl7PtdDhNKFrS32nhnfdLLgsMiwTyb9ZYaFpK2XcD9v
 KDKblibH/wpQCsnReB66dKBR8P4ktTvXM1ovkb7LFUZD5tsOcb1Bp5ROzHXUSmiI
 sXh5Ueue0FPKExU5WFKROE43+G5KOJG5pB2RwgugsqVlZjFhGotZrIle17Zuqxz0
 kVR+vGZ50O/nLQ+EoRhDRRbDBrUMT7/xxHDSPQ6d4HK2hNTbrXuSXcoe8/BvbSTt
 JXQCdbWDPZ5oR6z8RoBN1xHhJvXC3Y2w/d7ZzOpl3yLzsKpJ7K4tys4Z29iv3ut6
 ziRS1AvJvedwSK73fWTs+zEHKm+pFMqq2U+DncvWWOWOVpIv6eKRlY9O8enP7IeW
 qNHj4UVYnr9w4oAhvk2WJt1TZrhzBX9NhMjHSxUCSOs5v/YeiBjPTTy40N4O0Y6Z
 DwKwDNxZq49ILEznntsd
 =EOW6
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-3.17-5' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client fixes from Trond Myklebust:
 "Highligts:
   - fix an Oops in nfs4_open_and_get_state
   - fix an Oops in the nfs4_state_manager
   - fix another bug in the close/open_downgrade code"

* tag 'nfs-for-3.17-5' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  NFSv4: Fix another bug in the close/open_downgrade code
  NFSv4: nfs4_state_manager() vs. nfs_server_remove_lists()
  NFS: remove BUG possibility in nfs4_open_and_get_state
2014-09-19 13:07:49 -07:00
Richard Weinberger
d577bc104f UBIFS: Remove bogus assert
This assertion was only correct before UBIFS had xattr support.
Now with xattr support also a directory node can carry data
and can act as host node.

Suggested-by: Artem Bityutskiy <dedekind1@gmail.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
2014-09-19 18:11:50 +03:00
Filipe Manana
8407f55326 Btrfs: fix data corruption after fast fsync and writeback error
When we do a fast fsync, we start all ordered operations and then while
they're running in parallel we visit the list of modified extent maps
and construct their matching file extent items and write them to the
log btree. After that, in btrfs_sync_log() we wait for all the ordered
operations to finish (via btrfs_wait_logged_extents).

The problem with this is that we were completely ignoring errors that
can happen in the extent write path, such as -ENOSPC, a temporary -ENOMEM
or -EIO errors for example. When such error happens, it means we have parts
of the on disk extent that weren't written to, and so we end up logging
file extent items that point to these extents that contain garbage/random
data - so after a crash/reboot plus log replay, we get our inode's metadata
pointing to those extents.

This worked in contrast with the full (non-fast) fsync path, where we
start all ordered operations, wait for them to finish and then write
to the log btree. In this path, after each ordered operation completes
we check if it's flagged with an error (BTRFS_ORDERED_IOERR) and return
-EIO if so (via btrfs_wait_ordered_range).

So if an error happens with any ordered operation, just return a -EIO
error to userspace, so that it knows that not all of its previous writes
were durably persisted and the application can take proper action (like
redo the writes for e.g.) - and definitely not leave any file extent items
in the log refer to non fully written extents.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-19 06:57:51 -07:00
Filipe Manana
669249eea3 Btrfs: fix fsync race leading to invalid data after log replay
When the fsync callback (btrfs_sync_file) starts, it first waits for
the writeback of any dirty pages to start and finish without holding
the inode's mutex (to reduce contention). After this it acquires the
inode's mutex and repeats that process via btrfs_wait_ordered_range
only if we're doing a full sync (BTRFS_INODE_NEEDS_FULL_SYNC flag
is set on the inode).

This is not safe for a non full sync - we need to start and wait for
writeback to finish for any pages that might have been made dirty
before acquiring the inode's mutex and after that first step mentioned
before. Why this is needed is explained by the following comment added
to btrfs_sync_file:

  "Right before acquiring the inode's mutex, we might have new
   writes dirtying pages, which won't immediately start the
   respective ordered operations - that is done through the
   fill_delalloc callbacks invoked from the writepage and
   writepages address space operations. So make sure we start
   all ordered operations before starting to log our inode. Not
   doing this means that while logging the inode, writeback
   could start and invoke writepage/writepages, which would call
   the fill_delalloc callbacks (cow_file_range,
   submit_compressed_extents). These callbacks add first an
   extent map to the modified list of extents and then create
   the respective ordered operation, which means in
   tree-log.c:btrfs_log_inode() we might capture all existing
   ordered operations (with btrfs_get_logged_extents()) before
   the fill_delalloc callback adds its ordered operation, and by
   the time we visit the modified list of extent maps (with
   btrfs_log_changed_extents()), we see and process the extent
   map they created. We then use the extent map to construct a
   file extent item for logging without waiting for the
   respective ordered operation to finish - this file extent
   item points to a disk location that might not have yet been
   written to, containing random data - so after a crash a log
   replay will make our inode have file extent items that point
   to disk locations containing invalid data, as we returned
   success to userspace without waiting for the respective
   ordered operation to finish, because it wasn't captured by
   btrfs_get_logged_extents()."

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-19 06:57:50 -07:00
Kirill Tkhai
f139caf2e8 sched, cleanup, treewide: Remove set_current_state(TASK_RUNNING) after schedule()
schedule(), io_schedule() and schedule_timeout() always return
with TASK_RUNNING state set, so one more setting is unnecessary.

(All places in patch are visible good, only exception is
 kiblnd_scheduler() from:

      drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c

 Its schedule() is one line above standard 3 lines of unified diff)

No places where set_current_state() is used for mb().

Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1410529254.3569.23.camel@tkhai
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Anil Belur <askb23@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Dave Kleikamp <shaggy@kernel.org>
Cc: David Airlie <airlied@linux.ie>
Cc: David Howells <dhowells@redhat.com>
Cc: Dmitry Eremin <dmitry.eremin@intel.com>
Cc: Frank Blaschka <blaschka@linux.vnet.ibm.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Isaac Huang <he.huang@intel.com>
Cc: James E.J. Bottomley <JBottomley@parallels.com>
Cc: James E.J. Bottomley <jejb@parisc-linux.org>
Cc: J. Bruce Fields <bfields@fieldses.org>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Jesper Nilsson <jesper.nilsson@axis.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Laura Abbott <lauraa@codeaurora.org>
Cc: Liang Zhen <liang.zhen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Masaru Nomura <massa.nomura@gmail.com>
Cc: Michael Opdenacker <michael.opdenacker@free-electrons.com>
Cc: Mikael Starvik <starvik@axis.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Oleg Drokin <green@linuxhacker.ru>
Cc: Peng Tao <bergwolf@gmail.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Robert Love <robert.w.love@intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: Ursula Braun <ursula.braun@de.ibm.com>
Cc: Zi Shen Lim <zlim.lnx@gmail.com>
Cc: devel@driverdev.osuosl.org
Cc: dm-devel@redhat.com
Cc: dri-devel@lists.freedesktop.org
Cc: fcoe-devel@open-fcoe.org
Cc: jfs-discussion@lists.sourceforge.net
Cc: linux390@de.ibm.com
Cc: linux-afs@lists.infradead.org
Cc: linux-cris-kernel@axis.com
Cc: linux-kernel@vger.kernel.org
Cc: linux-nfs@vger.kernel.org
Cc: linux-parisc@vger.kernel.org
Cc: linux-raid@vger.kernel.org
Cc: linux-s390@vger.kernel.org
Cc: linux-scsi@vger.kernel.org
Cc: qla2xxx-upstream@qlogic.com
Cc: user-mode-linux-devel@lists.sourceforge.net
Cc: user-mode-linux-user@lists.sourceforge.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-19 12:35:17 +02:00
Abhi Das
00a158be83 GFS2: fix bad inode i_goal values during block allocation
This patch checks if i_goal is either zero or if doesn't exist
within any rgrp (i.e gfs2_blk2rgrpd() returns NULL). If so, it
assigns the ip->i_no_addr block as the i_goal.

There are two scenarios where a bad i_goal can result in a
-EBADSLT error.

1. Attempting to allocate to an existing inode:
Control reaches gfs2_inplace_reserve() and ip->i_goal is bad.
We need to fix i_goal here.

2. A new inode is created in a directory whose i_goal is hosed:
In this case, the parent dir's i_goal is copied onto the new
inode. Since the new inode is not yet created, the ip->i_no_addr
field is invalid and so, the fix in gfs2_inplace_reserve() as per
1) won't work in this scenario. We need to catch and fix it sooner
in the parent dir itself (gfs2_create_inode()), before it is
copied to the new inode.

Signed-off-by: Abhi Das <adas@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-09-19 10:45:18 +01:00
Linus Torvalds
d9773ceabf Merge branch 'for-linus' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs/smb3 fixes from Steve French:
 "Fixes for problems found during testing and debugging at the SMB3
  storage test event (plugfest) this week"

* 'for-linus' of git://git.samba.org/sfrench/cifs-2.6:
  Fix mfsymlinks file size check
  Update version number displayed by modinfo for cifs.ko
  cifs: remove dead code
  Revert "cifs: No need to send SIGKILL to demux_thread during umount"
  [SMB3] Fix oops when creating symlinks on smb3
  [CIFS] Fix setting time before epoch (negative time values)
2014-09-18 11:10:35 -07:00
Zefan Li
52de4779f2 cpuset: simplify proc_cpuset_show()
Use the ONE macro instead of REG, and we can simplify proc_cpuset_show().

Signed-off-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-09-18 13:27:23 -04:00
Zefan Li
006f4ac497 cgroup: simplify proc_cgroup_show()
Use the ONE macro instead of REG, and we can simplify proc_cgroup_show().

Signed-off-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-09-18 13:27:23 -04:00
Trond Myklebust
cd9288ffae NFSv4: Fix another bug in the close/open_downgrade code
James Drew reports another bug whereby the NFS client is now sending
an OPEN_DOWNGRADE in a situation where it should really have sent a
CLOSE: the client is opening the file for O_RDWR, but then trying to
do a downgrade to O_RDONLY, which is not allowed by the NFSv4 spec.

Reported-by: James Drews <drews@engr.wisc.edu>
Link: http://lkml.kernel.org/r/541AD7E5.8020409@engr.wisc.edu
Fixes: aee7af356e (NFSv4: Fix problems with close in the presence...)
Cc: stable@vger.kernel.org # 2.6.33+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-18 13:04:22 -04:00
Steve Dickson
080af20cc9 NFSv4: nfs4_state_manager() vs. nfs_server_remove_lists()
There is a race between nfs4_state_manager() and
nfs_server_remove_lists() that happens during a nfsv3 mount.

The v3 mount notices there is already a supper block so
nfs_server_remove_lists() called which uses the nfs_client_lock
spin lock to synchronize access to the client list.

At the same time nfs4_state_manager() is running through
the client list looking for work to do, using the same
lock. When nfs4_state_manager() wins the race to the
list, a v3 client pointer is found and not ignored
properly which causes the panic.

Moving some protocol checks before the state checking
avoids the panic.

CC: Stable Tree <stable@vger.kernel.org>
Signed-off-by: Steve Dickson <steved@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-18 13:04:21 -04:00
Chris Mason
0f23ae74f5 Revert "Btrfs: device_list_add() should not update list when mounted"
This reverts commit b96de000bc.

This commit is triggering failures to mount by subvolume id in some
configurations.  The main problem is how many different ways this
scanning function is used, both for scanning while mounted and
unmounted.  A proper cleanup is too big for late rcs.

For now, just revert the commit and we'll put a better fix into a later
merge window.

Signed-off-by: Chris Mason <clm@fb.com>
2014-09-18 07:49:05 -07:00
Qu Wenruo
e6c4efd87a btrfs: Fix and enhance merge_extent_mapping() to insert best fitted extent map
The following commit enhanced the merge_extent_mapping() to reduce
fragment in extent map tree, but it can't handle case which existing
lies before map_start:
51f39 btrfs: Use right extent length when inserting overlap extent map.

[BUG]
When existing extent map's start is before map_start,
the em->len will be minus, which will corrupt the extent map and fail to
insert the new extent map.
This will happen when someone get a large extent map, but when it is
going to insert it into extent map tree, some one has already commit
some write and split the huge extent into small parts.

[REPRODUCER]
It is very easy to tiger using filebench with randomrw personality.
It is about 100% to reproduce when using 8G preallocated file in 60s
randonrw test.

[FIX]
This patch can now handle any existing extent position.
Since it does not directly use existing->start, now it will find the
previous and next extent around map_start.
So the old existing->start < map_start bug will never happen again.

[ENHANCE]
This patch will insert the best fitted extent map into extent map tree,
other than the oldest [map_start, map_start + sectorsize) or the
relatively newer but not perfect [map_start, existing->start).

The patch will first search existing extent that does not intersects with
the desired map range [map_start, map_start + len).
The existing extent will be either before or behind map_start, and based
on the existing extent, we can find out the previous and next extent
around map_start.

So the best fitted extent would be [prev->end, next->start).
For prev or next is not found, em->start would be prev->end and em->end
wold be next->start.

With this patch, the fragment in extent map tree should be reduced much
more than the 51f39 commit and reduce an unneeded extent map tree search.

Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-18 07:14:46 -07:00
David Howells
e2cf1f1cc7 CacheFiles: Handle rename2
Not all filesystems now provide the rename i_op - ext4 for one - but rather
provide the rename2 i_op.  CacheFiles checks that the filesystem has rename
and so will reject ext4 now with EPERM:

	CacheFiles: Failed to register: -1

Fix this by checking for rename2 as an alternative.  The call to vfs_rename()
actually handles selection of the appropriate function, so we needn't worry
about that.

Turning on debugging shows:

	[cachef] ==> cachefiles_get_directory(,,cache)
	[cachef] subdir -> ffff88000b22b778 positive
	[cachef] <== cachefiles_get_directory() = -1 [check]

where -1 is EPERM.

Signed-off-by: David Howells <dhowells@redhat.com>
2014-09-17 23:29:53 +01:00
NeilBrown
696382f938 cachefiles: remove two unused pagevecs.
These two have been unused since

commit c4d6d8dbf3
    CacheFiles: Fix the marking of cached pages

in 3.8.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: David Howells <dhowells@redhat.com>
2014-09-17 23:29:50 +01:00
Milosz Tanski
3e1199dcad FS-Cache: refcount becomes corrupt under vma pressure.
In rare cases under heavy VMA pressure the ref count for a fscache cookie
becomes corrupt. In this case we decrement ref count even if we fail before
incrementing the refcount.

FS-Cache: Assertion failed bnode-eca5f9c6/syslog
0 > 0 is false
------------[ cut here ]------------
kernel BUG at fs/fscache/cookie.c:519!
invalid opcode: 0000 [#1] SMP
Call Trace:
[<ffffffffa01ba060>] __fscache_relinquish_cookie+0x50/0x220 [fscache]
[<ffffffffa02d64ce>] ceph_fscache_unregister_inode_cookie+0x3e/0x50 [ceph]
[<ffffffffa02ae1d3>] ceph_destroy_inode+0x33/0x200 [ceph]
[<ffffffff811cf67e>] ? __fsnotify_inode_delete+0xe/0x10
[<ffffffff811a9e0c>] destroy_inode+0x3c/0x70
[<ffffffff811a9f51>] evict+0x111/0x180
[<ffffffff811aa763>] iput+0x103/0x190
[<ffffffff811a5de8>] __dentry_kill+0x1c8/0x220
[<ffffffff811a5f31>] shrink_dentry_list+0xf1/0x250
[<ffffffff811a762c>] prune_dcache_sb+0x4c/0x60
[<ffffffff811930af>] super_cache_scan+0xff/0x170
[<ffffffff8113d7a0>] shrink_slab_node+0x140/0x2c0
[<ffffffff8113f2da>] shrink_slab+0x8a/0x130
[<ffffffff81142572>] balance_pgdat+0x3e2/0x5d0
[<ffffffff811428ca>] kswapd+0x16a/0x4a0
[<ffffffff810a43f0>] ? __wake_up_sync+0x20/0x20
[<ffffffff81142760>] ? balance_pgdat+0x5d0/0x5d0
[<ffffffff81083e09>] kthread+0xc9/0xe0
[<ffffffff81010000>] ? ftrace_raw_event_xen_mmu_release_ptpage+0x70/0x90
[<ffffffff81083d40>] ? flush_kthread_worker+0xb0/0xb0
[<ffffffff8159f63c>] ret_from_fork+0x7c/0xb0
[<ffffffff81083d40>] ? flush_kthread_worker+0xb0/0xb0
RIP [<ffffffffa01b984b>] __fscache_disable_cookie+0x1db/0x210 [fscache]
RSP <ffff8803bc85f9b8>
---[ end trace 254d0d7c74a01f25 ]---

Signed-off-by: Milosz Tanski <milosz@adfin.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2014-09-17 22:41:40 +01:00
Liu Bo
4d1a40c66b Btrfs: fix up bounds checking in lseek
An user reported this, it is because that lseek's SEEK_SET/SEEK_CUR/SEEK_END
allow a negative value for @offset, but btrfs's SEEK_DATA/SEEK_HOLE don't
prepare for that and convert the negative @offset into unsigned type,
so we get (end < start) warning.

[ 1269.835374] ------------[ cut here ]------------
[ 1269.836809] WARNING: CPU: 0 PID: 1241 at fs/btrfs/extent_io.c:430 insert_state+0x11d/0x140()
[ 1269.838816] BTRFS: end < start 4094 18446744073709551615
[ 1269.840334] CPU: 0 PID: 1241 Comm: a.out Tainted: G        W      3.16.0+ #306
[ 1269.858229] Call Trace:
[ 1269.858612]  [<ffffffff81801a69>] dump_stack+0x4e/0x68
[ 1269.858952]  [<ffffffff8107894c>] warn_slowpath_common+0x8c/0xc0
[ 1269.859416]  [<ffffffff81078a36>] warn_slowpath_fmt+0x46/0x50
[ 1269.859929]  [<ffffffff813b0fbd>] insert_state+0x11d/0x140
[ 1269.860409]  [<ffffffff813b1396>] __set_extent_bit+0x3b6/0x4e0
[ 1269.860805]  [<ffffffff813b21c7>] lock_extent_bits+0x87/0x200
[ 1269.861697]  [<ffffffff813a5b28>] btrfs_file_llseek+0x148/0x2a0
[ 1269.862168]  [<ffffffff811f201e>] SyS_lseek+0xae/0xc0
[ 1269.862620]  [<ffffffff8180b212>] system_call_fastpath+0x16/0x1b
[ 1269.862970] ---[ end trace 4d33ea885832054b ]---

This assumes that btrfs starts finding DATA/HOLE from the beginning of file
if the assigned @offset is negative.

Also we add alignment for lock_extent_bits 's range.

Reported-by: Toralf Förster <toralf.foerster@gmx.de>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:46:30 -07:00
Miao Xie
f612496bca Btrfs: cleanup the read failure record after write or when the inode is freeing
After the data is written successfully, we should cleanup the read failure record
in that range because
- If we set data COW for the file, the range that the failure record pointed to is
  mapped to a new place, so it is invalid.
- If we set no data COW for the file, and if there is no error during writting,
  the corrupted data is corrected, so the failure record can be removed. And if
  some errors happen on the mirrors, we also needn't worry about it because the
  failure record will be recreated if we read the same place again.

Sometimes, we may fail to correct the data, so the failure records will be left
in the tree, we need free them when we free the inode or the memory leak happens.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:39:02 -07:00
Miao Xie
8b110e393c Btrfs: implement repair function when direct read fails
This patch implement data repair function when direct read fails.

The detail of the implementation is:
- When we find the data is not right, we try to read the data from the other
  mirror.
- When the io on the mirror ends, we will insert the endio work into the
  dedicated btrfs workqueue, not common read endio workqueue, because the
  original endio work is still blocked in the btrfs endio workqueue, if we
  insert the endio work of the io on the mirror into that workqueue, deadlock
  would happen.
- After we get right data, we write it back to the corrupted mirror.
- And if the data on the new mirror is still corrupted, we will try next
  mirror until we read right data or all the mirrors are traversed.
- After the above work, we set the uptodate flag according to the result.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:39:01 -07:00
Miao Xie
28e1cc7d1b Btrfs: Set real mirror number for read operation on RAID0/5/6
We need real mirror number for RAID0/5/6 when reading data, or if read error
happens, we would pass 0 as the number of the mirror on which the io error
happens. It is wrong and would cause the filesystem read the data from the
corrupted mirror again.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:39:00 -07:00