Commit graph

27049 commits

Author SHA1 Message Date
Christoph Hellwig
a8569171ba xfs: remove some obsolete comments in xfs_trans_ail.c
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2012-05-14 16:20:32 -05:00
Christoph Hellwig
43ff2122e6 xfs: on-stack delayed write buffer lists
Queue delwri buffers on a local on-stack list instead of a per-buftarg one,
and write back the buffers per-process instead of by waking up xfsbufd.

This is now easily doable given that we have very few places left that write
delwri buffers:

 - log recovery:
	Only done at mount time, and already forcing out the buffers
	synchronously using xfs_flush_buftarg

 - quotacheck:
	Same story.

 - dquot reclaim:
	Writes out dirty dquots on the LRU under memory pressure.  We might
	want to look into doing more of this via xfsaild, but it's already
	more optimal than the synchronous inode reclaim that writes each
	buffer synchronously.

 - xfsaild:
	This is the main beneficiary of the change.  By keeping a local list
	of buffers to write we reduce latency of writing out buffers, and
	more importably we can remove all the delwri list promotions which
	were hitting the buffer cache hard under sustained metadata loads.

The implementation is very straight forward - xfs_buf_delwri_queue now gets
a new list_head pointer that it adds the delwri buffers to, and all callers
need to eventually submit the list using xfs_buf_delwi_submit or
xfs_buf_delwi_submit_nowait.  Buffers that already are on a delwri list are
skipped in xfs_buf_delwri_queue, assuming they already are on another delwri
list.  The biggest change to pass down the buffer list was done to the AIL
pushing. Now that we operate on buffers the trylock, push and pushbuf log
item methods are merged into a single push routine, which tries to lock the
item, and if possible add the buffer that needs writeback to the buffer list.
This leads to much simpler code than the previous split but requires the
individual IOP_PUSH instances to unlock and reacquire the AIL around calls
to blocking routines.

Given that xfsailds now also handle writing out buffers, the conditions for
log forcing and the sleep times needed some small changes.  The most
important one is that we consider an AIL busy as long we still have buffers
to push, and the other one is that we do increment the pushed LSN for
buffers that are under flushing at this moment, but still count them towards
the stuck items for restart purposes.  Without this we could hammer on stuck
items without ever forcing the log and not make progress under heavy random
delete workloads on fast flash storage devices.

[ Dave Chinner:
	- rebase on previous patches.
	- improved comments for XBF_DELWRI_Q handling
	- fix XBF_ASYNC handling in queue submission (test 106 failure)
	- rename delwri submit function buffer list parameters for clarity
	- xfs_efd_item_push() should return XFS_ITEM_PINNED ]

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2012-05-14 16:20:31 -05:00
Christoph Hellwig
960c60af8b xfs: do not add buffers to the delwri queue until pushed
Instead of adding buffers to the delwri list as soon as they are logged,
even if they can't be written until commited because they are pinned
defer adding them to the delwri list until xfsaild pushes them.  This
makes the code more similar to other log items and prepares for writing
buffers directly from xfsaild.

The complication here is that we need to fail buffers that were added
but not logged yet in xfs_buf_item_unpin, borrowing code from
xfs_bioerror.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2012-05-14 16:20:30 -05:00
Christoph Hellwig
fe7257fd4b xfs: do not write the buffer from xfs_qm_dqflush
Instead of writing the buffer directly from inside xfs_qm_dqflush return it
to the caller and let the caller decide what to do with the buffer.  Also
remove the pincount check in xfs_qm_dqflush that all non-blocking callers
already implement and the now unused flags parameter and the XFS_DQ_IS_DIRTY
check that all callers already perform.

[ Dave Chinner: fixed build error cause by missing '{'. ]

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2012-05-14 16:20:29 -05:00
Christoph Hellwig
4c46819a80 xfs: do not write the buffer from xfs_iflush
Instead of writing the buffer directly from inside xfs_iflush return it to
the caller and let the caller decide what to do with the buffer.  Also
remove the pincount check in xfs_iflush that all non-blocking callers already
implement and the now unused flags parameter.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2012-05-14 16:20:28 -05:00
Christoph Hellwig
8a48088f64 xfs: don't flush inodes from background inode reclaim
We already flush dirty inodes throug the AIL regularly, there is no reason
to have second thread compete with it and disturb the I/O pattern.  We still
do write inodes when doing a synchronous reclaim from the shrinker or during
unmount for now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2012-05-14 16:20:28 -05:00
Christoph Hellwig
211e4d434b xfs: implement freezing by emptying the AIL
Now that we write back all metadata either synchronously or through
the AIL we can simply implement metadata freezing in terms of
emptying the AIL.

The implementation for this is fairly simply and straight-forward:
A new routine is added that asks the xfsaild to push the AIL to the
end and waits for it to complete and send a wakeup. The routine will
then loop if the AIL is not actually empty, and continue to do so
until the AIL is compeltely empty.

We keep an inode reclaim pass in the freeze process to avoid having
memory pressure have to reclaim inodes that require dirtying the
filesystem to be reclaimed after the freeze has completed. This
means we can also treat unmount in the exact same way as freeze.

As an upside we can now remove the radix tree based inode writeback
and xfs_unmountfs_writesb.

[ Dave Chinner:
	- Cleaned up commit message.
	- Added inode reclaim passes back into freeze.
	- Cleaned up wakeup mechanism to avoid the use of a new
	  sleep counter variable. ]

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2012-05-14 16:20:27 -05:00
Christoph Hellwig
1c30462542 xfs: allow assigning the tail lsn with the AIL lock held
Provide a variant of xlog_assign_tail_lsn that has the AIL lock already
held.  By doing so we do an additional atomic_read + atomic_set under
the lock, which comes down to two instructions.

Switch xfs_trans_ail_update_bulk and xfs_trans_ail_delete_bulk to the
new version to reduce the number of lock roundtrips, and prepare for
a new addition that would require a third lock roundtrip in
xfs_trans_ail_delete_bulk.  This addition is also the reason for
slightly rearranging the conditionals and relying on xfs_log_space_wake
for checking that the filesystem has been shut down internally.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2012-05-14 16:20:26 -05:00
Christoph Hellwig
32ce90a4b7 xfs: remove log item from AIL in xfs_iflush after a shutdown
If a filesystem has been forced shutdown we are never going to write inodes
to disk, which means the inode items will stay in the AIL until we free
the inode. Currently that is not a problem, but a pending change requires us
to empty the AIL before shutting down the filesystem. In that case leaving
the inode in the AIL is lethal. Make sure to remove the log item from the AIL
to allow emptying the AIL on shutdown filesystems.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2012-05-14 16:20:25 -05:00
Christoph Hellwig
dea9609527 xfs: remove log item from AIL in xfs_qm_dqflush after a shutdown
If a filesystem has been forced shutdown we are never going to write dquots
to disk, which means the dquot items will stay in the AIL forever.
Currently that is not a problem, but a pending chance requires us to
empty the AIL before shutting down the filesystem, in which case this
behaviour is lethal.  Make sure to remove the log item from the AIL
to allow emptying the AIL on shutdown filesystems.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2012-05-14 16:20:24 -05:00
Shaohua Li
7582df516c xfs: using GFP_NOFS for blkdev_issue_flush
Issuing a block device flush request in transaction context using GFP_KERNEL
directly can cause deadlocks due to memory reclaim recursion. Use GFP_NOFS to
avoid recursion from reclaim context.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2012-05-14 16:20:23 -05:00
Dave Chinner
01c84d2dc1 xfs: punch all delalloc blocks beyond EOF on write failure.
I've been seeing regular ASSERT failures in xfstests when running
fsstress based tests over the past month. xfs_getbmap() has been
failing this test:

XFS: Assertion failed: ((iflags & BMV_IF_DELALLOC) != 0) ||
(map[i].br_startblock != DELAYSTARTBLOCK), file: fs/xfs/xfs_bmap.c,
line: 5650

where it is encountering a delayed allocation extent after writing
all the dirty data to disk and then walking the extent map
atomically by holding the XFS_IOLOCK_SHARED to prevent new delayed
allocation extents from being created.

Test 083 on a 512 byte block size filesystem was used to reproduce
the problem, because it only had a 5s run timeand would usually fail
every 3-4 runs. This test is exercising ENOSPC behaviour by running
fsstress on a nearly full filesystem. The following trace extract
shows the final few events on the inode that tripped the assert:

 xfs_ilock:             flags ILOCK_EXCL caller xfs_setfilesize
 xfs_setfilesize:       isize 0x180000 disize 0x12d400 offset 0x17e200 count 7680

file size updated to 0x180000 by IO completion

 xfs_ilock:             flags ILOCK_EXCL caller xfs_iomap_write_delay
 xfs_iext_insert:       state  idx 3 offset 3072 block 4503599627239432 count 1 flag 0 caller xfs_bmap_add_extent_hole_delay
 xfs_get_blocks_alloc:  size 0x180000 offset 0x180000 count 512 type  startoff 0xc00 startblock -1 blockcount 0x1
 xfs_ilock:             flags ILOCK_EXCL caller __xfs_get_blocks

delalloc write, adding a single block at offset 0x180000

 xfs_delalloc_enospc:   isize 0x180000 disize 0x180000 offset 0x180200 count 512

ENOSPC trying to allocate a dellalloc block at offset 0x180200

 xfs_ilock:             flags ILOCK_EXCL caller xfs_iomap_write_delay
 xfs_get_blocks_alloc:  size 0x180000 offset 0x180200 count 512 type  startoff 0xc00 startblock -1 blockcount 0x2

And succeeding on retry after flushing dirty inodes.

 xfs_ilock:             flags ILOCK_EXCL caller __xfs_get_blocks
 xfs_delalloc_enospc:   isize 0x180000 disize 0x180000 offset 0x180400 count 512

ENOSPC trying to allocate a dellalloc block at offset 0x180400

 xfs_ilock:             flags ILOCK_EXCL caller xfs_iomap_write_delay
 xfs_delalloc_enospc:   isize 0x180000 disize 0x180000 offset 0x180400 count 512

And failing the retry, giving a real ENOSPC error.

 xfs_ilock:             flags ILOCK_EXCL caller xfs_vm_write_failed
                                                ^^^^^^^^^^^^^^^^^^^
The smoking gun - the write being failed and cleaning up delalloc
blocks beyond EOF allocated by the failed write.

 xfs_getattr:
 xfs_ilock:             flags IOLOCK_SHARED caller xfs_getbmap
 xfs_ilock:             flags ILOCK_SHARED caller xfs_ilock_map_shared

And that's where we died almost immediately afterwards.
xfs_bmapi_read() found delalloc extent beyond current file in memory
file size. Some debug I added to xfs_getbmap() showed the state just
before the assert failure:

 ino 0x80e48: off 0xc00, fsb 0xffffffffffffffff, len 0x1, size 0x180000
 start_fsb 0x106, end_fsb 0x638
 ino flags 0x2 nex 0xd bmvcnt 0x555, len 0x3c58a6f23c0bf1, start 0xc00
 ext 0: off 0x1fc, fsb 0x24782, len 0x254
 ext 1: off 0x450, fsb 0x40851, len 0x30
 ext 2: off 0x480, fsb 0xd99, len 0x1b8
 ext 3: off 0x92f, fsb 0x4099a, len 0x3b
 ext 4: off 0x96d, fsb 0x41844, len 0x98
 ext 5: off 0xbf1, fsb 0x408ab, len 0xf

which shows that we found a single delalloc block beyond EOF (first
line of output) when we were returning the map for a length
somewhere around 10^16 bytes long (second line), and the on-disk
extents showed they didn't go past EOF (last lines).

Further debug added to xfs_vm_write_failed() showed this happened
when punching out delalloc blocks beyond the end of the file after
the failed write:

[  132.606693] ino 0x80e48: vwf to 0x181000, sze 0x180000
[  132.609573] start_fsb 0xc01, end_fsb 0xc08

It punched the range 0xc01 -> 0xc08, but the range we really need to
punch is 0xc00 -> 0xc07 (8 blocks from 0xc00) as this testing was
run on a 512 byte block size filesystem (8 blocks per page).
the punch from is 0xc00. So end_fsb is correct, but start_fsb is
wrong as we punch from start_fsb for (end_fsb - start_fsb) blocks.
Hence we are not punching the delalloc block beyond EOF in the case.

The fix is simple - it's a silly off-by-one mistake in calculating
the range. It's especially silly because the macro used to calculate
the start_fsb already takes into account the case where the inode
size is an exact multiple of the filesystem block size...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2012-05-14 16:20:22 -05:00
Dave Chinner
507630b29f xfs: use shared ilock mode for direct IO writes by default
For the direct IO write path, we only really need the ilock to be taken in
exclusive mode during IO submission if we need to do extent allocation
instead of all the time.

Change the block mapping code to take the ilock in shared mode for the
initial block mapping, and only retake it exclusively when we actually
have to perform extent allocations.  We were already dropping the ilock
for the transaction allocation, so this doesn't introduce new race windows.

Based on an earlier patch from Dave Chinner.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2012-05-14 16:20:21 -05:00
Christoph Hellwig
193aec1050 xfs: push the ilock into xfs_zero_eof
Instead of calling xfs_zero_eof with the ilock held only take it internally
for the minimall required critical section around xfs_bmapi_read.  This
also requires changing the calling convention for xfs_zero_last_block
slightly.  The actual zeroing operation is still serialized by the iolock,
which must be taken exclusively over the call to xfs_zero_eof.

We could in fact use a shared lock for the xfs_bmapi_read calls as long as
the extent list has been read in, but given that we already hold the iolock
exclusively there is little reason to micro optimize this further.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2012-05-14 16:20:20 -05:00
Christoph Hellwig
f38996f576 xfs: reduce ilock hold times in xfs_setattr_size
We do not need the ilock for most checks done in the beginning of
xfs_setattr_size.  Replace the long critical section before starting the
transaction with a smaller one around xfs_zero_eof and an optional one
inside xfs_qm_dqattach that isn't entered unless using quotas.  While
this isn't a big optimization for xfs_setattr_size itself it will allow
pushing the ilock into xfs_zero_eof itself later.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2012-05-14 16:20:18 -05:00
Christoph Hellwig
467f78992a xfs: reduce ilock hold times in xfs_file_aio_write_checks
We do not need the ilock for generic_write_checks and the i_size_read,
which are protected by i_mutex and/or iolock, so reduce the ilock
critical section to just the call to xfs_zero_eof.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2012-05-14 16:20:17 -05:00
Christoph Hellwig
b4d05e3019 xfs: avoid taking the ilock unnessecarily in xfs_qm_dqattach
Check if we actually need to attach a dquot before taking the ilock in
xfs_qm_dqattach.  This avoid superflous lock roundtrips for the common cases
of quota support compiled in but not activated on a filesystem and an
inode that already has the dquots attached.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2012-05-14 16:20:15 -05:00
Alan Stern
356c05d58a sysfs: get rid of some lockdep false positives
This patch (as1554) fixes a lockdep false-positive report.  The
problem arises because lockdep is unable to deal with the
tree-structured locks created by the device core and sysfs.

This particular problem involves a sysfs attribute method that
unregisters itself, not from the device it was called for, but from a
descendant device.  Lockdep doesn't understand the distinction and
reports a possible deadlock, even though the operation is safe.

This is the sort of thing that would normally be handled by using a
nested lock annotation; unfortunately it's not feasible to do that
here.  There's no sensible way to tell sysfs when attribute removal
occurs in the context of a parent attribute method.

As a workaround, the patch adds a new flag to struct attribute
telling sysfs not to inform lockdep when it acquires a readlock on a
sysfs_dirent instance for the attribute.  The readlock is still
acquired, but lockdep doesn't know about it and hence does not
complain about impossible deadlock scenarios.

Also added are macros for static initialization of attribute
structures with the ignore_lockdep flag set.  The three offending
attributes in the USB subsystem are converted to use the new macros.

Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Acked-by: Tejun Heo <tj@kernel.org>
CC: Eric W. Biederman <ebiederm@xmission.com>
CC: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-05-14 12:19:56 -07:00
Linus Torvalds
9ff00d58a9 Three fixes for 3.4:
- Fix a lock ordering deadlock in JFFS2
  - Fix an oops in the dataflash driver, triggered by a dummy call to test
    whether it has OTP functionality.
  - Fix request_mem_region() failure on amsdelta NAND driver.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iEYEABECAAYFAk+vekgACgkQdwG7hYl686N8bQCfdizsFrliKbDW20R/pO66NoAV
 aloAn0ln+mwe3rIdNt8qKynW8e8dbudF
 =R7XS
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-3.4-20120513' of git://git.infradead.org/linux-mtd

Pull three MTD fixes from David Woodhouse:
 - Fix a lock ordering deadlock in JFFS2
 - Fix an oops in the dataflash driver, triggered by a dummy call to test
   whether it has OTP functionality.
 - Fix request_mem_region() failure on amsdelta NAND driver.

* tag 'for-linus-3.4-20120513' of git://git.infradead.org/linux-mtd:
  mtd: ams-delta: fix request_mem_region() failure
  jffs2: Fix lock acquisition order bug in gc path
  mtd: fix oops in dataflash driver
2012-05-13 11:33:09 -07:00
Rafael J. Wysocki
351520a9eb Merge branch 'pm-sleep'
* pm-sleep:
  PM / Sleep: User space wakeup sources garbage collector Kconfig option
  PM / Sleep: Make the limit of user space wakeup sources configurable
  PM / Documentation: suspend-and-cpuhotplug.txt: Fix typo
  PM / Sleep: Fix a mistake in a conditional in autosleep_store()
  epoll: Add a flag, EPOLLWAKEUP, to prevent suspend while epoll events are ready
  PM / Sleep: Add user space interface for manipulating wakeup sources, v3
  PM / Sleep: Add "prevent autosleep time" statistics to wakeup sources
  PM / Sleep: Implement opportunistic sleep, v2
  PM / Sleep: Add wakeup_source_activate and wakeup_source_deactivate tracepoints
  PM / Sleep: Change wakeup source statistics to follow Android
  PM / Sleep: Use wait queue to signal "no wakeup events in progress"
  PM / Sleep: Look for wakeup events in later stages of device suspend
  PM / Hibernate: Hibernate/thaw fixes/improvements
2012-05-11 21:15:09 +02:00
Bernd Schubert
f908ee9463 bio allocation failure due to bio_get_nr_vecs()
The number of bio_get_nr_vecs() is passed down via bio_alloc() to
bvec_alloc_bs(), which fails the bio allocation if
nr_iovecs > BIO_MAX_PAGES. For the underlying caller this causes an
unexpected bio allocation failure.
Limiting to queue_max_segments() is not sufficient, as max_segments
also might be very large.

bvec_alloc_bs(gfp_mask, nr_iovecs, ) => NULL when nr_iovecs  > BIO_MAX_PAGES
bio_alloc_bioset(gfp_mask, nr_iovecs, ...)
bio_alloc(GFP_NOIO, nvecs)
xfs_alloc_ioend_bio()

Signed-off-by: Bernd Schubert <bernd.schubert@itwm.fraunhofer.de>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-05-11 16:45:12 +02:00
Jeff Moyer
080399aaaf block: don't mark buffers beyond end of disk as mapped
Hi,

We have a bug report open where a squashfs image mounted on ppc64 would
exhibit errors due to trying to read beyond the end of the disk.  It can
easily be reproduced by doing the following:

[root@ibm-p750e-02-lp3 ~]# ls -l install.img
-rw-r--r-- 1 root root 142032896 Apr 30 16:46 install.img
[root@ibm-p750e-02-lp3 ~]# mount -o loop ./install.img /mnt/test
[root@ibm-p750e-02-lp3 ~]# dd if=/dev/loop0 of=/dev/null
dd: reading `/dev/loop0': Input/output error
277376+0 records in
277376+0 records out
142016512 bytes (142 MB) copied, 0.9465 s, 150 MB/s

In dmesg, you'll find the following:

squashfs: version 4.0 (2009/01/31) Phillip Lougher
[   43.106012] attempt to access beyond end of device
[   43.106029] loop0: rw=0, want=277410, limit=277408
[   43.106039] Buffer I/O error on device loop0, logical block 138704
[   43.106053] attempt to access beyond end of device
[   43.106057] loop0: rw=0, want=277412, limit=277408
[   43.106061] Buffer I/O error on device loop0, logical block 138705
[   43.106066] attempt to access beyond end of device
[   43.106070] loop0: rw=0, want=277414, limit=277408
[   43.106073] Buffer I/O error on device loop0, logical block 138706
[   43.106078] attempt to access beyond end of device
[   43.106081] loop0: rw=0, want=277416, limit=277408
[   43.106085] Buffer I/O error on device loop0, logical block 138707
[   43.106089] attempt to access beyond end of device
[   43.106093] loop0: rw=0, want=277418, limit=277408
[   43.106096] Buffer I/O error on device loop0, logical block 138708
[   43.106101] attempt to access beyond end of device
[   43.106104] loop0: rw=0, want=277420, limit=277408
[   43.106108] Buffer I/O error on device loop0, logical block 138709
[   43.106112] attempt to access beyond end of device
[   43.106116] loop0: rw=0, want=277422, limit=277408
[   43.106120] Buffer I/O error on device loop0, logical block 138710
[   43.106124] attempt to access beyond end of device
[   43.106128] loop0: rw=0, want=277424, limit=277408
[   43.106131] Buffer I/O error on device loop0, logical block 138711
[   43.106135] attempt to access beyond end of device
[   43.106139] loop0: rw=0, want=277426, limit=277408
[   43.106143] Buffer I/O error on device loop0, logical block 138712
[   43.106147] attempt to access beyond end of device
[   43.106151] loop0: rw=0, want=277428, limit=277408
[   43.106154] Buffer I/O error on device loop0, logical block 138713
[   43.106158] attempt to access beyond end of device
[   43.106162] loop0: rw=0, want=277430, limit=277408
[   43.106166] attempt to access beyond end of device
[   43.106169] loop0: rw=0, want=277432, limit=277408
...
[   43.106307] attempt to access beyond end of device
[   43.106311] loop0: rw=0, want=277470, limit=2774

Squashfs manages to read in the end block(s) of the disk during the
mount operation.  Then, when dd reads the block device, it leads to
block_read_full_page being called with buffers that are beyond end of
disk, but are marked as mapped.  Thus, it would end up submitting read
I/O against them, resulting in the errors mentioned above.  I fixed the
problem by modifying init_page_buffers to only set the buffer mapped if
it fell inside of i_size.

Cheers,
Jeff

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Acked-by: Nick Piggin <npiggin@kernel.dk>

--

Changes from v1->v2: re-used max_block, as suggested by Nick Piggin.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-05-11 16:42:14 +02:00
Bob Peterson
41db1ab9be GFS2: Add rgrp information to block_alloc trace point
This is a second attempt at a patch that adds rgrp information to the
block allocation trace point for GFS2. As suggested, the patch was
modified to list the rgrp information _after_ the fields that exist today.

Again, the reason for this patch is to allow us to trace and debug
problems with the block reservations patch, which is still in the works.
We can debug problems with reservations if we can see what block allocations
result from the block reservations. It may also be handy in figuring out
if there are problems in rgrp free space accounting. In other words,
we can use it to track the rgrp and its free space along side the allocations
that are taking place.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2012-05-11 10:31:34 +01:00
Bob Peterson
f2f9c81244 GFS2: Eliminate unused "new" parameter to gfs2_meta_indirect_buffer
It turns out that the "new" parameter to function gfs2_meta_indirect_buffer
was always being passed in as zero. Therefore, this patch eliminates it
and simplifies the function.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2012-05-11 10:19:23 +01:00
Linus Torvalds
26fe575028 vfs: make it possible to access the dentry hash/len as one 64-bit entry
This allows comparing hash and len in one operation on 64-bit
architectures.  Right now only __d_lookup_rcu() takes advantage of this,
since that is the case we care most about.

The use of anonymous struct/unions hides the alternate 64-bit approach
from most users, the exception being a few cases where we initialize a
'struct qstr' with a static initializer.  This makes the problematic
cases use a new QSTR_INIT() helper function for that (but initializing
just the name pointer with a "{ .name = xyzzy }" initializer remains
valid, as does just copying another qstr structure).

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-10 19:54:35 -07:00
Linus Torvalds
ee983e8967 vfs: move dentry name length comparison from dentry_cmp() into callers
All callers do want to check the dentry length, but some of them can
check the length and the hash together, so doing it in dentry_cmp() can
be counter-productive.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-10 19:54:35 -07:00
Linus Torvalds
94753db5ed vfs: do the careful dentry name access for all dentry_cmp cases
Commit 12f8ad4b05 ("vfs: clean up __d_lookup_rcu() and dentry_cmp()
interfaces") did the careful ACCESS_ONCE() of the dentry name only for
the word-at-a-time case, even though the issue is generic.

Admittedly I don't really see gcc ever reloading the value in the middle
of the loop, so the ACCESS_ONCE() protects us from a fairly theoretical
issue. But better safe than sorry.

Also, this consolidates the common parts of the word-at-a-time and
bytewise logic, which includes checking the length.  We'll be changing
that later.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-10 19:54:09 -07:00
Linus Torvalds
8c01a529b8 vfs: remove unnecessary d_unhashed() check from __d_lookup_rcu
The check for d_unhashed() is not strictly incorrect, but at the same
time it is also not sensible.  The actual dentry removal from the dentry
hash chains is totally asynchronous to the __d_lookup_rcu() logic, and
we depend on __d_drop() updating the sequence number to invalidate any
lookup of an unhashed dentry.

So checking d_unhashed() is not incorrect, but it's not useful either:
the code has to work correctly even without it. So just remove it.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-10 19:52:35 -07:00
Linus Torvalds
7c283324da Merge branch 'akpm' (Andrew's patch-bomb)
Merge misc fixes from Andrew Morton.

* emailed from Andrew Morton <akpm@linux-foundation.org>: (8 patches)
  MAINTAINERS: add maintainer for LED subsystem
  mm: nobootmem: fix sign extend problem in __free_pages_memory()
  drivers/leds: correct __devexit annotations
  memcg: free spare array to avoid memory leak
  namespaces, pid_ns: fix leakage on fork() failure
  hugetlb: prevent BUG_ON in hugetlb_fault() -> hugetlb_cow()
  mm: fix division by 0 in percpu_pagelist_fraction()
  proc/pid/pagemap: correctly report non-present ptes and holes between vmas
2012-05-10 15:17:24 -07:00
Konstantin Khlebnikov
16fbdce62d proc/pid/pagemap: correctly report non-present ptes and holes between vmas
Reset the current pagemap-entry if the current pte isn't present, or if
current vma is over.  Otherwise pagemap reports last entry again and
again.

Non-present pte reporting was broken in commit 092b50bacd ("pagemap:
introduce data structure for pagemap entry")

Reporting for holes was broken in commit 5aaabe831e ("pagemap: avoid
splitting thp when reading /proc/pid/pagemap")

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Reported-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-10 15:06:44 -07:00
Dan Carpenter
48a5730e5b cifs: fix revalidation test in cifs_llseek()
This test is always true so it means we revalidate the length every
time, which generates more network traffic.  When it is SEEK_SET or
SEEK_CUR, then we don't need to revalidate.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2012-05-09 15:16:22 -05:00
Trond Myklebust
0427708657 NFS: Clean up - Simplify reference counting in fs/nfs/direct.c
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Fred Isaman <iisaman@netapp.com>
2012-05-09 15:17:49 -04:00
Trond Myklebust
1d1afcbc29 NFS: Clean up - Rename nfs_unlock_request and nfs_unlock_request_dont_release
Function rename to ensure that the functionality of nfs_unlock_request()
mirrors that of nfs_lock_request(). Then let nfs_unlock_and_release_request()
do the work of what used to be called nfs_unlock_request()...

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Fred Isaman <iisaman@netapp.com>
2012-05-09 15:17:43 -04:00
Trond Myklebust
7ad84aa944 NFS: Clean up - simplify nfs_lock_request()
We only have two places where we need to grab a reference when trying
to lock the nfs_page. We're better off making that explicit.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Fred Isaman <iisaman@netapp.com>
2012-05-09 15:17:34 -04:00
Trond Myklebust
d1182b33ed NFS: nfs_set_page_writeback no longer needs to reference the page
We now hold a reference to the nfs_page across the calls to
nfs_set_page_writeback and nfs_end_page_writeback, and that
means we already have a reference to the struct page.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Fred Isaman <iisaman@netapp.com>
2012-05-09 15:17:28 -04:00
Trond Myklebust
3aff4ebb95 NFS: Prevent a deadlock in the new writeback code
We have to unlock the nfs_page before we call nfs_end_page_writeback
to avoid races with functions that expect the page to be unlocked
when PG_locked and PG_writeback are not set.
The problem is that nfs_unlock_request also releases the nfs_page,
causing a deadlock if the release of the nfs_open_context
triggers an iput() while the PG_writeback flag is still set...

The solution is to separate the unlocking and release of the nfs_page,
so that we can do the former before nfs_end_page_writeback and the
latter after.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Fred Isaman <iisaman@netapp.com>
2012-05-09 15:16:07 -04:00
Trond Myklebust
dc327ed4cd NFSv4: nfs_client_return_marked_delegations can't flush data
Since even filemap_flush() needs to lock pages that are dirty, we
cannot risk calling it from the state manager context. Therefore,
we need to move the call to filemap_flush() to
nfs_async_inode_return_delegation().

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-05-08 12:53:21 -04:00
Trond Myklebust
c57d1bc5e0 NFS: nfs_inode_return_delegation() should always flush dirty data
The assumption is that if you are in a situation where you need to
return the delegation, then you should probably stop caching the
data anyway.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-05-08 12:53:21 -04:00
Trond Myklebust
14546c3375 NFS: Don't do a full flush to disk on close() if we hold a delegation
If we hold a delegation then we know that it should be safe to continue
to cache the data beyond the close(). However since the process that wrote
the data may die after close(), we may still want to send the data to
server before those RPCSEC_GSS credentials expire. We therefore compromise
by starting writeback to the server, but don't wait for completion.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-05-08 12:53:21 -04:00
Bob Peterson
6de1e2f34a GFS2: Remove redundant metadata block type check
This patch removes a redundant metadata block check. See description below.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2012-05-08 16:18:55 +01:00
David S. Miller
0d6c4a2e46 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/intel/e1000e/param.c
	drivers/net/wireless/iwlwifi/iwl-agn-rx.c
	drivers/net/wireless/iwlwifi/iwl-trans-pcie-rx.c
	drivers/net/wireless/iwlwifi/iwl-trans.h

Resolved the iwlwifi conflict with mainline using 3-way diff posted
by John Linville and Stephen Rothwell.  In 'net' we added a bug
fix to make iwlwifi report a more accurate skb->truesize but this
conflicted with RX path changes that happened meanwhile in net-next.

In e1000e a conflict arose in the validation code for settings of
adapter->itr.  'net-next' had more sophisticated logic so that
logic was used.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-07 23:35:40 -04:00
Josh Cartwright
226bb7df3d jffs2: Fix lock acquisition order bug in gc path
The locking policy is such that the erase_complete_block spinlock is
nested within the alloc_sem mutex.  This fixes a case in which the
acquisition order was erroneously reversed.  This issue was caught by
the following lockdep splat:

   =======================================================
   [ INFO: possible circular locking dependency detected ]
   3.0.5 #1
   -------------------------------------------------------
   jffs2_gcd_mtd6/299 is trying to acquire lock:
    (&c->alloc_sem){+.+.+.}, at: [<c01f7714>] jffs2_garbage_collect_pass+0x314/0x890

   but task is already holding lock:
    (&(&c->erase_completion_lock)->rlock){+.+...}, at: [<c01f7708>] jffs2_garbage_collect_pass+0x308/0x890

   which lock already depends on the new lock.

   the existing dependency chain (in reverse order) is:

   -> #1 (&(&c->erase_completion_lock)->rlock){+.+...}:
          [<c008bec4>] validate_chain+0xe6c/0x10bc
          [<c008c660>] __lock_acquire+0x54c/0xba4
          [<c008d240>] lock_acquire+0xa4/0x114
          [<c046780c>] _raw_spin_lock+0x3c/0x4c
          [<c01f744c>] jffs2_garbage_collect_pass+0x4c/0x890
          [<c01f937c>] jffs2_garbage_collect_thread+0x1b4/0x1cc
          [<c0071a68>] kthread+0x98/0xa0
          [<c000f264>] kernel_thread_exit+0x0/0x8

   -> #0 (&c->alloc_sem){+.+.+.}:
          [<c008ad2c>] print_circular_bug+0x70/0x2c4
          [<c008c08c>] validate_chain+0x1034/0x10bc
          [<c008c660>] __lock_acquire+0x54c/0xba4
          [<c008d240>] lock_acquire+0xa4/0x114
          [<c0466628>] mutex_lock_nested+0x74/0x33c
          [<c01f7714>] jffs2_garbage_collect_pass+0x314/0x890
          [<c01f937c>] jffs2_garbage_collect_thread+0x1b4/0x1cc
          [<c0071a68>] kthread+0x98/0xa0
          [<c000f264>] kernel_thread_exit+0x0/0x8

   other info that might help us debug this:

    Possible unsafe locking scenario:

          CPU0                    CPU1
          ----                    ----
     lock(&(&c->erase_completion_lock)->rlock);
                                  lock(&c->alloc_sem);
                                  lock(&(&c->erase_completion_lock)->rlock);
     lock(&c->alloc_sem);

    *** DEADLOCK ***

   1 lock held by jffs2_gcd_mtd6/299:
    #0:  (&(&c->erase_completion_lock)->rlock){+.+...}, at: [<c01f7708>] jffs2_garbage_collect_pass+0x308/0x890

   stack backtrace:
   [<c00155dc>] (unwind_backtrace+0x0/0x100) from [<c0463dc0>] (dump_stack+0x20/0x24)
   [<c0463dc0>] (dump_stack+0x20/0x24) from [<c008ae84>] (print_circular_bug+0x1c8/0x2c4)
   [<c008ae84>] (print_circular_bug+0x1c8/0x2c4) from [<c008c08c>] (validate_chain+0x1034/0x10bc)
   [<c008c08c>] (validate_chain+0x1034/0x10bc) from [<c008c660>] (__lock_acquire+0x54c/0xba4)
   [<c008c660>] (__lock_acquire+0x54c/0xba4) from [<c008d240>] (lock_acquire+0xa4/0x114)
   [<c008d240>] (lock_acquire+0xa4/0x114) from [<c0466628>] (mutex_lock_nested+0x74/0x33c)
   [<c0466628>] (mutex_lock_nested+0x74/0x33c) from [<c01f7714>] (jffs2_garbage_collect_pass+0x314/0x890)
   [<c01f7714>] (jffs2_garbage_collect_pass+0x314/0x890) from [<c01f937c>] (jffs2_garbage_collect_thread+0x1b4/0x1cc)
   [<c01f937c>] (jffs2_garbage_collect_thread+0x1b4/0x1cc) from [<c0071a68>] (kthread+0x98/0xa0)
   [<c0071a68>] (kthread+0x98/0xa0) from [<c000f264>] (kernel_thread_exit+0x0/0x8)

This was introduce in '81cfc9f jffs2: Fix serious write stall due to erase'.

Cc: stable@kernel.org [2.6.37+]
Signed-off-by: Josh Cartwright <joshc@linux.com>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
2012-05-07 20:30:14 +01:00
Linus Torvalds
8529f613b6 vfs: don't force a big memset of stat data just to clear padding fields
Admittedly this is something that the compiler should be able to just do
for us, but gcc just isn't that smart.  And trying to use a structure
initializer (which would get us the right semantics) ends up resulting
in gcc allocating stack space for _two_ 'struct stat', and then copying
one into the other.

So do it by hand - just have a per-architecture macro that initializes
the padding fields.  And if the architecture doesn't provide one, fall
back to the old behavior of just doing the whole memset() first.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-06 18:02:40 -07:00
Linus Torvalds
a52dd971f9 vfs: de-crapify "cp_new_stat()" function
It's an unreadable mess of 32-bit vs 64-bit #ifdef's that mostly follow
a rather simple pattern.

Make a helper #define to handle that pattern, in the process making the
code both shorter and more readable.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-06 17:47:30 -07:00
Linus Torvalds
271fd5d728 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "The big ones here are a memory leak we introduced in rc1, and a
  scheduling while atomic if the transid on disk doesn't match the
  transid we expected.  This happens for corrupt blocks, or out of date
  disks.

  It also fixes up the ioctl definition for our ioctl to resolve logical
  inode numbers.  The __u32 was a merging error and doesn't match what
  we ship in the progs."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: avoid sleeping in verify_parent_transid while atomic
  Btrfs: fix crash in scrub repair code when device is missing
  btrfs: Fix mismatching struct members in ioctl.h
  Btrfs: fix page leak when allocing extent buffers
  Btrfs: Add properly locking around add_root_to_dirty_list
2012-05-06 10:20:07 -07:00
Chris Mason
b9fab919b7 Btrfs: avoid sleeping in verify_parent_transid while atomic
verify_parent_transid needs to lock the extent range to make
sure no IO is underway, and so it can safely clear the
uptodate bits if our checks fail.

But, a few callers are using it with spinlocks held.  Most
of the time, the generation numbers are going to match, and
we don't want to switch to a blocking lock just for the error
case.  This adds an atomic flag to verify_parent_transid,
and changes it to return EAGAIN if it needs to block to
properly verifiy things.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-05-06 07:23:47 -04:00
Jan Kara
169ebd9013 writeback: Avoid iput() from flusher thread
Doing iput() from flusher thread (writeback_sb_inodes()) can create problems
because iput() can do a lot of work - for example truncate the inode if it's
the last iput on unlinked file. Some filesystems depend on flusher thread
progressing (e.g. because they need to flush delay allocated blocks to reduce
allocation uncertainty) and so flusher thread doing truncate creates
interesting dependencies and possibilities for deadlocks.

We get rid of iput() in flusher thread by using the fact that I_SYNC inode
flag effectively pins the inode in memory. So if we take care to either hold
i_lock or have I_SYNC set, we can get away without taking inode reference
in writeback_sb_inodes().

As a side effect of these changes, we also fix possible use-after-free in
wb_writeback() because inode_wait_for_writeback() call could try to reacquire
i_lock on the inode that was already free.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
2012-05-06 13:43:41 +08:00
Jan Kara
dbd5768f87 vfs: Rename end_writeback() to clear_inode()
After we moved inode_sync_wait() from end_writeback() it doesn't make sense
to call the function end_writeback() anymore. Rename it to clear_inode()
which well says what the function really does - set I_CLEAR flag.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
2012-05-06 13:43:41 +08:00
Jan Kara
7994e6f725 vfs: Move waiting for inode writeback from end_writeback() to evict_inode()
Currently, I_SYNC can never be set when evict_inode() (and thus
end_writeback()) is called because flusher thread holds inode reference while
inode is under writeback. As a result inode_sync_wait() in those places
currently does nothing. However that is going to change and unveils problems
with calling inode_sync_wait() from end_writeback(). Several filesystems call
end_writeback() after they have deleted the inode (btrfs, gfs2, ...) and other
filesystems (ext3, ext4, reiserfs, ...) can deadlock when waiting for I_SYNC
because they call end_writeback() from within a transaction.

To avoid these issues, we move inode_sync_wait() into evict_inode() before
calling ->evict_inode(). That way we preserve the current property that
->evict_inode() and writeback never run in parallel and all filesystems are
safe.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
2012-05-06 13:43:40 +08:00
Jan Kara
4f8ad655db writeback: Refactor writeback_single_inode()
The code in writeback_single_inode() is relatively complex. The list requeing
logic makes sense only for flusher thread but not really for sync_inode() or
write_inode_now() callers. Also when we want to get rid of inode references
held by flusher thread, we will need a special I_SYNC handling there.

So separate part of writeback_single_inode() which does the real writeback work
into __writeback_single_inode() and make writeback_single_inode() do only stuff
necessary for callers writing only one inode, moving the special list handling
into writeback_sb_inodes(). As a sideeffect this fixes a possible race where we
could skip some inode during sync(2) because other writer refiled it from b_io
to b_dirty list. Also I_SYNC handling is moved into the callers of
__writeback_single_inode() to make locking easier.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
2012-05-06 13:43:40 +08:00