Commit graph

28355 commits

Author SHA1 Message Date
Chris Mason
cbea5ac1ee Btrfs: reduce calls to wake_up on uncontended locks
The btrfs locks were unconditionally calling wake_up as the
locks were released.  This lead to extra thrashing on the waitqueue,
especially for locks that were dominated by readers.

Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-07-23 15:36:18 -04:00
Chris Mason
e39e64ac0c Btrfs: don't wait around for new log writers on an SSD
Waiting on spindles improves performance, but ssds want all the
IO as quickly as we can push it down.

Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-07-23 15:36:17 -04:00
Linus Torvalds
a66d2c8f7e Merge branch 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull the big VFS changes from Al Viro:
 "This one is *big* and changes quite a few things around VFS.  What's in there:

   - the first of two really major architecture changes - death to open
     intents.

     The former is finally there; it was very long in making, but with
     Miklos getting through really hard and messy final push in
     fs/namei.c, we finally have it.  Unlike his variant, this one
     doesn't introduce struct opendata; what we have instead is
     ->atomic_open() taking preallocated struct file * and passing
     everything via its fields.

     Instead of returning struct file *, it returns -E...  on error, 0
     on success and 1 in "deal with it yourself" case (e.g.  symlink
     found on server, etc.).

     See comments before fs/namei.c:atomic_open().  That made a lot of
     goodies finally possible and quite a few are in that pile:
     ->lookup(), ->d_revalidate() and ->create() do not get struct
     nameidata * anymore; ->lookup() and ->d_revalidate() get lookup
     flags instead, ->create() gets "do we want it exclusive" flag.

     With the introduction of new helper (kern_path_locked()) we are rid
     of all struct nameidata instances outside of fs/namei.c; it's still
     visible in namei.h, but not for long.  Come the next cycle,
     declaration will move either to fs/internal.h or to fs/namei.c
     itself.  [me, miklos, hch]

   - The second major change: behaviour of final fput().  Now we have
     __fput() done without any locks held by caller *and* not from deep
     in call stack.

     That obviously lifts a lot of constraints on the locking in there.
     Moreover, it's legal now to call fput() from atomic contexts (which
     has immediately simplified life for aio.c).  We also don't need
     anti-recursion logics in __scm_destroy() anymore.

     There is a price, though - the damn thing has become partially
     asynchronous.  For fput() from normal process we are guaranteed
     that pending __fput() will be done before the caller returns to
     userland, exits or gets stopped for ptrace.

     For kernel threads and atomic contexts it's done via
     schedule_work(), so theoretically we might need a way to make sure
     it's finished; so far only one such place had been found, but there
     might be more.

     There's flush_delayed_fput() (do all pending __fput()) and there's
     __fput_sync() (fput() analog doing __fput() immediately).  I hope
     we won't need them often; see warnings in fs/file_table.c for
     details.  [me, based on task_work series from Oleg merged last
     cycle]

   - sync series from Jan

   - large part of "death to sync_supers()" work from Artem; the only
     bits missing here are exofs and ext4 ones.  As far as I understand,
     those are going via the exofs and ext4 trees resp.; once they are
     in, we can put ->write_super() to the rest, along with the thread
     calling it.

   - preparatory bits from unionmount series (from dhowells).

   - assorted cleanups and fixes all over the place, as usual.

  This is not the last pile for this cycle; there's at least jlayton's
  ESTALE work and fsfreeze series (the latter - in dire need of fixes,
  so I'm not sure it'll make the cut this cycle).  I'll probably throw
  symlink/hardlink restrictions stuff from Kees into the next pile, too.
  Plus there's a lot of misc patches I hadn't thrown into that one -
  it's large enough as it is..."

* 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (127 commits)
  ext4: switch EXT4_IOC_RESIZE_FS to mnt_want_write_file()
  btrfs: switch btrfs_ioctl_balance() to mnt_want_write_file()
  switch dentry_open() to struct path, make it grab references itself
  spufs: shift dget/mntget towards dentry_open()
  zoran: don't bother with struct file * in zoran_map
  ecryptfs: don't reinvent the wheels, please - use struct completion
  don't expose I_NEW inodes via dentry->d_inode
  tidy up namei.c a bit
  unobfuscate follow_up() a bit
  ext3: pass custom EOF to generic_file_llseek_size()
  ext4: use core vfs llseek code for dir seeks
  vfs: allow custom EOF in generic_file_llseek code
  vfs: Avoid unnecessary WB_SYNC_NONE writeback during sys_sync and reorder sync passes
  vfs: Remove unnecessary flushing of block devices
  vfs: Make sys_sync writeout also block device inodes
  vfs: Create function for iterating over block devices
  vfs: Reorder operations during sys_sync
  quota: Move quota syncing to ->sync_fs method
  quota: Split dquot_quota_sync() to writeback and cache flushing part
  vfs: Move noop_backing_dev_info check from sync into writeback
  ...
2012-07-23 12:27:27 -07:00
Cong Wang
906adea153 jbd2: remove the second argument of kmap_atomic
Signed-off-by: Cong Wang <amwang@redhat.com>
2012-07-23 14:11:22 +08:00
Prasad Joshi
9f0bbd8ca7 logfs: query block device for number of pages to send with bio
The block device driver puts a limit on maximum number of pages that
can be sent with the bio. Not all block devices can handle
BIO_MAX_PAGES number of pages in bio. Specifically the virtio-blk
diriver limits it to 126. When the LogFS file system was excersized in
KVM, the following bug from do_virtblk_request() was observed

static void do_virtblk_request(struct request_queue *q)
{
	....
	....
	while ((req = blk_peek_request(q)) != NULL) {
		BUG_ON(req->nr_phys_segments + 2 > vblk->sg_elems);
		....
		....
	}
	....
}

The patch fixes the problem by querring the maximum number of pages in
bio allowed from block device driver and then using those many pages
during submit_bio.

Signed-off-by: Prasad Joshi <prasadjoshi.linux@gmail.com>
2012-07-23 10:32:11 +05:30
Prasad Joshi
41b93bc1ee logfs: maintain the ordering of meta-inode destruction
LogFS does not use a specialized area to maintain the inodes. The
inodes information is kept in a specialized file called inode file.
Similarly, the segment information is kept in a segment file. Since
the segment file also has an inode which is kept in the inode file,
the inode for segment file must be evicted before the inode for inode
file. The change fixes the following BUG during unmount

Pid: 2057, comm: umount Not tainted 3.5.0-rc6+ #25 Bochs Bochs
RIP: 0010:[<ffffffffa005c5f2>]  [<ffffffffa005c5f2>] move_page_to_btree+0x32/0x1f0 [logfs]
Process umount (pid: 2057, threadinfo ...)
Call Trace:
[<ffffffff8112adca>] ? find_get_pages+0x2a/0x180
[<ffffffffa00549f5>] logfs_invalidatepage+0x85/0x90 [logfs]
[<ffffffff81136c51>] truncate_inode_page+0xb1/0xd0
[<ffffffff81136dcf>] truncate_inode_pages_range+0x15f/0x490
[<ffffffff81558549>] ? printk+0x78/0x7a
[<ffffffff81137185>] truncate_inode_pages+0x15/0x20
[<ffffffffa005b7fc>] logfs_evict_inode+0x6c/0x190 [logfs]
[<ffffffff8155c75b>] ? _raw_spin_unlock+0x2b/0x40
[<ffffffff8119e3d7>] evict+0xa7/0x1b0
[<ffffffff8119ea6e>] dispose_list+0x3e/0x60
[<ffffffff8119f1c4>] evict_inodes+0xf4/0x110
[<ffffffff81185b53>] generic_shutdown_super+0x53/0xf0
[<ffffffffa005d8f2>] logfs_kill_sb+0x52/0xf0 [logfs]
[<ffffffff81185ec5>] deactivate_locked_super+0x45/0x80
[<ffffffff81186a4a>] deactivate_super+0x4a/0x70
[<ffffffff811a228e>] mntput_no_expire+0xde/0x140
[<ffffffff811a30ff>] sys_umount+0x6f/0x3a0
[<ffffffff8155d8e9>] system_call_fastpath+0x16/0x1b
---[ end trace 45f7752082cefafd ]---

Signed-off-by: Prasad Joshi <prasadjoshi.linux@gmail.com>
2012-07-23 09:35:52 +05:30
Theodore Ts'o
03179fe923 ext4: undo ext4_calc_metadata_amount if we fail to claim space
The function ext4_calc_metadata_amount() has side effects, although
it's not obvious from its function name.  So if we fail to claim
space, regardless of whether we retry to claim the space again, or
return an error, we need to undo these side effects.

Otherwise we can end up incorrectly calculating the number of metadata
blocks needed for the operation, which was responsible for an xfstests
failure for test #271 when using an ext2 file system with delalloc
enabled.

Reported-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2012-07-23 00:00:20 -04:00
Brian Foster
97795d2a5b ext4: don't let i_reserved_meta_blocks go negative
If we hit a condition where we have allocated metadata blocks that
were not appropriately reserved, we risk underflow of
ei->i_reserved_meta_blocks.  In turn, this can throw
sbi->s_dirtyclusters_counter significantly out of whack and undermine
the nondelalloc fallback logic in ext4_nonda_switch().  Warn if this
occurs and set i_allocated_meta_blocks to avoid this problem.

This condition is reproduced by xfstests 270 against ext2 with
delalloc enabled:

Mar 28 08:58:02 localhost kernel: [  171.526344] EXT4-fs (loop1): delayed block allocation failed for inode 14 at logical offset 64486 with max blocks 64 with error -28
Mar 28 08:58:02 localhost kernel: [  171.526346] EXT4-fs (loop1): This should not happen!! Data will be lost

270 ultimately fails with an inconsistent filesystem and requires an
fsck to repair.  The cause of the error is an underflow in
ext4_da_update_reserve_space() due to an unreserved meta block
allocation.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2012-07-22 23:59:40 -04:00
Prasad Joshi
ddb24bbac3 logfs: create a pagecache page if it is not present
While writing the partial journal entries we assumed that the page
associated with the journal would always in locatable. This incorrect
assumption resulted in the following BUG

kernel BUG at /home/benixon/WD_SMR/kernels/linux-3.3.7-logfs/fs/logfs/journal.c:569!
EIP is at logfs_write_area+0xb6/0x109 [logfs]
EAX: 00000000 EBX: 00000000 ECX: ef6efea4 EDX: 00000000
ESI: 001b9000 EDI: f009e000 EBP: c3c13f14 ESP: c3c13ef0
DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
Process sync (pid: 1799, ti=c3c12000 task=f07825b0 task.ti=c3c12000)
Stack:
01001000 c3c13f26 781b9000 00000000 f009e000 f7286000 f1f83400 f8445071
f1f83400 c3c13f30 f8445ae9 c3c13f20 0000100a 000ee000 f009e000 00000001
c3c13f5c f8445d17 c05eb0ee 00000000 f1f83400 ef718000 f009e25c ea9c3d80
Call Trace:
[<f8445071>] ? account_shadow+0x16d/0x16d [logfs]
[<f8445ae9>] logfs_write_je+0x2a/0x44 [logfs]
[<f8445d17>] logfs_write_anchor+0x114/0x228 [logfs]
[<c05eb0ee>] ? empty+0x5/0x5
[<f8444522>] logfs_sync_fs+0x1e/0x31 [logfs]
[<c051be66>] __sync_filesystem+0x5d/0x6f
[<c051be8d>] sync_one_sb+0x15/0x17
[<c04ff8b0>] iterate_supers+0x59/0x9a
[<c051be78>] ? __sync_filesystem+0x6f/0x6f
[<c051befc>] sys_sync+0x29/0x4f
[<c084285f>] sysenter_do_call+0x12/0x28
EIP: [<f8445127>] logfs_write_area+0xb6/0x109 [logfs] SS:ESP 0068:c3c13ef0
---[ end trace ef6e9ef52601a945 ]---

The fix is to create the pagecache page if it is not locatable.

Reported-and-tested-by: Benixon Dhas <Benixon.Dhas@wdc.com>
Signed-off-by: Prasad Joshi <prasadjoshi.linux@gmail.com>
2012-07-23 09:18:14 +05:30
Ashish Sangwan
968dee7722 ext4: fix hole punch failure when depth is greater than 0
Whether to continue removing extents or not is decided by the return
value of function ext4_ext_more_to_rm() which checks 2 conditions:
a) if there are no more indexes to process.
b) if the number of entries are decreased in the header of "depth -1".

In case of hole punch, if the last block to be removed is not part of
the last extent index than this index will not be deleted, hence the
number of valid entries in the extent header of "depth - 1" will
remain as it is and ext4_ext_more_to_rm will return 0 although the
required blocks are not yet removed.

This patch fixes the above mentioned problem as instead of removing
the extents from the end of file, it starts removing the blocks from
the particular extent from which removing blocks is actually required
and continue backward until done.

Signed-off-by: Ashish Sangwan <ashish.sangwan2@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@gmail.com>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Cc: stable@vger.kernel.org
2012-07-22 22:49:08 -04:00
Artem Bityutskiy
b50924c2c6 ext4: remove unnecessary argument from __ext4_handle_dirty_metadata()
The '__ext4_handle_dirty_metadata()' does not need the 'now' argument
anymore and we can kill it.

Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2012-07-22 20:37:31 -04:00
Artem Bityutskiy
4d47603d97 ext4: weed out ext4_write_super
We do not depend on VFS's '->write_super()' anymore and do not need
the 's_dirt' flag anymore, so weed out 'ext4_write_super()' and
's_dirt'.

Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2012-07-22 20:35:31 -04:00
Artem Bityutskiy
58c5873a76 ext4: remove unnecessary superblock dirtying
This patch changes the 'ext4_handle_dirty_super()' function which
submits the superblock for I/O in the following cases:

1. When creating the first large file on a file system without
   EXT4_FEATURE_RO_COMPAT_LARGE_FILE feature.
2. When re-sizing the file-system.
3. When creating an xattr on a file-system without the
   EXT4_FEATURE_COMPAT_EXT_ATTR feature.

If the file-system has journal enabled, the superblock is written via
the journal. We do not modify this path.

If the file-system has no journal, this function, falls back to just
marking the superblock as dirty using the 's_dirt' superblock
flag. This means that it delays the actual superblock I/O submission
by 5 seconds (default setting).  Namely, the 'sync_supers()' kernel
thread will call 'ext4_write_super()' later and will actually submit
the superblock for I/O.

And this is the behavior this patch modifies: we stop using 's_dirt'
and just mark the superblock buffer as dirty right away. Indeed, all 3
cases above are extremely rare and it does not add any value to delay
the I/O submission for them.

Note: 'ext4_handle_dirty_super()' executes
'__ext4_handle_dirty_super()' with 'now = 0'. This patch basically
makes the 'now' argument unneeded and it will be deleted in one of the
next patches.

This patch also removes 's_dirt' condition on the unmount path because
we never set it anymore, so we should not test it.

Tested using xfstests for both journalled and non-journalled ext4.

Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2012-07-22 20:33:31 -04:00
Jan Kara
044ce47fec ext4: convert last user of ext4_mark_super_dirty() to ext4_handle_dirty_super()
The last user of ext4_mark_super_dirty() in ext4_file_open() is so
rare it can well be modifying the superblock properly by journalling
the change.  Change it and get rid of ext4_mark_super_dirty() as it's
not needed anymore.

Artem: small amendments.
Artem: tested using xfstests for both journalled and non-journalled ext4.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Tested-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
2012-07-22 20:31:31 -04:00
Jan Kara
97a7406880 ext4: remove useless marking of superblock dirty
Commit a0375156 properly notes that superblock doesn't need to be marked
as dirty when only number of free inodes / blocks / number of directories
changes since that is recomputed on each mount anyway. However that comment
leaves some unnecessary markings as dirty in place. Remove these.

Artem: tested using xfstests for both journalled and non-journalled ext4.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Tested-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
2012-07-22 20:29:31 -04:00
Al Viro
254706056b ext4: fix ext4 mismerge back in January
Duplicate caused, AFAICS, by mismerge in
ff9cb1c4eead5e4c292e75cd3170a82d66944101>

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2012-07-22 20:27:31 -04:00
Theodore Ts'o
3108b54bce ext4: remove dynamic array size in ext4_chksum()
The ext4_checksum() inline function was using a dynamic array size,
which is not legal C.  (It is a gcc extension).

Remove it.

Cc: "Darrick J. Wong" <djwong@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-07-22 20:25:31 -04:00
Theodore Ts'o
8a9918497b ext4: remove unused variable in ext4_update_super()
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-07-22 20:23:31 -04:00
Aditya Kali
7c319d3285 ext4: make quota as first class supported feature
This patch adds support for quotas as a first class feature in ext4;
which is to say, the quota files are stored in hidden inodes as file
system metadata, instead of as separate files visible in the file system
directory hierarchy.

It is based on the proposal at:                                                                                                           
https://ext4.wiki.kernel.org/index.php/Design_For_1st_Class_Quota_in_Ext4

This patch introduces a new feature - EXT4_FEATURE_RO_COMPAT_QUOTA
which, when turned on, enables quota accounting at mount time
iteself. Also, the quota inodes are stored in two additional superblock
fields.  Some changes introduced by this patch that should be pointed
out are:

1) Two new ext4-superblock fields - s_usr_quota_inum and
   s_grp_quota_inum for storing the quota inodes in use.
2) Default quota inodes are: inode#3 for tracking userquota and inode#4
   for tracking group quota. The superblock fields can be set to use
   other inodes as well.
3) If the QUOTA feature and corresponding quota inodes are set in
   superblock, the quota usage tracking is turned on at mount time. On
   'quotaon' ioctl, the quota limits enforcement is turned
   on. 'quotaoff' ioctl turns off only the limits enforcement in this
   case.
4) When QUOTA feature is in use, the quota mount options 'quota',
   'usrquota', 'grpquota' are ignored by the kernel.
5) mke2fs or tune2fs can be used to set the QUOTA feature and initialize
   quota inodes. The default reserved inodes will not be visible to user
   as regular files.
6) The quota-tools will need to be modified to support hidden quota
   files on ext4. E2fsprogs will also include support for creating and
   fixing quota files.
7) Support is only for the new V2 quota file format.

Tested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johann Lombardi <johann@whamcloud.com>
Signed-off-by: Aditya Kali <adityakali@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-07-22 20:21:31 -04:00
Zheng Liu
4bd809dbbf ext4: don't take the i_mutex lock when doing DIO overwrites
Aligned and overwrite direct I/O can be parallelized.  In
ext4_file_dio_write, we first check whether these conditions are
satisfied or not.  If so, we take i_data_sem and release i_mutex lock
directly.  Meanwhile iocb->private is set to indicate that this is a
dio overwrite, and it will be handled in ext4_ext_direct_IO.

[ Added fix from Dan Carpenter to fix locking bug on the error path. ]

CC: Tao Ma <tm@tao.ma>
CC: Eric Sandeen <sandeen@redhat.com>
CC: Robin Dong <hao.bigrat@gmail.com>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
2012-07-22 20:19:31 -04:00
Al Viro
8cae6f7158 ext4: switch EXT4_IOC_RESIZE_FS to mnt_want_write_file()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-23 00:01:55 +04:00
Al Viro
11e62a8fab btrfs: switch btrfs_ioctl_balance() to mnt_want_write_file()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-23 00:01:43 +04:00
Al Viro
765927b2d5 switch dentry_open() to struct path, make it grab references itself
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-23 00:01:29 +04:00
Al Viro
3b8b487114 ecryptfs: don't reinvent the wheels, please - use struct completion
... and keep the sodding requests on stack - they are small enough.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-23 00:01:02 +04:00
Al Viro
8fc37ec54c don't expose I_NEW inodes via dentry->d_inode
d_instantiate(dentry, inode);
	unlock_new_inode(inode);

is a bad idea; do it the other way round...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-23 00:00:58 +04:00
Al Viro
32a7991b6a tidy up namei.c a bit
locking/unlocking for rcu walk taken to a couple of inline helpers

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-23 00:00:55 +04:00
Al Viro
3c0a616368 unobfuscate follow_up() a bit
really convoluted test in there has grown up during struct mount
introduction; what it checks is that we'd reached the root of
mount tree.
2012-07-23 00:00:45 +04:00
Eric Sandeen
de9b942202 ext3: pass custom EOF to generic_file_llseek_size()
Use the new custom EOF argument to generic_file_llseek_size so
that SEEK_END will go to the max hash value for htree dirs
in ext3 rather than to i_size_read()

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-23 00:00:30 +04:00
Eric Sandeen
ec7268ce21 ext4: use core vfs llseek code for dir seeks
Use the new functionality in generic_file_llseek_size() to
accept a custom EOF position, and un-cut-and-paste all the
vfs llseek code from ext4.

Also fix up comments on ext4_llseek() to reflect reality.

Signed-off-by: Eric Sandeen <sandeen@redaht.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-23 00:00:28 +04:00
Eric Sandeen
e8b96eb503 vfs: allow custom EOF in generic_file_llseek code
For ext3/4 htree directories, using the vfs llseek function with
SEEK_END goes to i_size like for any other file, but in reality
we want the maximum possible hash value.  Recent changes
in ext4 have cut & pasted generic_file_llseek() back into fs/ext4/dir.c,
but replicating this core code seems like a bad idea, especially
since the copy has already diverged from the vfs.

This patch updates generic_file_llseek_size to accept
both a custom maximum offset, and a custom EOF position.  With this
in place, ext4_dir_llseek can pass in the appropriate maximum hash
position for both maxsize and eof, and get what it wants.

As far as I know, this does not fix any bugs - nfs in the kernel
doesn't use SEEK_END, and I don't know of any user who does.  But
some ext4 folks seem keen on doing the right thing here, and I can't
really argue.

(Patch also fixes up some comments slightly)

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-23 00:00:15 +04:00
Jan Kara
4ea425b63a vfs: Avoid unnecessary WB_SYNC_NONE writeback during sys_sync and reorder sync passes
wakeup_flusher_threads(0) will queue work doing complete writeback for each
flusher thread. Thus there is not much point in submitting another work doing
full inode WB_SYNC_NONE writeback by writeback_inodes_sb().

After this change it does not make sense to call nonblocking ->sync_fs and
block device flush before calling sync_inodes_sb() because
wakeup_flusher_threads() is completely asynchronous and thus these functions
would be called in parallel with inode writeback running which will effectively
void any work they do. So we move sync_inodes_sb() call before these two
functions.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-22 23:59:01 +04:00
Jan Kara
d0e91b13eb vfs: Remove unnecessary flushing of block devices
It is not necessary to write block devices twice. The reason why we first did
flush and then proper sync is that
  for_each_bdev() {
    write_bdev()
    wait_for_completion()
  }
is much slower than
  for_each_bdev()
    write_bdev()
  for_each_bdev()
    wait_for_completion()
when there is bigger amount of data. But as is seen in the above, there's no real
need to scan pages and submit them twice. We just need to separate the submission
and waiting part. This patch does that.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-22 23:58:53 +04:00
Jan Kara
a8c7176b6d vfs: Make sys_sync writeout also block device inodes
In case block device does not have filesystem mounted on it, sys_sync will just
ignore it and doesn't writeout its dirty pages. This is because writeback code
avoids writing inodes from superblock without backing device and
blockdev_superblock is such a superblock.  Since it's unexpected that sync
doesn't writeout dirty data for block devices be nice to users and change the
behavior to do so. So now we iterate over all block devices on blockdev_super
instead of iterating over all superblocks when syncing block devices.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-22 23:58:49 +04:00
Jan Kara
5c0d6b60a0 vfs: Create function for iterating over block devices
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-22 23:58:45 +04:00
Jan Kara
b3de653105 vfs: Reorder operations during sys_sync
Change the order of operations during sync from

for_each_sb {
        writeback_inodes_sb();
        sync_fs(nowait);
        __sync_blockdev(nowait);
}
for_each_sb {
        sync_inodes_sb();
        sync_fs(wait);
        __sync_blockdev(wait);
}

to

for_each_sb
        writeback_inodes_sb();
for_each_sb
        sync_fs(nowait);
for_each_sb
        __sync_blockdev(nowait);
for_each_sb
        sync_inodes_sb();
for_each_sb
        sync_fs(wait);
for_each_sb
        __sync_blockdev(wait);

This is a preparation for the following patches in this series.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-22 23:58:41 +04:00
Jan Kara
a117782571 quota: Move quota syncing to ->sync_fs method
Since the moment writes to quota files are using block device page cache and
space for quota structures is reserved at the moment they are first accessed we
have no reason to sync quota before inode writeback. In fact this order is now
only harmful since quota information can easily change during inode writeback
(either because conversion of delayed-allocated extents or simply because of
allocation of new blocks for simple filesystems not using page_mkwrite).

So move syncing of quota information after writeback of inodes into ->sync_fs
method. This way we do not have to use ->quota_sync callback which is primarily
intended for use by quotactl syscall anyway and we get rid of calling
->sync_fs() twice unnecessarily. We skip quota syncing for OCFS2 since it does
proper quota journalling in all cases (unlike ext3, ext4, and reiserfs which
also support legacy non-journalled quotas) and thus there are no dirty quota
structures.

CC: "Theodore Ts'o" <tytso@mit.edu>
CC: Joel Becker <jlbec@evilplan.org>
CC: reiserfs-devel@vger.kernel.org
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Acked-by: Dave Kleikamp <shaggy@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-22 23:58:34 +04:00
Jan Kara
ceed17236a quota: Split dquot_quota_sync() to writeback and cache flushing part
Split off part of dquot_quota_sync() which writes dquots into a quota file
to a separate function. In the next patch we will use the function from
filesystems and we do not want to abuse ->quota_sync quotactl callback more
than necessary.

Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-22 23:58:19 +04:00
Jan Kara
6eedc70150 vfs: Move noop_backing_dev_info check from sync into writeback
In principle, a filesystem may want to have ->sync_fs() called during sync(1)
although it does not have a bdi (i.e. s_bdi is set to noop_backing_dev_info).
Only writeback code really needs bdi set to something reasonable. So move the
checks where they are more logical.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-22 23:58:18 +04:00
Artem Bityutskiy
9e9ad5f408 fs/ufs: get rid of write_super
This patch makes UFS stop using the VFS '->write_super()' method along with
the 's_dirt' superblock flag, because they are on their way out.

The way we implement this is that we schedule a delay job instead relying on
's_dirt' and '->write_super()'.

The whole "superblock write-out" VFS infrastructure is served by the
'sync_supers()' kernel thread, which wakes up every 5 (by default) seconds and
writes out all dirty superblocks using the '->write_super()' call-back.  But the
problem with this thread is that it wastes power by waking up the system every
5 seconds, even if there are no diry superblocks, or there are no client
file-systems which would need this (e.g., btrfs does not use
'->write_super()'). So we want to kill it completely and thus, we need to make
file-systems to stop using the '->write_super()' VFS service, and then remove
it together with the kernel thread.

Tested using fsstress from the LTP project.

Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-22 23:58:16 +04:00
Artem Bityutskiy
7bd54ef722 fs/ufs: re-arrange the code a bit
This patch does not do any functional changes. It only moves 3 functions
in fs/ufs/super.c a little bit up in order to prepare for further changes
where I'll need this new arrangement to avoid forward declarations.

Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-22 23:58:14 +04:00
Artem Bityutskiy
65e5e83f7d fs/ufs: remove extra superblock write on unmount
UFS calls 'ufs_write_super()' from 'ufs_put_super()' in order to write the
superblocks to the media. However, it is not needed because VFS calls
'->sync_fs()' before calling '->put_super()' - so by the time we are in
'ufs_write_super()', the superblocks are already synchronized.

Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-22 23:58:14 +04:00
Artem Bityutskiy
9d46be294d fs/sysv: stop using write_super and s_dirt
It does not look like sysv FS needs 'write_super()' at all, because all it
does is a timestamp update. I cannot test this patch, because this
file-system is so old and probably has not been used by anyone for years,
so there are no tools to create it in Linux. But from the code I see that
marking the superblock as dirty is basically marking the superblock buffers as
drity and then setting the s_dirt flag. And when 'write_super()' is executed to
handle the s_dirt flag, we just update the timestamp and again mark the
superblock buffer as dirty. Seems pointless.

It looks like we can update the timestamp more opprtunistically - on unmount
or remount of sync, and nothing should change.

Thus, this patch removes 'sysv_write_super()' and 's_dirt'.

Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-22 23:58:12 +04:00
Artem Bityutskiy
eee458936b fs/sysv: remove another useless write_super call
We do not need to call 'sysv_write_super()' from 'sysv_remount()',
because VFS has called 'sysv_sync_fs()' before calling '->remount()'.
So remove it. Remove also '(un)lock_super()' which obvioulsy is becoming
useless in this function.

Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-22 23:58:11 +04:00
Artem Bityutskiy
a4d05d315a fs/sysv: remove useless write_super call
We do not need to call 'sysv_write_super()' from 'sysv_put_super()',
because VFS has called 'sysv_sync_fs()' before calling '->put_super()'.
So remove it.

Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-22 23:58:10 +04:00
Artem Bityutskiy
5687b5780e hfs: get rid of hfs_sync_super
This patch makes hfs stop using the VFS '->write_super()' method along with
the 's_dirt' superblock flag, because they are on their way out.

The whole "superblock write-out" VFS infrastructure is served by the
'sync_supers()' kernel thread, which wakes up every 5 (by default) seconds and
writes out all dirty superblocks using the '->write_super()' call-back.  But the
problem with this thread is that it wastes power by waking up the system every
5 seconds, even if there are no diry superblocks, or there are no client
file-systems which would need this (e.g., btrfs does not use
'->write_super()'). So we want to kill it completely and thus, we need to make
file-systems to stop using the '->write_super()' VFS service, and then remove
it together with the kernel thread.

Tested using fsstress from the LTP project.

Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-22 23:58:09 +04:00
Artem Bityutskiy
b16ca62635 hfs: introduce VFS superblock object back-reference
Add an 'sb' VFS superblock back-reference to the 'struct hfs_sb_info' data
structure - we will need to find the VFS superblock from a
'struct hfs_sb_info' object in the next patch, so this change is jut a
preparation.

Remove few useless newlines while on it.

Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-22 23:58:08 +04:00
Artem Bityutskiy
4527440d5d hfs: simplify a bit checking for R/O
We have the following pattern in 2 places in HFS

if (!RDONLY)
	hfs_mdb_commit();

This patch pushes the RDONLY check down to 'hfs_mdb_commit()'. This will
make the following patches a bit simpler.

Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-22 23:58:07 +04:00
Artem Bityutskiy
a3742d4828 hfs: remove extra mdb write on unmount
HFS calls 'hfs_write_super()' from 'hfs_put_super()' in order to write the MDB
to the media. However, it is not needed because VFS calls '->sync_fs()' before
calling '->put_super()' - so by the time we are in 'hfs_write_super()', the MDB
is already synchronized.

Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-22 23:58:07 +04:00
Artem Bityutskiy
b59352359d hfs: get rid of lock_super
Stop using lock_super for serializing the MDB changes - use the buffer-head own
lock instead. Tested with fsstress.

Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-22 23:58:06 +04:00
Artem Bityutskiy
715189d836 hfs: push lock_super down
HFS uses 'lock_super()'/'unlock_super()' around 'hfs_mdb_commit()' in order
to serialize MDB (Master Directory Block) changes. Push it down to
'hfs_mdb_commit()' in order to simplify the code a bit.

Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-22 23:58:05 +04:00