Commit graph

15797 commits

Author SHA1 Message Date
Jan Kara
18f2ee705d vfs: Remove generic_osync_inode() and sync_page_range{_nolock}()
Remove these three functions since nobody uses them anymore.

Signed-off-by: Jan Kara <jack@suse.cz>
2009-09-14 17:08:17 +02:00
Jan Kara
2f3d675bcd fat: Opencode sync_page_range_nolock()
fat_cont_expand() is the only user of sync_page_range_nolock(). It's also the
only user of generic_osync_inode() which does not have a file open.  So
opencode needed actions for FAT so that we can convert generic_osync_inode() to
a standard syncing path.

Update a comment about generic_osync_inode().

CC: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Jan Kara <jack@suse.cz>
2009-09-14 17:08:17 +02:00
Jan Kara
af0f4414f3 xfs: Convert sync_page_range() to simple filemap_write_and_wait_range()
Christoph Hellwig says that it is enough for XFS to call
filemap_write_and_wait_range() instead of sync_page_range() because we do
all the metadata syncing when forcing the log.

CC: Felix Blyakher <felixb@sgi.com>
CC: xfs@oss.sgi.com
CC: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2009-09-14 17:08:17 +02:00
Jan Kara
d23c937b0f ocfs2: Update syncing after splicing to match generic version
Update ocfs2 specific splicing code to use generic syncing helper. The sync now
does not happen under rw_lock because generic_write_sync() acquires i_mutex
which ranks above rw_lock. That should not matter because standard fsync path
does not hold it either.

Acked-by: Joel Becker <Joel.Becker@oracle.com>
Acked-by: Mark Fasheh <mfasheh@suse.com>
CC: ocfs2-devel@oss.oracle.com
Signed-off-by: Jan Kara <jack@suse.cz>
2009-09-14 17:08:16 +02:00
Jan Kara
ebbbf757c6 ntfs: Use new syncing helpers and update comments
Use new syncing helpers in .write and .aio_write functions. Also
remove superfluous syncing in ntfs_file_buffered_write() and update
comments about generic_osync_inode().

CC: Anton Altaparmakov <aia21@cantab.net>
CC: linux-ntfs-dev@lists.sourceforge.net
Signed-off-by: Jan Kara <jack@suse.cz>
2009-09-14 17:08:16 +02:00
Jan Kara
0d34ec62e1 ext4: Remove syncing logic from ext4_file_write
The syncing is now properly handled by generic_file_aio_write() so
no special ext4 code is needed.

CC: linux-ext4@vger.kernel.org
CC: tytso@mit.edu
Signed-off-by: Jan Kara <jack@suse.cz>
2009-09-14 17:08:16 +02:00
Jan Kara
e367626b61 ext3: Remove syncing logic from ext3_file_write
Syncing is now properly done by generic_file_aio_write() so no special logic is
needed in ext3.

CC: linux-ext4@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
2009-09-14 17:08:16 +02:00
Jan Kara
a2a735ad66 ext2: Update comment about generic_osync_inode
We rely on generic_write_sync() now.

CC: linux-ext4@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
2009-09-14 17:08:16 +02:00
Jan Kara
148f948ba8 vfs: Introduce new helpers for syncing after writing to O_SYNC file or IS_SYNC inode
Introduce new function for generic inode syncing (vfs_fsync_range) and use
it from fsync() path. Introduce also new helper for syncing after a sync
write (generic_write_sync) using the generic function.

Use these new helpers for syncing from generic VFS functions. This makes
O_SYNC writes to block devices acquire i_mutex for syncing. If we really
care about this, we can make block_fsync() drop the i_mutex and reacquire
it before it returns.

CC: Evgeniy Polyakov <zbr@ioremap.net>
CC: ocfs2-devel@oss.oracle.com
CC: Joel Becker <joel.becker@oracle.com>
CC: Felix Blyakher <felixb@sgi.com>
CC: xfs@oss.sgi.com
CC: Anton Altaparmakov <aia21@cantab.net>
CC: linux-ntfs-dev@lists.sourceforge.net
CC: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
CC: linux-ext4@vger.kernel.org
CC: tytso@mit.edu
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2009-09-14 17:08:15 +02:00
Christoph Hellwig
eef9938067 vfs: Rename generic_file_aio_write_nolock
generic_file_aio_write_nolock() is now used only by block devices and raw
character device. Filesystems should use __generic_file_aio_write() in case
generic_file_aio_write() doesn't suit them. So rename the function to
blkdev_aio_write() and move it to fs/blockdev.c.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2009-09-14 17:08:15 +02:00
Jan Kara
918941a3f3 ocfs2: Use __generic_file_aio_write instead of generic_file_aio_write_nolock
Use the new helper. We have to submit data pages ourselves in case of O_SYNC
write because __generic_file_aio_write does not do it for us. OCFS2 developpers
might think about moving the sync out of i_mutex which seems to be easily
possible but that's out of scope of this patch.

CC: ocfs2-devel@oss.oracle.com
Acked-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2009-09-14 17:08:15 +02:00
Ryusuke Konishi
41f4db0f48 fs/Kconfig: move nilfs2 outside misc filesystems
Some people asked me questions like the following:

On Wed, 15 Jul 2009 13:11:21 +0200, Leon Woestenberg wrote:
> just wondering, any reasons why NILFS2 is one of the miscellaneous
> filesystems and, for example, btrfs, is not in Kconfig?

Actually, nilfs is NOT a filesystem came from other operating systems,
but a filesystem created purely for Linux.  Nor is it a flash
filesystem but that for generic block devices.

So, this moves nilfs outside the misc category as I responded in LKML
"Re: Why does NILFS2 hide under Miscellaneous filesystems?"
(Message-Id: <20090716.002526.93465395.ryusuke@osrg.net>).

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-09-14 18:27:16 +09:00
Ryusuke Konishi
0f3fe33b39 nilfs2: convert nilfs_bmap_lookup to an inline function
The nilfs_bmap_lookup() is now a wrapper function of
nilfs_bmap_lookup_at_level().

This moves the nilfs_bmap_lookup() to a header file converting it to
an inline function and gives an opportunity for optimization.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-09-14 18:27:16 +09:00
Ryusuke Konishi
2e0c2c7392 nilfs2: allow btree code to directly call dat operations
The current btree code is written so that btree functions call dat
operations via wrapper functions in bmap.c when they allocate, free,
or modify virtual block addresses.

This abstraction requires additional function calls and causes
frequent call of nilfs_bmap_get_dat() function since it is used in the
every wrapper function.

This removes the wrapper functions and makes them available from
btree.c and direct.c, which will increase the opportunity of
compiler optimization.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-09-14 18:27:16 +09:00
Ryusuke Konishi
bd8169efae nilfs2: add update functions of virtual block address to dat
This is a preparation for the successive cleanup ("nilfs2: allow btree
to directly call dat operations").

This adds functions bundling a few operations to change an entry of
virtual block address on the dat file.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-09-14 18:27:15 +09:00
Ryusuke Konishi
7a102b0923 nilfs2: remove individual gfp constants for each metadata file
This gets rid of NILFS_CPFILE_GFP, NILFS_SUFILE_GFP, NILFS_DAT_GFP,
and NILFS_IFILE_GFP.  All of these constants refer to NILFS_MDT_GFP,
and can be removed.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-09-14 18:27:15 +09:00
Ryusuke Konishi
3218929dbd nilfs2: stop zero-fill of btree path just before free it
The btree path object is cleared just before it is freed.

This will remove the code doing the unnecessary clear operation.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-09-14 18:27:15 +09:00
Ryusuke Konishi
6d28f7ea43 nilfs2: remove unused btree argument from btree functions
Even though many btree functions take a btree object as their first
argument, most of them are not used in their functions.

This sticky use of the btree argument is hurting code readability and
giving the possibility of inefficient code generation.

So, this removes the unnecessary btree arguments.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-09-14 18:27:15 +09:00
Ryusuke Konishi
9ead986373 nilfs2: remove nilfs_dat_abort_start and nilfs_dat_abort_free
These functions are not called from any functions.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-09-14 18:27:15 +09:00
Jiro SEKIBA
1cf58fa840 nilfs2: shorten freeze period due to GC in write operation v3
This is a re-revised patch to shorten freeze period.
This version include a fix of the bug Konishi-san mentioned last time.

When GC is runnning, GC moves live block to difference segments.
Copying live blocks into memory is done in a transaction,
however it is not necessarily to be in the transaction.
This patch will get the nilfs_ioctl_move_blocks() out from
transaction lock and put it before the transaction.

I ran sysbench fileio test against nilfs partition.
I copied some DVD/CD images and created snapshot to create live blocks
before starting the benchmark.

Followings are summary of rc8 and rc8 w/ the patch of per-request
statistics, which is min/max and avg.  I ran each test three times and
bellow is average of those numers.

According to this benchmark result, average time is slightly degrated.
However, worstcase (max) result is significantly improved.
This can address a few seconds write freeze.

- random write per-request performance of rc8
 min   0.843ms
 max 680.406ms
 avg   3.050ms
- random write per-request performance of rc8 w/ this patch
 min   0.843ms -> 100.00%
 max 380.490ms ->  55.90%
 avg   3.233ms -> 106.00%

- sequential write per-request performance of rc8
 min   0.736ms
 max 774.343ms
 avg   2.883ms
- sequential write per-request performance of rc8 w/ this patch
 min   0.720ms ->  97.80%
 max  644.280ms->  83.20%
 avg   3.130ms -> 108.50%

-----8<-----8<-----nilfs_cleanerd.conf-----8<-----8<-----
protection_period       150
selection_policy        timestamp       # timestamp in ascend order
nsegments_per_clean     2
cleaning_interval       2
retry_interval          60
use_mmap
log_priority            info
-----8<-----8<-----nilfs_cleanerd.conf-----8<-----8<-----

Signed-off-by: Jiro SEKIBA <jir@unicus.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-09-14 18:27:15 +09:00
Zhu Yanhai
43be0ec038 nilfs2: add more check routines in mount process
nilfs2: Add more safeguard routines and protections in mount process,
which also makes nilfs2 report consistency error messages when
checkpoint number is invalid.

Signed-off-by: Zhu Yanhai <zhu.yanhai@gmail.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-09-14 18:27:14 +09:00
Zhang Qiang
a4f0b9c5b4 nilfs2: An unassigned variable is assigned to a never used structure member
nilfs2: In procedure 'nilfs_get_sb()', when a nilfs filesysttem is
mounted for the first time, local variable 'nilfs->ns_last_cno' is
used before loading the latest checkpoint number from disk (in
'nilfs_fill_super'). 'nilfs->ns_last_cno' is assigned to 'sd.cno', but
'sd.cno' has never been used in the procedure.

Signed-off-by: Zhang Qiang <zhangqiang.buaa@gmail.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-09-14 18:27:14 +09:00
Ryusuke Konishi
c1b353f04a nilfs2: use GFP_NOIO for bio_alloc instead of GFP_NOWAIT
Alberto Bertogli advised me about bio_alloc() use in nilfs:
On Sat, 13 Jun 2009 22:52:40 -0300, Alberto Bertogli wrote:
> By the way, those bio_alloc()s are using GFP_NOWAIT but it looks
> like they could use at least GFP_NOIO or GFP_NOFS, since the caller
> can (and sometimes do) sleep. The only caller is nilfs_submit_bh(),
> which calls nilfs_submit_seg_bio() which can sleep calling
> wait_for_completion().

This takes in the comment and replaces the use of GFP_NOWAIT flag with
GFP_NOIO.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-09-14 18:27:14 +09:00
Jiro SEKIBA
1dfa27105a nilfs2: stop using periodic write_super callback
This removes nilfs_write_super and commit super block in nilfs
internal thread, instead of periodic write_super callback.

VFS layer calls ->write_super callback periodically.  However,
it looks like that calling back is ommited when disk I/O is busy.
And when cleanerd (nilfs GC) is runnig, disk I/O tend to be busy thus
nilfs superblock is not synchronized as nilfs designed.

To avoid it, syncing superblock by nilfs thread instead of pdflush.

Signed-off-by: Jiro SEKIBA <jir@unicus.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-09-14 18:27:14 +09:00
Jiro SEKIBA
79efdd9411 nilfs2: clean up nilfs_write_super
Separate conditions that check if syncing super block and alternative
super block are required as inline functions to reuse the conditions.

Signed-off-by: Jiro SEKIBA <jir@unicus.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-09-14 18:27:14 +09:00
Jiro SEKIBA
6233caa9d5 nilfs2: fix disorder of nilfs_write_super in nilfs_sync_fs
This fixes disorder of nilfs_write_super in nilfs_sync_fs.  Commiting
super block must be the end of the function so that every changes are
reflected.

->sync_fs() is not called frequently so this makes nilfs_sync_fs call
nilfs_commit_super instead of nilfs_write_super.

Signed-off-by: Jiro SEKIBA <jir@unicus.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-09-14 18:27:14 +09:00
Jiro SEKIBA
ec5d66abdb nilfs2: remove redundant super block commit
This removes redundant super block commit.

nilfs_write_super will call nilfs_commit_super to store super block
into block device.  However, nilfs_put_super will call
nilfs_commit_super right after calling nilfs_write_super.  So calling
nilfs_write_super in nilfs_put_super would be redundant.

Signed-off-by: Jiro SEKIBA <jir@unicus.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-09-14 18:27:13 +09:00
Jiro SEKIBA
b58a285ba4 nilfs2: implement nilfs_show_options to display mount options in /proc/mounts
This is a patch to display mount options in procfs.
Mount options will show up in the /proc/mounts as other fs does.

...
/dev/sda6 /mnt nilfs2 ro,relatime,barrier=off,cp=3,order=strict 0 0
...

Signed-off-by: Jiro SEKIBA <jir@unicus.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-09-14 18:27:13 +09:00
Ryusuke Konishi
1435110467 nilfs2: always lookup disk block address before reading metadata block
The current metadata file code skips disk address lookup for its data
block if the buffer has a mapped flag.

This has a potential risk to cause read request to be performed
against the stale block address that GC moved, and it may lead to meta
data corruption.  The mapped flag is safe if the buffer has an
uptodate flag, otherwise it may prevent necessary update of disk
address in the next read.

This will avoid the potential problem by ensuring disk address lookup
before reading metadata block even for buffers with the mapped flag.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-09-14 18:27:13 +09:00
Ryusuke Konishi
027d6404eb nilfs2: use semaphore to protect pointer to a writable FS-instance
will get rid of nilfs_get_writer() and nilfs_put_writer() pair used to
retain a writable FS-instance for a period.

The pair functions were making up some kind of recursive lock with a
mutex, but they became overkill since the commit
201913ed74.  Furthermore, they caused
the following lockdep warning because the mutex can be released by a
task which didn't lock it:

 =====================================
 [ BUG: bad unlock balance detected! ]
 -------------------------------------
 kswapd0/422 is trying to release lock (&nilfs->ns_writer_mutex) at:
 [<c1359ff5>] mutex_unlock+0x8/0xa
 but there are no more locks to release!

 other info that might help us debug this:
 no locks held by kswapd0/422.

 stack backtrace:
 Pid: 422, comm: kswapd0 Not tainted 2.6.31-rc4-nilfs #51
 Call Trace:
  [<c1358f97>] ? printk+0xf/0x18
  [<c104fea7>] print_unlock_inbalance_bug+0xcc/0xd7
  [<c11578de>] ? prop_put_global+0x3/0x35
  [<c1050195>] lock_release+0xed/0x1dc
  [<c1359ff5>] ? mutex_unlock+0x8/0xa
  [<c1359f83>] __mutex_unlock_slowpath+0xaf/0x119
  [<c1359ff5>] mutex_unlock+0x8/0xa
  [<d1284add>] nilfs_mdt_write_page+0xd8/0xe1 [nilfs2]
  [<c1092653>] shrink_page_list+0x379/0x68d
  [<c109171b>] ? isolate_pages_global+0xb4/0x18c
  [<c1092bd2>] shrink_list+0x26b/0x54b
  [<c10930be>] shrink_zone+0x20c/0x2a2
  [<c10936b7>] kswapd+0x407/0x591
  [<c1091667>] ? isolate_pages_global+0x0/0x18c
  [<c1040603>] ? autoremove_wake_function+0x0/0x33
  [<c10932b0>] ? kswapd+0x0/0x591
  [<c104033b>] kthread+0x69/0x6e
  [<c10402d2>] ? kthread+0x0/0x6e
  [<c1003e33>] kernel_thread_helper+0x7/0x1a

This patch uses a reader/writer semaphore instead of the own lock and
kills this warning.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-09-14 18:27:13 +09:00
Heiko Carstens
b5696e5e0d nilfs2: fix format string compile warning (ino_t)
Unlike on most other architectures ino_t is an unsigned int on s390.
So add an explicit cast to avoid this compile warning:

fs/nilfs2/recovery.c: In function 'recover_dsync_blocks':
fs/nilfs2/recovery.c:555: warning: format '%lu' expects type 'long unsigned int', but argument 3 has type 'ino_t'

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-09-14 18:27:13 +09:00
Ryusuke Konishi
1b2f5a641b nilfs2: fix ignored error code in __nilfs_read_inode()
The __nilfs_read_inode function is ignoring the error code returned
from nilfs_read_inode_common(), and wrongly delivers a success code
(zero) when it escapes from the function in erroneous cases.

This adds the missing error handling.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-09-14 18:27:12 +09:00
Steven Whitehouse
86d0063656 GFS2: Whitespace fixes
Reported-by: Daniel Walker <dwalker@fifo99.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-09-14 09:50:57 +01:00
Christoph Hellwig
746cd1e7e4 block: use blkdev_issue_discard in blk_ioctl_discard
blk_ioctl_discard duplicates large amounts of code from blkdev_issue_discard,
the only difference between the two is that blkdev_issue_discard needs to
send a barrier discard request and blk_ioctl_discard a non-barrier one,
and blk_ioctl_discard needs to wait on the request.  To facilitates this
add a flags argument to blkdev_issue_discard to control both aspects of the
behaviour.  This will be very useful later on for using the waiting
funcitonality for other callers.

Based on an earlier patch from Matthew Wilcox <matthew@wil.cx>.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-09-14 08:24:53 +02:00
Nikanth Karthikesan
a9327cac44 Seperate read and write statistics of in_flight requests
Currently, there is a single in_flight counter measuring the number of
requests in the request_queue. But some monitoring tools would like to
know how many read requests and write requests are in progress. Split the
current in_flight counter into two seperate counters for read and write.

This information is exported as a sysfs attribute, as changing the
currently available stat files would break the existing tools.

Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-09-14 08:24:52 +02:00
Benny Halevy
4be36ca0ce nfsd4: fix whitespace in NFSPROC4_CLNT_CB_NULL definition
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-09-13 15:57:39 -04:00
Linus Torvalds
86d710146f Merge git://git.linux-nfs.org/projects/trondmy/nfs-2.6
* git://git.linux-nfs.org/projects/trondmy/nfs-2.6: (87 commits)
  NFSv4: Disallow 'mount -t nfs4 -overs=2' and 'mount -t nfs4 -overs=3'
  NFS: Allow the "nfs" file system type to support NFSv4
  NFS: Move details of nfs4_get_sb() to a helper
  NFS: Refactor NFSv4 text-based mount option validation
  NFS: Mount option parser should detect missing "port="
  NFS: out of date comment regarding O_EXCL above nfs3_proc_create()
  NFS: Handle a zero-length auth flavor list
  SUNRPC: Ensure that sunrpc gets initialised before nfs, lockd, etc...
  nfs: fix compile error in rpc_pipefs.h
  nfs: Remove reference to generic_osync_inode from a comment
  SUNRPC: cache must take a reference to the cache detail's module on open()
  NFS: Use the DNS resolver in the mount code.
  NFS: Add a dns resolver for use with NFSv4 referrals and migration
  SUNRPC: Fix a typo in cache_pipefs_files
  nfs: nfs4xdr: optimize low level decoding
  nfs: nfs4xdr: get rid of READ_BUF
  nfs: nfs4xdr: simplify decode_exchange_id by reusing decode_opaque_inline
  nfs: nfs4xdr: get rid of COPYMEM
  nfs: nfs4xdr: introduce decode_sessionid helper
  nfs: nfs4xdr: introduce decode_verifier helper
  ...
2009-09-11 16:39:11 -07:00
Chris Mason
83ebade34b Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable 2009-09-11 19:07:25 -04:00
Theodore Ts'o
7ad9bb651f ext4: Fix initalization of s_flex_groups
The s_flex_groups array should have been initialized using atomic_add
to sum up the free counts from the block groups that make up a
flex_bg.  By using atomic_set, the value of the s_flex_groups array
was set to the values of the last block group in the flex_bg.  

The impact of this bug is that the block and inode allocation
algorithms might not pick the best flex_bg for new allocation.

Thanks to Damien Guibouret for pointing out this problem!

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-11 16:51:28 -04:00
Linus Torvalds
774a694f8c Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (64 commits)
  sched: Fix sched::sched_stat_wait tracepoint field
  sched: Disable NEW_FAIR_SLEEPERS for now
  sched: Keep kthreads at default priority
  sched: Re-tune the scheduler latency defaults to decrease worst-case latencies
  sched: Turn off child_runs_first
  sched: Ensure that a child can't gain time over it's parent after fork()
  sched: enable SD_WAKE_IDLE
  sched: Deal with low-load in wake_affine()
  sched: Remove short cut from select_task_rq_fair()
  sched: Turn on SD_BALANCE_NEWIDLE
  sched: Clean up topology.h
  sched: Fix dynamic power-balancing crash
  sched: Remove reciprocal for cpu_power
  sched: Try to deal with low capacity, fix update_sd_power_savings_stats()
  sched: Try to deal with low capacity
  sched: Scale down cpu_power due to RT tasks
  sched: Implement dynamic cpu_power
  sched: Add smt_gain
  sched: Update the cpu_power sum during load-balance
  sched: Add SD_PREFER_SIBLING
  ...
2009-09-11 13:23:18 -07:00
Trond Myklebust
ab3bbaa8b2 Merge branch 'nfs-for-2.6.32' 2009-09-11 14:59:37 -04:00
Chris Mason
93c82d5750 Btrfs: zero page past end of inline file items
When btrfs_get_extent is reading inline file items for readpage,
it needs to copy the inline extent into the page.  If the
inline extent doesn't cover all of the page, that means there
is a hole in the file, or that our file is smaller than one
page.

readpage does zeroing for the case where the file is smaller than one
page, but nobody is currently zeroing for the case where there is
a hole after the inline item.

This commit changes btrfs_get_extent to zero fill the page past
the end of the inline item.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-11 13:31:08 -04:00
Chris Mason
50a9b214bc Btrfs: fix btrfs page_mkwrite to return locked page
This closes a whole where the page may be written before
the page_mkwrite caller has a chance to dirty it

(thanks to Nick Piggin)

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-11 13:31:08 -04:00
Chris Mason
a1ed835e1a Btrfs: Fix extent replacment race
Data COW means that whenever we write to a file, we replace any old
extent pointers with new ones.  There was a window where a readpage
might find the old extent pointers on disk and cache them in the
extent_map tree in ram in the middle of a given write replacing them.

Even though both the readpage and the write had their respective bytes
in the file locked, the extent readpage inserts may cover more bytes than
it had locked down.

This commit closes the race by keeping the new extent pinned in the extent
map tree until after the on-disk btree is properly setup with the new
extent pointers.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-11 13:31:07 -04:00
Chris Mason
8b62b72b26 Btrfs: Use PagePrivate2 to track pages in the data=ordered code.
Btrfs writes go through delalloc to the data=ordered code.  This
makes sure that all of the data is on disk before the metadata
that references it.  The tracking means that we have to make sure
each page in an extent is fully written before we add that extent into
the on-disk btree.

This was done in the past by setting the EXTENT_ORDERED bit for the
range of an extent when it was added to the data=ordered code, and then
clearing the EXTENT_ORDERED bit in the extent state tree as each page
finished IO.

One of the reasons we had to do this was because sometimes pages are
magically dirtied without page_mkwrite being called.  The EXTENT_ORDERED
bit is checked at writepage time, and if it isn't there, our page become
dirty without going through the proper path.

These bit operations make for a number of rbtree searches for each page,
and can cause considerable lock contention.

This commit switches from the EXTENT_ORDERED bit to use PagePrivate2.
As pages go into the ordered code, PagePrivate2 is set on each one.
This is a cheap operation because we already have all the pages locked
and ready to go.

As IO finishes, the PagePrivate2 bit is cleared and the ordered
accoutning is updated for each page.

At writepage time, if the PagePrivate2 bit is missing, we go into the
writepage fixup code to handle improperly dirtied pages.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-11 13:31:07 -04:00
Chris Mason
9655d2982b Btrfs: use a cached state for extent state operations during delalloc
This changes the btrfs code to find delalloc ranges in the extent state
tree to use the new state caching code from set/test bit.  It reduces
one of the biggest causes of rbtree searches in the writeback path.

test_range_bit is also modified to take the cached state as a starting
point while searching.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-11 13:31:07 -04:00
Chris Mason
d5550c6315 Btrfs: don't lock bits in the extent tree during writepage
At writepage time, we have the page locked and we have the
extent_map entry for this extent pinned in the extent_map tree.
So, the page can't go away and its mapping can't change.

There is no need for the extra extent_state lock bits during writepage.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-11 13:31:06 -04:00
Chris Mason
2c64c53d8d Btrfs: cache values for locking extents
Many of the btrfs extent state tree users follow the same pattern.
They lock an extent range in the tree, do some operation and then
unlock.

This translates to at least 2 rbtree searches, and maybe more if they
are doing operations on the extent state tree.  A locked extent
in the tree isn't going to be merged or changed, and so we can
safely return the extent state structure as a cached handle.

This changes set_extent_bit to give back a cached handle, and also
changes both set_extent_bit and clear_extent_bit to use the cached
handle if it is available.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-11 13:31:06 -04:00
Chris Mason
1edbb734b4 Btrfs: reduce CPU usage in the extent_state tree
Btrfs is currently mirroring some of the page state bits into
its extent state tree.  The goal behind this was to use it in supporting
blocksizes other than the page size.

But, we don't currently support that, and we're using quite a lot of CPU
on the rb tree and its spin lock.  This commit starts a series of
cleanups to reduce the amount of work done in the extent state tree as
part of each IO.

This commit:

* Adds the ability to lock an extent in the state tree and also set
other bits.  The idea is to do locking and delalloc in one call

* Removes the EXTENT_WRITEBACK and EXTENT_DIRTY bits.  Btrfs is using
a combination of the page bits and the ordered write code for this
instead.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-11 13:31:06 -04:00
Chris Mason
e48c465bb3 Btrfs: Fix new state initialization order
As the extent state tree is manipulated, there are call backs
that are used to take extra actions when different state bits are set
or cleared.  One example of this is a counter for the total number
of delayed allocation bytes in a single inode and in the whole FS.

When new states are inserted, this callback is being done before we
properly setup the new state.  This hasn't caused problems before
because the lock bit was always done first, and the existing call backs
don't care about the lock bit.

This patch makes sure the state is properly setup before using the
callback, which is important for later optimizations that do more work
without using the lock bit.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-11 13:31:05 -04:00