Commit graph

18447 commits

Author SHA1 Message Date
Josef Bacik
c2c6ca417e direct-io: do not merge logically non-contiguous requests
Btrfs cannot handle having logically non-contiguous requests submitted.  For
example if you have

Logical:  [0-4095][HOLE][8192-12287]
Physical: [0-4095]      [4096-8191]

Normally the DIO code would put these into the same BIO's.  The problem is we
need to know exactly what offset is associated with what BIO so we can do our
checksumming and unlocking properly, so putting them in the same BIO doesn't
work.  So add another check where we submit the current BIO if the physical
blocks are not contigous OR the logical blocks are not contiguous.

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-05-25 10:34:56 -04:00
Josef Bacik
facd07b07d direct-io: add a hook for the fs to provide its own submit_bio function
Because BTRFS can do RAID and such, we need our own submit hook so we can setup
the bio's in the correct fashion, and handle checksum errors properly.  So there
are a few changes here

1) The submit_io hook.  This is straightforward, just call this instead of
submit_bio.

2) Allow the fs to return -ENOTBLK for reads.  Usually this has only worked for
writes, since writes can fallback onto buffered IO.  But BTRFS needs the option
of falling back on buffered IO if it encounters a compressed extent, since we
need to read the entire extent in and decompress it.  So if we get -ENOTBLK back
from get_block we'll return back and fallback on buffered just like the write
case.

I've tested these changes with fsx and everything seems to work.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-05-25 10:34:55 -04:00
Yan, Zheng
3fd0a5585e Btrfs: Metadata ENOSPC handling for balance
This patch adds metadata ENOSPC handling for the balance code.
It is consisted by following major changes:

1. Avoid COW tree leave in the phrase of merging tree.

2. Handle interaction with snapshot creation.

3. make the backref cache can live across transactions.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-05-25 10:34:54 -04:00
Yan, Zheng
efa5646456 Btrfs: Pre-allocate space for data relocation
Pre-allocate space for data relocation. This can detect ENOPSC
condition caused by fragmentation of free space.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-05-25 10:34:53 -04:00
Yan, Zheng
4a500fd178 Btrfs: Metadata ENOSPC handling for tree log
Previous patches make the allocater return -ENOSPC if there is no
unreserved free metadata space. This patch updates tree log code
and various other places to propagate/handle the ENOSPC error.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-05-25 10:34:53 -04:00
Yan, Zheng
d68fc57b7e Btrfs: Metadata reservation for orphan inodes
reserve metadata space for handling orphan inodes

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-05-25 10:34:52 -04:00
Yan, Zheng
8929ecfa50 Btrfs: Introduce global metadata reservation
Reserve metadata space for extent tree, checksum tree and root tree

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-05-25 10:34:52 -04:00
Yan, Zheng
0ca1f7ceb1 Btrfs: Update metadata reservation for delayed allocation
Introduce metadata reservation context for delayed allocation
and update various related functions.

This patch also introduces EXTENT_FIRST_DELALLOC control bit for
set/clear_extent_bit. It tells set/clear_bit_hook whether they
are processing the first extent_state with EXTENT_DELALLOC bit
set. This change is important if set/clear_extent_bit involves
multiple extent_state.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-05-25 10:34:51 -04:00
Yan, Zheng
a22285a6a3 Btrfs: Integrate metadata reservation with start_transaction
Besides simplify the code, this change makes sure all metadata
reservation for normal metadata operations are released after
committing transaction.

Changes since V1:

Add code that check if unlink and rmdir will free space.

Add ENOSPC handling for clone ioctl.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-05-25 10:34:50 -04:00
Yan, Zheng
f0486c68e4 Btrfs: Introduce contexts for metadata reservation
Introducing metadata reseravtion contexts has two major advantages.
First, it makes metadata reseravtion more traceable. Second, it can
reclaim freed space and re-add them to the itself after transaction
committed.

Besides add btrfs_block_rsv structure and related helper functions,
This patch contains following changes:

Move code that decides if freed tree block should be pinned into
btrfs_free_tree_block().

Make space accounting more accurate, mainly for handling read only
block groups.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-05-25 10:34:50 -04:00
Yan, Zheng
2ead6ae770 Btrfs: Kill init_btrfs_i()
All code in init_btrfs_i can be moved into btrfs_alloc_inode()

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-05-25 10:34:49 -04:00
Yan, Zheng
5da9d01b66 Btrfs: Shrink delay allocated space in a synchronized
Shrink delayed allocation space in a synchronized manner is more
controllable than flushing all delay allocated space in an async
thread.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-05-25 10:34:48 -04:00
Yan, Zheng
424499dbd0 Btrfs: Kill allocate_wait in space_info
We already have fs_info->chunk_mutex to avoid concurrent
chunk creation.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-05-25 10:34:48 -04:00
Yan, Zheng
b742bb82f1 Btrfs: Link block groups of different raid types
The size of reserved space is stored in space_info. If block groups
of different raid types are linked to separate space_info, changing
allocation profile will corrupt reserved space accounting.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-05-25 10:34:47 -04:00
Miklos Szeredi
c3021629a0 fuse: support splice() reading from fuse device
Allow userspace filesystem implementation to use splice() to read from
the fuse device.

The userspace filesystem can now transfer data coming from a WRITE
request to an arbitrary file descriptor (regular file, block device or
socket) without having to go through a userspace buffer.

The semantics of using splice() to read messages are:

 1)  with a single splice() call move the whole message from the fuse
     device to a temporary pipe
 2)  read the header from the pipe and determine the message type
 3a) if message is a WRITE then splice data from pipe to destination
 3b) else read rest of message to userspace buffer

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2010-05-25 15:06:07 +02:00
Miklos Szeredi
ce534fb052 fuse: allow splice to move pages
When splicing buffers to the fuse device with SPLICE_F_MOVE, try to
move pages from the pipe buffer into the page cache.  This allows
populating the fuse filesystem's cache without ever touching the page
contents, i.e. zero copy read capability.

The following steps are performed when trying to move a page into the
page cache:

 - buf->ops->confirm() to make sure the new page is uptodate
 - buf->ops->steal() to try to remove the new page from it's previous place
 - remove_from_page_cache() on the old page
 - add_to_page_cache_locked() on the new page

If any of the above steps fail (non fatally) then the code falls back
to copying the page.  In particular ->steal() will fail if there are
external references (other than the page cache and the pipe buffer) to
the page.

Also since the remove_from_page_cache() + add_to_page_cache_locked()
are non-atomic it is possible that the page cache is repopulated in
between the two and add_to_page_cache_locked() will fail.  This could
be fixed by creating a new atomic replace_page_cache_page() function.

fuse_readpages_end() needed to be reworked so it works even if
page->mapping is NULL for some or all pages which can happen if the
add_to_page_cache_locked() failed.

A number of sanity checks were added to make sure the stolen pages
don't have weird flags set, etc...  These could be moved into generic
splice/steal code.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2010-05-25 15:06:07 +02:00
Miklos Szeredi
dd3bb14f44 fuse: support splice() writing to fuse device
Allow userspace filesystem implementation to use splice() to write to
the fuse device.  The semantics of using splice() are:

 1) buffer the message header and data in a temporary pipe
 2) with a *single* splice() call move the message from the temporary pipe
    to the fuse device

The READ reply message has the most interesting use for this, since
now the data from an arbitrary file descriptor (which could be a
regular file, a block device or a socket) can be tranferred into the
fuse device without having to go through a userspace buffer.  It will
also allow zero copy moving of pages.

One caveat is that the protocol on the fuse device requires the length
of the whole message to be written into the header.  But the length of
the data transferred into the temporary pipe may not be known in
advance.  The current library implementation works around this by
using vmplice to write the header and modifying the header after
splicing the data into the pipe (error handling omitted):

	struct fuse_out_header out;

	iov.iov_base = &out;
	iov.iov_len = sizeof(struct fuse_out_header);
	vmsplice(pip[1], &iov, 1, 0);
	len = splice(input_fd, input_offset, pip[1], NULL, len, 0);
	/* retrospectively modify the header: */
	out.len = len + sizeof(struct fuse_out_header);
	splice(pip[0], NULL, fuse_chan_fd(req->ch), NULL, out.len, flags);

This works since vmsplice only saves a pointer to the data, it does
not copy the data itself.

Since pipes are currently limited to 16 pages and messages need to be
spliced atomically, the length of the data is limited to 15 pages (or
60kB for 4k pages).

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2010-05-25 15:06:06 +02:00
Miklos Szeredi
b5dd328537 fuse: get page reference for readpages
Acquire a page ref on pages in ->readpages() and release them when the
read has finished.  Not acquiring a reference didn't seem to cause any
trouble since the page is locked and will not be kicked out of the
page cache during the read.

However the following patches will want to remove the page from the
cache so a separate ref is needed.  Making the reference in req->pages
explicit also makes the code easier to understand.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2010-05-25 15:06:06 +02:00
Miklos Szeredi
1bf94ca73e fuse: use get_user_pages_fast()
Replace uses of get_user_pages() with get_user_pages_fast().  It looks
nicer and should be faster in most cases.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2010-05-25 15:06:06 +02:00
Dan Carpenter
4aa0edd294 fuse: remove unneeded variable
"map" isn't needed any more after: 0bd87182d3 "fuse: fix kunmap in
fuse_ioctl_copy_user" 

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2010-05-25 15:06:05 +02:00
Nick Piggin
0ae0b5d055 fs/splice.c: fix mapping_gfp_mask usage
mapping_gfp_mask() is not supposed to store allocation contex details,
only page location details.  So mapping_gfp_mask should be applied to the
pagecache page allocation, wheras normal (kernel mapped) memory should be
used for surrounding allocations such as radix-tree nodes allocated by
add_to_page_cache.  Context modifiers should be applied on a per-callsite
basis.

So change splice to follow this convention (which is followed in similar
code patterns in core code).

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-05-25 10:25:26 +02:00
Jens Axboe
b9598db340 pipe: make F_{GET,SET}PIPE_SZ deal with byte sizes
Instead of requiring an exact number of pages as the argument and
return value, change the API to deal with number of bytes instead.

This also relaxes the requirement that the passed in size must
result in a power-of-2 page array size. Round up to the nearest
power-of-2 automatically and return the resulting size of the pipe
on success.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-05-24 19:34:43 +02:00
Jens Axboe
0191f8697b pipe: F_SETPIPE_SZ should return -EPERM for non-root
If the passed in size is larger than what has been set as the
system wide limit and the user is not root, we want to return
permission denied (not invalid value).

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-05-24 19:15:57 +02:00
Alex Elder
88e88374ee Merge branch 'delayed-logging-for-2.6.35' into for-linus 2010-05-24 11:57:36 -05:00
Dave Chinner
ccf7c23fc1 xfs: Ensure inode allocation buffers are fully replayed
With delayed logging, we can get inode allocation buffers in the
same transaction inode unlink buffers. We don't currently mark inode
allocation buffers in the log, so inode unlink buffers take
precedence over allocation buffers.

The result is that when they are combined into the same checkpoint,
only the unlinked inode chain fields are replayed, resulting in
uninitialised inode buffers being detected when the next inode
modification is replayed.

To fix this, we need to ensure that we do not set the inode buffer
flag in the buffer log item format flags if the inode allocation has
not already hit the log. To avoid requiring a change to log
recovery, we really need to make this a modification that relies
only on in-memory sate.

We can do this by checking during buffer log formatting (while the
CIL cannot be flushed) if we are still in the same sequence when we
commit the unlink transaction as the inode allocation transaction.
If we are, then we do not add the inode buffer flag to the buffer
log format item flags. This means the entire buffer will be
replayed, not just the unlinked fields. We do this while
CIL flusheѕ are locked out to ensure that we don't race with the
sequence numbers changing and hence fail to put the inode buffer
flag in the buffer format flags when we really need to.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-05-24 10:41:22 -05:00
Dave Chinner
df806158b0 xfs: enable background pushing of the CIL
If we let the CIL grow without bound, it will grow large enough to violate
recovery constraints (must be at least one complete transaction in the log at
all times) or take forever to write out through the log buffers. Hence we need
a check during asynchronous transactions as to whether the CIL needs to be
pushed.

We track the amount of log space the CIL consumes, so it is relatively simple
to limit it on a pure size basis. Make the limit the minimum of just under half
the log size (recovery constraint) or 8MB of log space (which is an awful lot
of metadata).

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-05-24 10:38:20 -05:00
Dave Chinner
9da1ab181a xfs: forced unmounts need to push the CIL
If the filesystem is being shut down and the there is no log error,
the current code forces out the current log buffers. This code now needs
to push the CIL before it forces out the log buffers to acheive the same
result.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-05-24 10:38:14 -05:00
Dave Chinner
71e330b593 xfs: Introduce delayed logging core code
The delayed logging code only changes in-memory structures and as
such can be enabled and disabled with a mount option. Add the mount
option and emit a warning that this is an experimental feature that
should not be used in production yet.

We also need infrastructure to track committed items that have not
yet been written to the log. This is what the Committed Item List
(CIL) is for.

The log item also needs to be extended to track the current log
vector, the associated memory buffer and it's location in the Commit
Item List. Extend the log item and log vector structures to enable
this tracking.

To maintain the current log format for transactions with delayed
logging, we need to introduce a checkpoint transaction and a context
for tracking each checkpoint from initiation to transaction
completion.  This includes adding a log ticket for tracking space
log required/used by the context checkpoint.

To track all the changes we need an io vector array per log item,
rather than a single array for the entire transaction. Using the new
log vector structure for this requires two passes - the first to
allocate the log vector structures and chain them together, and the
second to fill them out.  This log vector chain can then be passed
to the CIL for formatting, pinning and insertion into the CIL.

Formatting of the log vector chain is relatively simple - it's just
a loop over the iovecs on each log vector, but it is made slightly
more complex because we re-write the iovec after the copy to point
back at the memory buffer we just copied into.

This code also needs to pin log items. If the log item is not
already tracked in this checkpoint context, then it needs to be
pinned. Otherwise it is already pinned and we don't need to pin it
again.

The only other complexity is calculating the amount of new log space
the formatting has consumed. This needs to be accounted to the
transaction in progress, and the accounting is made more complex
becase we need also to steal space from it for log metadata in the
checkpoint transaction. Calculate all this at insert time and update
all the tickets, counters, etc correctly.

Once we've formatted all the log items in the transaction, attach
the busy extents to the checkpoint context so the busy extents live
until checkpoint completion and can be processed at that point in
time. Transactions can then be freed at this point in time.

Now we need to issue checkpoints - we are tracking the amount of log space
used by the items in the CIL, so we can trigger background checkpoints when the
space usage gets to a certain threshold. Otherwise, checkpoints need ot be
triggered when a log synchronisation point is reached - a log force event.

Because the log write code already handles chained log vectors, writing the
transaction is trivial, too. Construct a transaction header, add it
to the head of the chain and write it into the log, then issue a
commit record write. Then we can release the checkpoint log ticket
and attach the context to the log buffer so it can be called during
Io completion to complete the checkpoint.

We also need to allow for synchronising multiple in-flight
checkpoints. This is needed for two things - the first is to ensure
that checkpoint commit records appear in the log in the correct
sequence order (so they are replayed in the correct order). The
second is so that xfs_log_force_lsn() operates correctly and only
flushes and/or waits for the specific sequence it was provided with.

To do this we need a wait variable and a list tracking the
checkpoint commits in progress. We can walk this list and wait for
the checkpoints to change state or complete easily, an this provides
the necessary synchronisation for correct operation in both cases.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-05-24 10:38:03 -05:00
Dave Chinner
ed3b4d6cdc xfs: Improve scalability of busy extent tracking
When we free a metadata extent, we record it in the per-AG busy
extent array so that it is not re-used before the freeing
transaction hits the disk. This array is fixed size, so when it
overflows we make further allocation transactions synchronous
because we cannot track more freed extents until those transactions
hit the disk and are completed. Under heavy mixed allocation and
freeing workloads with large log buffers, we can overflow this array
quite easily.

Further, the array is sparsely populated, which means that inserts
need to search for a free slot, and array searches often have to
search many more slots that are actually used to check all the
busy extents. Quite inefficient, really.

To enable this aspect of extent freeing to scale better, we need
a structure that can grow dynamically. While in other areas of
XFS we have used radix trees, the extents being freed are at random
locations on disk so are better suited to being indexed by an rbtree.

So, use a per-AG rbtree indexed by block number to track busy
extents.  This incures a memory allocation when marking an extent
busy, but should not occur too often in low memory situations. This
should scale to an arbitrary number of extents so should not be a
limitation for features such as in-memory aggregation of
transactions.

However, there are still situations where we can't avoid allocating
busy extents (such as allocation from the AGFL). To minimise the
overhead of such occurences, we need to avoid doing a synchronous
log force while holding the AGF locked to ensure that the previous
transactions are safely on disk before we use the extent. We can do
this by marking the transaction doing the allocation as synchronous
rather issuing a log force.

Because of the locking involved and the ordering of transactions,
the synchronous transaction provides the same guarantees as a
synchronous log force because it ensures that all the prior
transactions are already on disk when the synchronous transaction
hits the disk. i.e. it preserves the free->allocate order of the
extent correctly in recovery.

By doing this, we avoid holding the AGF locked while log writes are
in progress, hence reducing the length of time the lock is held and
therefore we increase the rate at which we can allocate and free
from the allocation group, thereby increasing overall throughput.

The only problem with this approach is that when a metadata buffer is
marked stale (e.g. a directory block is removed), then buffer remains
pinned and locked until the log goes to disk. The issue here is that
if that stale buffer is reallocated in a subsequent transaction, the
attempt to lock that buffer in the transaction will hang waiting
the log to go to disk to unlock and unpin the buffer. Hence if
someone tries to lock a pinned, stale, locked buffer we need to
push on the log to get it unlocked ASAP. Effectively we are trading
off a guaranteed log force for a much less common trigger for log
force to occur.

Ideally we should not reallocate busy extents. That is a much more
complex fix to the problem as it involves direct intervention in the
allocation btree searches in many places. This is left to a future
set of modifications.

Finally, now that we track busy extents in allocated memory, we
don't need the descriptors in the transaction structure to point to
them. We can replace the complex busy chunk infrastructure with a
simple linked list of busy extents. This allows us to remove a large
chunk of code, making the overall change a net reduction in code
size.

Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-05-24 10:34:00 -05:00
Dave Chinner
955833cf2a xfs: make the log ticket ID available outside the log infrastructure
The ticket ID is needed to uniquely identify transactions when doing busy
extent matching. Delayed logging changes the lifecycle of busy extents with
respect to the transaction structure lifecycle. Hence we can no longer use
the transaction structure as a means of determining the owner of the busy
extent as it may be freed and reused while the busy extent is still active.

This commit provides the infrastructure to access the xlog_tid_t held in the
ticket from a transaction handle. This avoids the need for callers to peek
into the transaction and log structures to find this out.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-05-24 10:33:52 -05:00
Dave Chinner
169a7b078e xfs: clean up log ticket overrun debug output
Push the error message output when a ticket overrun is detected
into the ticket printing functions. Also remove the debug version
of the code as the production version will still panic just as
effectively on a debug kernel via the panic mask being set.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-05-24 10:33:46 -05:00
Dave Chinner
c11554104f xfs: Clean up XFS_BLI_* flag namespace
Clean up the buffer log format (XFS_BLI_*) flags because they have a
polluted namespace. They XFS_BLI_ prefix is used for both in-memory
and on-disk flag feilds, but have overlapping values for different
flags. Rename the buffer log format flags to use the XFS_BLF_*
prefix to avoid confusing them with the in-memory XFS_BLI_* prefixed
flags.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-05-24 10:33:39 -05:00
Dave Chinner
64fc35de60 xfs: modify buffer item reference counting
The buffer log item reference counts used to take referenceѕ for every
transaction, similar to the pin counting. This is symmetric (like the
pin/unpin) with respect to transaction completion, but with dleayed logging
becomes assymetric as the pinning becomes assymetric w.r.t. transaction
completion.

To make both cases the same, allow the buffer pinning to take a reference to
the buffer log item and always drop the reference the transaction has on it
when being unlocked. This is balanced correctly because the unpin operation
always drops a reference to the log item. Hence reference counting becomes
symmetric w.r.t. item pinning as well as w.r.t active transactions and as a
result the reference counting model remain consistent between normal and
delayed logging.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-05-24 10:33:31 -05:00
Dave Chinner
3383ca5780 xfs: allow log ticket allocation to take allocation flags
Delayed logging currently requires ticket allocation to succeed, so
we need to be able to sleep on allocation. It also should not allow
memory allocation to recurse into the filesystem. hence we need to
pass allocation flags directing the type of allocation the caller
requires.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-05-24 10:33:17 -05:00
Dave Chinner
524ee36fa4 xfs: Don't reuse the same transaction ID for duplicated transactions.
The transaction ID is written into the log as the unique identifier
for transactions during recover. When duplicating a transaction, we
reuse the log ticket, which means it has the same transaction ID as
the previous transaction.

Rather than regenerating a random transaction ID for the duplicated
transaction, just add one to the current ID so that duplicated
transaction can be easily spotted in the log and during recovery
during problem diagnosis.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-05-24 10:33:10 -05:00
Linus Torvalds
f13771187b Merge branch 'bkl/ioctl' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing
* 'bkl/ioctl' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing:
  uml: Pushdown the bkl from harddog_kern ioctl
  sunrpc: Pushdown the bkl from sunrpc cache ioctl
  sunrpc: Pushdown the bkl from ioctl
  autofs4: Pushdown the bkl from ioctl
  uml: Convert to unlocked_ioctls to remove implicit BKL
  ncpfs: BKL ioctl pushdown
  coda: Clean-up whitespace problems in pioctl.c
  coda: BKL ioctl pushdown
  drivers: Push down BKL into various drivers
  isdn: Push down BKL into ioctl functions
  scsi: Push down BKL into ioctl functions
  dvb: Push down BKL into ioctl functions
  smbfs: Push down BKL into ioctl function
  coda/psdev: Remove BKL from ioctl function
  um/mmapper: Remove BKL usage
  sn_hwperf: Kill BKL usage
  hfsplus: Push down BKL into ioctl function
2010-05-24 08:01:10 -07:00
Linus Torvalds
0163916f1d Merge branch 'for-linus' of git://git.open-osd.org/linux-open-osd
* 'for-linus' of git://git.open-osd.org/linux-open-osd:
  exofs: confusion between kmap() and kmap_atomic() api
  exofs: Add default address_space_operations
2010-05-24 07:57:41 -07:00
Linus Torvalds
3e766fd41d Merge git://git.kernel.org/pub/scm/linux/kernel/git/hirofumi/fatfs-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/hirofumi/fatfs-2.6:
  fat: convert to unlocked_ioctl
  fat: Cleanup nls_unload() usage
  fat: use pack_hex_byte() instead of custom one
2010-05-24 07:41:47 -07:00
Linus Torvalds
4fd5ec509b Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs:
  9p: Optimize TCREATE by eliminating a redundant fid clone.
  9p: cleanup: remove unneeded assignment
  9p: Add mksock support
  fs/9p: Make sure we properly instantiate dentry.
  9p: add 9P2000.L rename operation
  9p: add 9P2000.L statfs operation
  9p: VFS switches for 9p2000.L: VFS switches
  9p: VFS switches for 9p2000.L: protocol and client changes
2010-05-24 07:41:13 -07:00
Linus Torvalds
6e188240eb Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (59 commits)
  ceph: reuse mon subscribe message instead of allocated anew
  ceph: avoid resending queued message to monitor
  ceph: Storage class should be before const qualifier
  ceph: all allocation functions should get gfp_mask
  ceph: specify max_bytes on readdir replies
  ceph: cleanup pool op strings
  ceph: Use kzalloc
  ceph: use common helper for aborted dir request invalidation
  ceph: cope with out of order (unsafe after safe) mds reply
  ceph: save peer feature bits in connection structure
  ceph: resync headers with userland
  ceph: use ceph. prefix for virtual xattrs
  ceph: throw out dirty caps metadata, data on session teardown
  ceph: attempt mds reconnect if mds closes our session
  ceph: clean up send_mds_reconnect interface
  ceph: wait for mds OPEN reply to indicate reconnect success
  ceph: only send cap releases when mds is OPEN|HUNG
  ceph: dicard cap releases on mds restart
  ceph: make mon client statfs handling more generic
  ceph: drop src address(es) from message header [new protocol feature]
  ...
2010-05-24 07:37:52 -07:00
Steven Whitehouse
7df0e0397b GFS2: Fix permissions checking for setflags ioctl()
We should be checking for the ownership of the file for which
flags are being set, rather than just for write access.

Reported-by: Dan Rosenberg <dan.j.rosenberg@gmail.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-05-24 14:36:48 +01:00
Jan Kara
8f45c33dec ufs: Remove dead quota code
UFS quota is non-functional at least since 2.6.12 because dq_op was set
to NULL. Since the filesystem exists mainly to allow cooperation with Solaris
and quota format isn't standard, just remove the dead code.

CC: Evgeniy Dushistov <dushistov@mail.ru>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-05-24 14:10:19 +02:00
Jan Kara
3635046281 udf: Remove dead quota code
Quota on UDF is non-functional at least since 2.6.16 (I'm too lazy to
do more archeology) because it does not provide .quota_write and .quota_read
functions and thus quotaon(8) just returns EINVAL. Since nobody complained
for all those years and quota support is not even in UDF standard just nuke
it.

Signed-off-by: Jan Kara <jack@suse.cz>
2010-05-24 14:10:19 +02:00
Christoph Hellwig
287a80958c quota: rename default quotactl methods to dquot_
Follow the dquot_* style used elsewhere in dquot.c.

[Jan Kara: Fixed up missing conversion of ext2]

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-05-24 14:10:17 +02:00
Christoph Hellwig
123e9caf1e quota: explicitly set ->dq_op and ->s_qcop
Only set the quota operation vectors if the filesystem actually supports
quota instead of doing it for all filesystems in alloc_super().

[Jan Kara: Export dquot_operations and vfs_quotactl_ops]

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-05-24 14:10:17 +02:00
Christoph Hellwig
307ae18a56 quota: drop remount argument to ->quota_on and ->quota_off
Remount handling has fully moved into the filesystem, so all this is
superflous now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-05-24 14:09:12 +02:00
Christoph Hellwig
e0ccfd959c quota: move unmount handling into the filesystem
Currently the VFS calls into the quotactl interface for unmounting
filesystems.  This means filesystems with their own quota handling
can't easily distinguish between user-space originating quotaoff
and an unount.  Instead move the responsibily of the unmount handling
into the filesystem to be consistent with all other dquot handling.

Note that we do call dquot_disable a lot later now, e.g. after
a sync_filesystem.  But this is fine as the quota code does all its
writes via blockdev's mapping and that is synced even later.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-05-24 14:09:12 +02:00
Christoph Hellwig
0f0dd62fdd quota: kill the vfs_dq_off and vfs_dq_quota_on_remount wrappers
Instead of having wrappers in the VFS namespace export the dquot_suspend
and dquot_resume helpers directly.  Also rename vfs_quota_disable to
dquot_disable while we're at it.

[Jan Kara: Moved dquot_suspend to quotaops.h and made it inline]

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-05-24 14:06:40 +02:00
Christoph Hellwig
c79d967de3 quota: move remount handling into the filesystem
Currently do_remount_sb calls into the dquot code to tell it about going
from rw to ro and ro to rw.  Move this code into the filesystem to
not depend on the dquot code in the VFS - note ocfs2 already ignores
these calls and handles remount by itself.  This gets rid of overloading
the quotactl calls and allows to unify the VFS and XFS codepaths in
that area later.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-05-24 14:06:39 +02:00
Jan Kara
eea7feb072 ocfs2: Fix use after free on remount read-only
We also have to cancel quota syncing thread on remount read only because
at that moment quota is being turned off. Otherwise quota syncing thread
will try to access already freed quota structures.

Signed-off-by: Jan Kara <jack@suse.cz>
2010-05-24 14:06:39 +02:00