Operations that modify state for a whole page must be syncronized across
all requests within a page group. In the read path, this is calling
unlock_page and SetPageUptodate. Both of these functions should not be
called until all requests in a page group have reached the point where
they would call them.
This patch should have no effect yet since all page groups currently
have one request, but will come into play when pg_test functions are
modified to split pages into sub-page regions.
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Add "page groups" - a circular list of nfs requests (struct nfs_page)
that all reference the same page. This gives nfs read and write paths
the ability to account for sub-page regions independently. This
somewhat follows the design of struct buffer_head's sub-page
accounting.
Only "head" requests are ever added/removed from the inode list in
the buffered write path. "head" and "sub" requests are treated the
same through the read path and the rest of the write/commit path.
Requests are given an extra reference across the life of the list.
Page groups are never rejoined after being split. If the read/write
request fails and the client falls back to another path (ie revert
to MDS in PNFS case), the already split requests are pushed through
the recoalescing code again, which may split them further and then
coalesce them into properly sized requests on the wire. Fragmentation
shouldn't be a problem with the current design, because we flush all
requests in page group when a non-contiguous request is added, so
the only time resplitting should occur is on a resend of a read or
write.
This patch lays the groundwork for sub-page splitting, but does not
actually do any splitting. For now all page groups have one request
as pg_test functions don't yet split pages. There are several related
patches that are needed support multiple requests per page group.
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Call nfs_can_coalesce_requests for every request, even the first one.
This is needed for future patches to give pg_test a way to inform
add_request to reduce the size of the request.
Now @prev can be null in nfs_can_coalesce_requests and pg_test functions.
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
This is a step toward allowing pg_test to inform the the
coalescing code to reduce the size of requests so they may fit in
whatever scheme the pg_test callback wants to define.
For now, just return the size of the request if there is space, or 0
if there is not. This shouldn't change any behavior as it acts
the same as when the pg_test functions returned bool.
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
@inode is passed but not used.
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
At this point the read and write structures look identical, so combine
them into something shared by both.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
What we have here is two functions that look identical. Let's share
some more code!
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Once again, these two functions look identical in the read and write
case. Time to combine them together!
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Most of this code is the same for both the read and write paths, so
combine everything and use the rw_ops when necessary.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Result will be massaged to saner shape in the next commits. It is
ugly, no questions - the point of that one is to be a provably
equivalent transformation (and it might be worth splitting a bit
more).
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Check for NULL before using the acl in the access type switch
statement. This seems to be consistent with what is done in the JFFS
and ext4 filesystems and with the behaviour of JFS in the 3.13 kernel.
The bug seemed to be introduced in commit 2cc6a5a0.
The bug results in a kernel Oops, NULL dereference could not be handled
when accessing a JFS filesystem. The rdiff-backup process seemed to
trigger the bug. See also reported bug #75341:
https://bugzilla.kernel.org/show_bug.cgi?id=75341
Signed-off-by: William Burrow <wbkernel@gmail.com>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
These functions are almost identical on both the read and write side.
FLUSH_COND_STABLE will never be set for the read path, so leaving it in
the generic code won't hurt anything.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
At this point, the read and write versions of this function look
identical so both should use the same function.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Write adds a little bit of code dealing with flush flags, but since
"how" will always be 0 when reading we can share the code.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The read and write paths set up this struct in exactly the same way, so
create a single shared struct.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Combining these functions will let me make a single nfs_rw_common_ops
struct (see the next patch).
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The read and write paths do exactly the same thing for the rpc_prepare
rpc_op. This patch combines them together into a single function.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
I create a new struct nfs_rw_ops to decide the differences between reads
and writes. This struct will be set when initializing a new
nfs_pgio_descriptor, and then passed on to the nfs_rw_header when a new
header is allocated.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
These functions are identical for the read and write paths so they can
be combined.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The header had a pointer to the verifier that was set from the old write
data struct. We don't need to keep the pointer around now that we have
shared structures.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The only difference is the write verifier field, but we can keep that
for a little bit longer.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
At this point, the only difference between nfs_read_data and
nfs_write_data is the write verifier.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reads and writes have very similar results. This patch combines the two
structs together with comments to show where the differing fields are
used.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reads and writes have very similar arguments. This patch combines them
together and documents the few fields used only by write.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The read_pageio_init method is just a very convoluted way to grab the
right nfs_pageio_ops vector. The vector to chose is not a choice of
protocol version, but just a pNFS vs MDS I/O choice that can simply be
done inside nfs_pageio_init_read based on the presence of a layout
driver, and a new force_mds flag to the special case of falling back
to MDS I/O on a pNFS-capable volume.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The write_pageio_init method is just a very convoluted way to grab the
right nfs_pageio_ops vector. The vector to chose is not a choice of
protocol version, but just a pNFS vs MDS I/O choice that can simply be
done inside nfs_pageio_init_write based on the presence of a layout
driver, and a new force_mds flag to the special case of falling back
to MDS I/O on a pNFS-capable volume.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
"fdatasync() is similar to fsync(), but does not flush modified metadata
unless that metadata is needed in order to allow a subsequent data
retrieval to be correctly handled."
We absolutely need to commit the layouts to be able to retrieve the data
in case either the client, the server or the storage subsystem go down.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
xdr_reserve_space should now be calculating the length correctly as we
go, so there's no longer any need to fix it up here.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
This is a cosmetic change for now; no change in behavior.
Note we're just depending on xdr_reserve_space to do the bounds checking
for us, we're not really depending on its adjustment of iovec or xdr_buf
lengths yet, as those are fixed up by as necessary after the fact by
read-link operations and by nfs4svc_encode_compoundres. However we do
have to update xdr->iov on read-like operations to prevent
xdr_reserve_space from messing with the already-fixed-up length of the
the head.
When the attribute encoding fails partway through we have to undo the
length adjustments made so far. We do it manually for now, but later
patches will add an xdr_truncate_encode() helper to handle cases like
this.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
It can happen only when dentry_kill() is called with unlock_on_failure
equal to 0 - other callers had dentry pinned until the moment they've
got ->d_lock and DCACHE_DENTRY_KILLED is set only after lockref_mark_dead().
IOW, only one of three call sites of dentry_kill() might end up reaching
that code. Just move it there.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The debugging check which verifies that we never write outside of the file
length was incorrect, since it was multiplying file length by the page size,
instead of dividing. Fix this.
Spotted-by: hujianyang <hujianyang@huawei.com>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Commit 6130f5315e "switch vmsplice_to_user() to copy_page_to_iter()" in
v3.15-rc1 broke vmsplice(2).
This patch fixes two bugs:
- count is not initialized to a proper value, which resulted in no data
being copied
- if rw_copy_check_uvector() returns negative then the iov might be leaked.
Tested OK.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
There is still one residue of sysfs remaining: the sb_magic
SYSFS_MAGIC. However this should be kernfs user specific,
so this patch moves it out. Kerrnfs user should specify their
magic number while mouting.
Signed-off-by: Jianyu Zhan <nasa4836@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cody Schafer already fixed binary file creation for attribute groups, see [1].
This patch makes the appropriate changes for binary file removal
of attribute groups.
[1]: http://lkml.org/lkml/2014/2/27/832
Signed-off-by: Robert ABEL <rabel@cit-ec.uni-bielefeld.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The variable "size" is expressed as number of blocks and not as
number of clusters, this could trigger a kernel panic when using
ext4 with the size of a cluster different from the size of a block.
Cc: stable@vger.kernel.org
Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Tail of a page straddling inode size must be zeroed when being written
out due to POSIX requirement that modifications of mmaped page beyond
inode size must not be written to the file. ext4_bio_write_page() did
this only for blocks fully beyond inode size but didn't properly zero
blocks partially beyond inode size. Fix this.
The problem has been uncovered by mmap_11-4 test in openposix test suite
(part of LTP).
Reported-by: Xiaoguang Wang <wangxg.fnst@cn.fujitsu.com>
Fixes: 5a0dc7365c
Fixes: bd2d0210cf
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Remove local variable "stored" from ext4_readdir(...). This variable
gets initialized but is never used inside the function.
Signed-off-by: Giedrius Rekasius <giedrius.rekasius@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
xfstests generic/091 is failing when mounting ext4 with data=journal.
I think that this regression is same problem that occurred prior to collapse
range issue. So ZERO RANGE also need to call ext4_force_commit as
collapse range.
Cc: stable@vger.kernel.org
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
This post-encoding check should be taking into account the need to
encode at least an out-of-space error to the following op (if any).
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
If nfsd4_check_resp_size() returns an error then we should really be
truncating the reply here, otherwise we may leave extra garbage at the
end of the rpc reply.
Also add a warning to catch any cases where our reply-size estimates may
be wrong in the case of a non-idempotent operation.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
If UBIFS_DEBUG is defined an additional assertion of the ui_lock
spinlock in do_writepage cannot compile because the ui pointer has not
been previously declared.
Fix this by declaring and initializing the ui pointer in case
UBIFS_DEBUG is defined.
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Function ubifs_read_one_lp will not set @lp and returns
an error when ubifs_read_one_lp failed. We should not
perform ubifs_dump_lprop in this case because @lp is not
initialized as we wanted.
Signed-off-by: hujianyang <hujianyang@huawei.com>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Shuffle code around in ext4_orphan_add() and ext4_orphan_del() so that
we avoid taking global s_orphan_lock in some cases and hold it for
shorter time in other cases.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Use sbi pointer consistently in ext4_orphan_del() instead of opencoding
it sometimes. Also ext4_orphan_add() uses EXT4_SB(sb) often so create
sbi variable for it as well and use it.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Pull AFS fixes and cleanups from David Howells:
"Here are some patches to the AFS filesystem:
1) Fix problems in the clean-up parts of the cache manager service
handler.
2) Split afs_end_call() introduced in (1) and replace some identical
code elsewhere with a call to the first half of the split function.
3) Fix an error introduced in the workqueue PREPARE_WORK() elimination
commits.
4) Clean up argument passing to functions called from the workqueue as
there's now an insulating layer between them and the workqueue.
This is possible from (3)"
* 'afs' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
AFS: Pass an afs_call* to call->async_workfn() instead of a work_struct*
AFS: Fix kafs module unloading
AFS: Part of afs_end_call() is identical to code elsewhere, so split it
AFS: Fix cache manager service handlers