encode_getattr, for example, can return nfserr_resource to indicate it
ran out of buffer space. That's not a legal error in the 4.1 case.
And in the 4.1 case, if we ran out of buffer space, we should have
exceeded a session limit too.
(Note that in 1bc49d83c3 "nfsd4: fix nfs4err_resource in 4.1 case" we
originally tried fixing this error return before fixing the problem
that we could error out while we still had lots of available space.
The result was to trade one illegal error for another in those cases.
We decided that was unhelpful, so we reverted the change in fc208d026b,
and are only reinstating it now that we've eliminated almost all of
those cases.)
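For illustration only, a minimal sketch of the kind of error translation
described above. The enum and the function name are hypothetical, and the
choice of a "reply too big" status as the 4.1 replacement is an assumption,
not a quote of the nfsd code.

```c
#include <stdbool.h>

/* Hypothetical status codes standing in for the nfserr_* values. */
enum nfs_status {
	ST_OK,
	ST_RESOURCE,		/* "out of buffer space"; legal in 4.0 only */
	ST_REP_TOO_BIG,		/* 4.1: reply exceeded a session limit */
};

/*
 * An encoder reports running out of space as ST_RESOURCE.  For a
 * compound that uses a session (4.1), translate that into the
 * session-limit error instead of returning an illegal status.
 */
static enum nfs_status fixup_resource_error(enum nfs_status st,
					    bool has_session)
{
	if (has_session && st == ST_RESOURCE)
		return ST_REP_TOO_BIG;
	return st;
}
```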
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
I'm not sure why a client would want to stuff multiple reads in a
single compound rpc, but it's legal for them to do it, and we should
really support it.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The splice and readv cases are actually quite different--for example the
former case ignores the array of vectors we build up for the latter.
It is probably clearer to separate the two cases entirely.
There's some code duplication between the split out encoders, but this
is only temporary and will be fixed by a later patch.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
We currently allow only one read per compound, with operations before
and after whose responses will require no more than about a page to
encode.
While we don't expect clients to violate those limits any time soon,
this limitation isn't really condoned by the spec, so to future proof
the server we should lift the limitation.
At the same time we'd like to continue to support zero-copy reads.
Supporting multiple zero-copy-reads per compound would require a new
data structure to replace struct xdr_buf, which can represent only one
set of included pages.
So for now we plan to modify encode_read() to support either zero-copy
or non-zero-copy reads, and use some heuristics at the start of the
compound processing to decide whether a zero-copy read will work.
This will allow us to support more exotic compounds without introducing
a performance regression in the normal case.
Later patches handle those "exotic compounds"; this one just makes sure
zero-copy is turned off in those cases.
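A rough sketch of what such an up-front heuristic could look like, assuming
a hypothetical per-op summary (struct op_summary) and an assumed 4096-byte
page. It only restates the "one read, everything else fits in about a page"
rule from above and is not the actual nfsd logic.

```c
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size */

/* Hypothetical per-op summary: whether the op is a read, and an
 * estimate of its reply size excluding any read data. */
struct op_summary {
	bool is_read;
	unsigned int est_reply_bytes;
};

/*
 * Decide before processing the compound whether its read may use
 * zero-copy: exactly one read, and all the non-data reply bytes are
 * estimated to fit in about a page.
 */
static bool compound_allows_zero_copy(const struct op_summary *ops,
				      size_t nops)
{
	unsigned int reads = 0, other_bytes = 0;
	size_t i;

	for (i = 0; i < nops; i++) {
		if (ops[i].is_read)
			reads++;
		other_bytes += ops[i].est_reply_bytes;
	}
	return reads == 1 && other_bytes <= SKETCH_PAGE_SIZE;
}
```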
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
We plan to use this estimate to decide whether or not to allow zero-copy
reads. Currently we're assuming all getattr's are a page, which can be
both too small (ACLs e.g. may be arbitrarily long) and too large (after
an upcoming read patch this will unnecessarily prevent zero copy reads
in any read compound also containing a getattr).
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
There's no advantage to this zero-copy-style readlink encoding, and it
unnecessarily limits the kinds of compounds we can handle. (In practice
I can't see why a client would want e.g. multiple readlink calls in a
compound, but it's probably a spec violation for us not to handle it.)
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
As long as we're here, let's enforce the protocol's limit on the number
of directory entries to return in a readdir.
I don't think anyone's ever noticed our lack of enforcement, but maybe
there's more of a chance they will now that we allow larger readdirs.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Currently we limit readdir results to a single page. This can result in
a performance regression compared to NFSv3 when reading large
directories.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Once we know the limits the session places on the size of the rpc, we
can also use that information to release any unnecessary reserved reply
buffer space.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
We can simplify session limit enforcement by restricting the xdr buflen
to the session size.
Also fix a preexisting bug: we should really have been taking into
account the auth-required space when comparing against session limits,
which are limits on the size of the entire rpc reply, including any krb5
overhead.
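As a sketch of the arithmetic, assuming hypothetical parameter names
(session_max for the session's reply-size limit, auth_slack for the space
reserved for krb5 and similar overhead), the clamp could look like this:

```c
#include <stddef.h>

/*
 * Clamp the space available for encoding so that the reply body plus
 * the space reserved for authentication stays within the session's
 * maximum response size.  All names here are hypothetical.
 */
static size_t clamp_buflen(size_t buflen, size_t session_max,
			   size_t auth_slack)
{
	size_t limit;

	if (session_max <= auth_slack)
		return 0;	/* no room for any reply body */
	limit = session_max - auth_slack;
	return buflen < limit ? buflen : limit;
}
```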
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
We don't necessarily want to assume that the buflen is the same
as the number of bytes available in the pages. We may have some reason
to set it to something less (for example, later patches will use a
smaller buflen to enforce session limits).
So, calculate the buflen relative to the previous buflen instead of
recalculating it from scratch.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
It will turn out to be useful to have a more accurate estimate of reply
size; so, piggyback on the existing op reply-size estimators.
Also move nfsd4_max_reply to nfs4proc.c to get easier access to struct
nfsd4_operation and friends. (Thanks to Christoph Hellwig for pointing
out that simplification.)
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
I ran into this corner case in testing: in theory clients can provide
state owners up to 1024 bytes long. In the sessions case there might be
a risk of this pushing us over the DRC slot size.
The conflicting owner isn't really that important, so let's humor a
client that provides a small maxresponsesize_cached by allowing ourselves
to return without the conflicting owner instead of outright failing the
operation.
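A small sketch of that fallback, with hypothetical names and ignoring xdr
padding: encode the full owner when it fits, otherwise encode a zero-length
owner rather than failing.

```c
#include <stddef.h>

/*
 * Pick the owner length to encode in a "lock denied" reply: the full
 * owner if the reply still fits in the space left, otherwise zero,
 * i.e. the reply goes out without the conflicting owner.
 * fixed_len is the size of the reply without any owner bytes.
 */
static size_t denied_owner_len(size_t owner_len, size_t fixed_len,
			       size_t space_left)
{
	if (fixed_len <= space_left && owner_len <= space_left - fixed_len)
		return owner_len;
	return 0;
}
```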
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Limits on maxresp_sz mean that we only ever need to replay rpc's that
are contained entirely in the head.
The one exception is very small zero-copy reads. That's an odd corner
case as clients wouldn't normally ask those to be cached.
In any case, this seems a little more robust.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
After this we can handle for example getattr of very large ACLs.
Read, readdir, readlink are still special cases with their own limits.
Also we can't handle a new operation starting close to the end of a
page.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Now that all op encoders can handle running out of space, we no longer
need to check the remaining size for every operation; only nonidempotent
operations need that check, and that can be done by
nfsd4_check_resp_size.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Once we've included page-cache pages in the encoding it's difficult to
remove them and restart encoding. (xdr_truncate_encode doesn't handle
that case.) So, make sure we'll have adequate space to finish the
operation first.
For now COMPOUND_SLACK_SPACE checks should prevent this case happening,
but we want to remove those checks.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
We've tried to prevent running out of space with COMPOUND_SLACK_SPACE
and special checking in those operations (getattr) whose result can vary
enormously.
However:
- COMPOUND_SLACK_SPACE may be difficult to maintain as we add
more protocol.
- BUG_ON or page faulting on failure seems overly fragile.
- Especially in the 4.1 case, we prefer not to fail compounds
just because the returned result came *close* to session
limits. (Though perfect enforcement here may be difficult.)
- I'd prefer encoding to be uniform for all encoders instead of
having special exceptions for encoders containing, for
example, attributes.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Normally xdr encoding proceeds in a single pass from start of a buffer
to end, but sometimes we have to write a few bytes to an earlier
position.
Use write_bytes_to_xdr_buf for these cases rather than saving a pointer
to write to. We plan to rewrite xdr_reserve_space to handle encoding
across page boundaries using a scratch buffer, and don't want to risk
writing to a pointer that was contained in a scratch buffer.
Also it will no longer be safe to calculate lengths by subtracting two
pointers, so use xdr_buf offsets instead.
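To illustrate why offsets are safer than saved pointers, here is a userspace
sketch with a hypothetical growable buffer (struct sketch_buf and its
helpers are made up for this example and are not the sunrpc API). realloc()
may move the data, much as a scratch buffer would invalidate a saved
pointer, so the length word is back-filled by offset.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable buffer; realloc() may move the data, so saved
 * pointers can go stale while offsets stay valid. */
struct sketch_buf {
	unsigned char *data;
	size_t len;
};

/* Append n bytes; returns the offset they were written at, or
 * (size_t)-1 on allocation failure. */
static size_t buf_append(struct sketch_buf *b, const void *src, size_t n)
{
	unsigned char *p = realloc(b->data, b->len + n);
	size_t off = b->len;

	if (!p)
		return (size_t)-1;
	memcpy(p + off, src, n);
	b->data = p;
	b->len += n;
	return off;
}

/* Overwrite a 4-byte big-endian word at an earlier offset. */
static void buf_write_be32_at(struct sketch_buf *b, size_t off, uint32_t v)
{
	unsigned char w[4] = {
		(unsigned char)(v >> 24), (unsigned char)(v >> 16),
		(unsigned char)(v >> 8), (unsigned char)v,
	};

	memcpy(b->data + off, w, sizeof(w));
}

/* Encode "length, payload": reserve the length word, append the
 * payload, then back-fill the length from a difference of offsets. */
static int encode_counted(struct sketch_buf *b, const void *payload,
			  size_t n)
{
	static const unsigned char zero[4];
	size_t len_off = buf_append(b, zero, sizeof(zero));
	size_t start;

	if (len_off == (size_t)-1)
		return -1;
	start = b->len;
	if (buf_append(b, payload, n) == (size_t)-1)
		return -1;
	buf_write_be32_at(b, len_off, (uint32_t)(b->len - start));
	return 0;
}
```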
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
We have the same problem with ->d_lock order in the inner loop, where
we are dropping references to ancestors. Same solution, basically -
instead of using dentry_kill() we use lock_parent() (introduced in the
previous commit) to get that lock in a safe way, recheck ->d_count
(in case lock_parent() has ended up dropping and retaking ->d_lock
and somebody managed to grab a reference during that window), trylock
the inode->i_lock and use __dentry_kill() to do the rest.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The cause of livelocks there is that we are taking ->d_lock on
dentry and its parent in the wrong order, forcing us to use
trylock on the parent's one. d_walk() takes them in the right
order, and unfortunately it's not hard to create a situation
when shrink_dentry_list() can't make progress since trylock
keeps failing, and shrink_dcache_parent() or check_submounts_and_drop()
keeps calling d_walk() disrupting the very shrink_dentry_list() it's
waiting for.
Solution is straightforward - if that trylock fails, let's unlock
the dentry itself and take locks in the right order. We need to
stabilize ->d_parent without holding ->d_lock, but that's doable
using RCU. And we'd better do that in the very beginning of the
loop in shrink_dentry_list(), since the checks on refcount, etc.
would need to be redone anyway.
That deals with a half of the problem - killing dentries on the
shrink list itself. Another one (dropping their parents) is
in the next commit.
Locking the parent is interesting - it would be easy to do rcu_read_lock(),
lock whatever we think is a parent, lock dentry itself and check
if the parent is still the right one. Except that we need to check
that *before* locking the dentry, or we are risking taking ->d_lock
out of order. Fortunately, once the presumed parent D1 is locked, we
can check if the child's D2->d_parent is equal to D1 without the need
to lock D2; D2->d_parent
can start or stop pointing to D1 only under D1->d_lock, so taking
D1->d_lock is enough. In other words, the right solution is
rcu_read_lock/lock what looks like parent right now/check if it's
still our parent/rcu_read_unlock/lock the child.
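The ordering described above can be sketched in userspace roughly as
follows. This is an illustration, not the dcache code: struct node and the
node_*() helpers are hypothetical, a C11 atomic_flag stands in for the
spinlock, and RCU is replaced by the assumption that nodes are never freed.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a dentry.  Nodes are assumed never to be
 * freed, which is the guarantee RCU provides in the real scheme. */
struct node {
	atomic_flag lock;
	struct node *_Atomic parent;	/* a root points to itself */
};

static bool node_trylock(struct node *n)
{
	return !atomic_flag_test_and_set_explicit(&n->lock,
						  memory_order_acquire);
}

static void node_lock(struct node *n)
{
	while (!node_trylock(n))
		;	/* spin */
}

static void node_unlock(struct node *n)
{
	atomic_flag_clear_explicit(&n->lock, memory_order_release);
}

/*
 * Called with child locked.  Returns the locked parent with child still
 * (or again) locked, or NULL if child is a root.  When the trylock
 * fails, child is dropped and the locks are retaken parent first.
 */
static struct node *lock_parent_sketch(struct node *child)
{
	struct node *parent = atomic_load(&child->parent);

	if (parent == child)
		return NULL;		/* root: nothing else to lock */
	if (node_trylock(parent))
		return parent;		/* fast path: no contention */

	node_unlock(child);		/* back off, then go parent first */
	for (;;) {
		parent = atomic_load(&child->parent);
		node_lock(parent);
		if (parent == atomic_load(&child->parent))
			break;		/* still our parent; now stable */
		node_unlock(parent);	/* child was moved; retry */
	}
	if (parent == child)
		return NULL;		/* became a root; already locked */
	node_lock(child);
	return parent;
}
```

Since the child's lock may have been dropped and retaken, a caller would
need to recheck anything it learned about the child before the call.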
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The object and block layouts already exist in their own
subdirectories. This patch completes the set!
Note that as a layout denotes nfs4 already, I stripped
that prefix out of the file names.
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
Acked-by: Jeff Layton <jlayton@poochiereds.net>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Those flags are obsolete and checking them can incorrectly cause
remount operations to fail.
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Place the call to resend the failed GETATTR under the error handler so that
when appropriate, the GETATTR is retried more than once.
The server can fail the GETATTR op in the OPEN compound with a recoverable
error such as NFS4ERR_DELAY. In the case of an O_EXCL open, the server has
created the file, so a retrans of the OPEN call will fail with NFS4ERR_EXIST.
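A simplified sketch of retrying only the GETATTR rather than the whole
OPEN. The status values, the callback type and the retry bound are all
hypothetical; the real client relies on its exception handler to decide
whether and when to retry.

```c
/* Hypothetical status values for this sketch. */
enum sk_status { SK_OK = 0, SK_DELAY = 1, SK_FATAL = 2 };

typedef int (*getattr_fn)(void *ctx);

/*
 * Resend only the GETATTR while it fails with a recoverable error,
 * instead of resending the OPEN that already created the file.
 * max_tries is an assumed bound to keep the sketch finite.
 */
static int getattr_with_retry(getattr_fn fn, void *ctx, int max_tries)
{
	int err, tries = 0;

	do {
		err = fn(ctx);
		tries++;
	} while (err == SK_DELAY && tries < max_tries);
	return err;
}

/* Example callback: fails with SK_DELAY until *ctx reaches zero. */
static int flaky_getattr(void *ctx)
{
	int *remaining = ctx;

	if (*remaining > 0) {
		(*remaining)--;
		return SK_DELAY;
	}
	return SK_OK;
}
```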
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
We cannot allow nfs_page_group_lock to use TASK_KILLABLE here, since
the loop would cause a busy wait if somebody kills the task.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Handle the case where nfs_create_request() returns an error.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
nfs_read_completion relied on the fact that there was a 1:1 mapping
of page to nfs_request, but this has now changed.
Regions not covered by a request have already been zeroed elsewhere.
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Use the new pg_test interface to adjust requests to fit in the current
stripe / segment.
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Remove alignment checks that would revert to MDS and change pg_test
to return the max amount left in the segment (or other pg_test call)
up to size of passed request, or 0 if no space is left.
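The new return convention can be sketched as below. The function and its
parameters are hypothetical and only model "space left in the segment,
capped at the request size, or 0".

```c
#include <stddef.h>

/*
 * How many bytes of a req_bytes-sized request may be added when `used`
 * bytes of a seg_size-byte segment are already taken: the space left,
 * capped at the request size, or 0 if the segment is full.
 */
static size_t pg_test_sketch(size_t seg_size, size_t used, size_t req_bytes)
{
	size_t left;

	if (used >= seg_size)
		return 0;	/* no space left: do not coalesce */
	left = seg_size - used;
	return req_bytes < left ? req_bytes : left;
}
```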
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Support direct requests that span multiple pnfs data servers by
comparing nfs_pgio_header->verf to a cached verf in pnfs_commit_bucket.
Continue to use dreq->verf if the MDS is used / non-pNFS.
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Since the ability to split pages into subpage requests has been added,
nfs_pgio_header->rpc_list only ever has one pgio data.
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Use the newly added support for multiple requests per page for
rsize/wsize < PAGE_SIZE, instead of having multiple read / write
data structures per pageio header.
This allows us to get rid of nfs_pgio_multi.
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Now that pg_test can change the size of the request (by returning a non-zero
size smaller than the request), pg_test functions that call other
pg_test functions must return the minimum of the result - or 0 if any fail.
Also clean up the logic of some pg_test functions so that all checks are
for conditions where coalescing is not possible.
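A sketch of the "minimum of the results, or 0 if any fail" rule, using a
hypothetical callback type and three toy tests in place of the real pg_test
functions:

```c
#include <stddef.h>

/* Hypothetical pg_test-style callback: returns how many of req_bytes
 * it would accept, or 0 to refuse coalescing. */
typedef size_t (*pg_test_fn)(size_t req_bytes);

/* Combine several tests: 0 if any refuses, else the smallest result. */
static size_t pg_test_chain(const pg_test_fn *tests, size_t ntests,
			    size_t req_bytes)
{
	size_t size = req_bytes, r, i;

	for (i = 0; i < ntests; i++) {
		r = tests[i](req_bytes);
		if (r == 0)
			return 0;	/* any failure vetoes coalescing */
		if (r < size)
			size = r;
	}
	return size;
}

/* Toy tests used only to exercise the sketch. */
static size_t accept_all(size_t n) { return n; }
static size_t limit_1k(size_t n) { return n < 1024 ? n : 1024; }
static size_t reject_all(size_t n) { (void)n; return 0; }
```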
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Change how nfs_mark_uptodate checks to see if writes cover a whole page.
This patch should have no effect yet since all page groups currently
have one request, but will come into play when pg_test functions are
modified to split pages into sub-page regions.
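One way such a coverage check could work once a page has several requests,
sketched with a hypothetical struct subreq; this is an assumption about the
general approach, not the code in this patch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sub-page request: a byte range within one page. */
struct subreq {
	unsigned int offset;
	unsigned int bytes;
};

/*
 * Return true if the ranges together cover [0, page_size).  Starting at
 * position 0, repeatedly look for a range that contains the current
 * position and jump to its end; a gap means the page is not covered.
 */
static bool group_covers_page(const struct subreq *reqs, size_t n,
			      unsigned int page_size)
{
	unsigned int pos = 0;
	size_t i;

	while (pos < page_size) {
		bool advanced = false;

		for (i = 0; i < n; i++) {
			unsigned int end = reqs[i].offset + reqs[i].bytes;

			if (reqs[i].offset <= pos && end > pos) {
				pos = end;
				advanced = true;
				break;
			}
		}
		if (!advanced)
			return false;
	}
	return true;
}
```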
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Operations that modify state for a whole page must be synchronized across
all requests within a page group. In the write path, this is calling
end_page_writeback and removing the head request from an inode.
Both of these operations should not be called until all requests
in a page group have reached the point where they would call them.
This patch should have no effect yet since all page groups currently
have one request, but will come into play when pg_test functions are
modified to split pages into sub-page regions.
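The synchronization rule can be sketched as "the last request to arrive
does the work". The struct and function below are hypothetical, and the
caller is assumed to hold whatever lock protects the page group.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-request flag: has this request reached the point
 * where it would perform the whole-page operation? */
struct group_req {
	bool reached;
};

/*
 * Mark request idx as having reached the sync point.  Returns true only
 * for the last request of the group to arrive; that caller then performs
 * the whole-page operation once.  The flags are reset so the group can
 * be reused for the next sync point.
 */
static bool group_sync(struct group_req *reqs, size_t n, size_t idx)
{
	size_t i;

	reqs[idx].reached = true;
	for (i = 0; i < n; i++)
		if (!reqs[i].reached)
			return false;
	for (i = 0; i < n; i++)
		reqs[i].reached = false;
	return true;
}
```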
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>