By making this member a void pointer we can get rid of a lot of pointless
casts.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Currently we need to either call IHOLD or xfs_trans_ihold on an inode when
joining it to a transaction via xfs_trans_ijoin.
This patch instead makes xfs_trans_ijoin usable on its own by doing an
implicit xfs_trans_ihold, which also allows us to drop the third
argument. For the case where we want to hold a reference on the inode,
an xfs_trans_ijoin_ref wrapper is added which does the IHOLD and marks
the inode as needing an xfs_iput. In addition to the cleaner interface
to the caller this also simplifies the implementation.
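For illustration, the calling convention changes roughly as follows (a
sketch, assuming the new xfs_trans_ijoin() keeps only the transaction and
inode arguments while the lock flags move to the _ref variant):

    /* before: join the inode and explicitly hold it */
    xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
    xfs_trans_ihold(tp, ip);

    /* after: the join implies the hold */
    xfs_trans_ijoin(tp, ip);

    /*
     * after, when the transaction should also keep an inode reference
     * that is later dropped via xfs_iput
     */
    xfs_trans_ijoin_ref(tp, ip, XFS_ILOCK_EXCL);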
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Get rid of the xfs_buf_pin/xfs_buf_unpin/xfs_buf_ispin helpers and opencode
them in their only callers, just like we did for the inode pinning a while
ago. Also remove duplicate trace points - the bufitem tracepoints cover
all the information that is present in a buffer tracepoint.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Stop the function pointer casting madness and give all the li_cb instances
the correct prototype.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Stop the function pointer casting madness and give all the xfs_item_ops the
correct prototypes.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
The unpin_remove item operation instances always share most of the
implementation with the respective unpin implementation. So instead
of keeping two different entry points add a remove flag to the unpin
operation and share the code more easily.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Currently we track log item descriptors belonging to a transaction using a
complex opencoded chunk allocator. This code has been there since day one
and seems to work around the lack of an efficient slab allocator.
This patch replaces it with dynamically allocated log item descriptors
from a dedicated slab pool, linked to the transaction by a linked list.
This allows us to greatly simplify the log item descriptor tracking to the
point where it's just a couple hundred lines in xfs_trans.c instead of
a separate file. The external API has also been simplified while we're
at it - the xfs_trans_add_item and xfs_trans_del_item functions to add/
delete items from a transaction have been simplified to the bare minimum,
and the xfs_trans_find_item function is replaced with a direct dereference
of the li_desc field. All debug code walking the list of log items in
a transaction is down to a simple list_for_each_entry.
Note that we could easily use a singly linked list here instead of the
doubly linked list from list.h, as the fast path only does deletion during
sequential traversal. But given that we don't have one available as
a library function yet, I use the list.h functions for simplicity.
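As a rough sketch of the resulting scheme (field, zone, and helper names
here are illustrative, not necessarily the exact ones in the patch):

    struct xfs_log_item_desc {
            struct xfs_log_item     *lid_item;
            struct list_head        lid_trans;      /* link in tp->t_items */
    };

    /* adding an item just allocates a descriptor from a dedicated slab ... */
    void xfs_trans_add_item(struct xfs_trans *tp, struct xfs_log_item *lip)
    {
            struct xfs_log_item_desc *lidp;

            lidp = kmem_zone_zalloc(xfs_log_item_desc_zone, KM_SLEEP | KM_NOFS);
            lidp->lid_item = lip;
            list_add_tail(&lidp->lid_trans, &tp->t_items);
            lip->li_desc = lidp;
    }

    /* ... and debug walks become a plain list iteration */
    list_for_each_entry(lidp, &tp->t_items, lid_trans)
            ASSERT(lidp->lid_item != NULL);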
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Dmapi support was never merged upstream, but we still have a lot of hooks
bloating XFS for it, all over the fast paths of the filesystem.
This patch drops over 700 lines of dmapi overhead. If we ever get HSM
support in mainline at least the namespace events can be done much saner
in the VFS instead of the individual filesystem, so it's not like this
is much help for future work.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
This inserts a sanity check that refuses to mount a filesystem with an
unsupported block size.
Previously, the nilfs kernel code only checked the limitations of the
device even though mkfs.nilfs2 limits the range of block sizes; there
was no check preventing rec_len overflow with larger block sizes.
With this change, block sizes larger than 64KB or smaller than 1KB are
rejected explicitly by the kernel.
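A minimal sketch of the kind of check intended in the mount path
(constant names and message wording are illustrative):

    if (blocksize < NILFS_MIN_BLOCK_SIZE ||         /* 1KB */
        blocksize > NILFS_MAX_BLOCK_SIZE) {         /* 64KB */
            printk(KERN_ERR
                   "NILFS: unsupported blocksize %d, must be between 1KB and 64KB\n",
                   blocksize);
            return -EINVAL;
    }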
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
With a 64KB block size, a directory entry can have a size of 64KB, which
does not fit into the 16 bits we have for the entry length. So this
patch stores 0xffff instead and converts the value when it is read from
or written to disk.
Nilfs derives its directory implementation from the ext2 filesystem, and
this draws upon the corresponding change in ext2.
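The helpers mirror ext2's; a sketch of the intended conversion, assuming
a NILFS_MAX_REC_LEN constant equal to ((1<<16)-1):

    static inline unsigned nilfs_rec_len_from_disk(__le16 dlen)
    {
            unsigned len = le16_to_cpu(dlen);

            if (len == NILFS_MAX_REC_LEN)   /* 0xffff stored on disk ... */
                    return 1 << 16;         /* ... means a full 64KB record */
            return len;
    }

    static inline __le16 nilfs_rec_len_to_disk(unsigned len)
    {
            if (len == (1 << 16))
                    return cpu_to_le16(NILFS_MAX_REC_LEN);
            BUG_ON(len > (1 << 16));
            return cpu_to_le16(len);
    }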
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Commit 8b8edefa (fscache: convert object to use workqueue instead of
slow-work) made fscache_exit() call unregister_sysctl_table()
unconditionally, breaking the build when sysctl is disabled. Fix it by
putting the call inside CONFIG_SYSCTL.
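That is, the unregister call in fscache_exit() becomes conditional,
roughly (assuming the table header is the fscache_sysctl_header used by
fscache's sysctl code):

    #ifdef CONFIG_SYSCTL
            unregister_sysctl_table(fscache_sysctl_header);
    #endif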
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: David Howells <dhowells@redhat.com>
The implementation of nilfs_get_page() is a bit dated, as described below:
- A common read_mapping_page inline function is now available and can
  replace its use of read_cache_page.
- The wait_on_page_locked() call in the function can be eliminated since
  the read_cache_page function does the same thing through wait_on_page_read().
- The PageUptodate() check can be eliminated for the same reason.
This renews nilfs_get_page() based on these points.
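A hedged sketch of the renewed helper, patterned after ext2_get_page()
and assuming nilfs' existing nilfs_check_page()/nilfs_put_page() helpers:

    static struct page *nilfs_get_page(struct inode *dir, unsigned long n)
    {
            struct address_space *mapping = dir->i_mapping;
            struct page *page = read_mapping_page(mapping, n, NULL);

            if (!IS_ERR(page)) {
                    kmap(page);
                    if (!PageChecked(page))
                            nilfs_check_page(page);
                    if (PageError(page))
                            goto fail;
            }
            return page;

    fail:
            nilfs_put_page(page);
            return ERR_PTR(-EIO);
    }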
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
When we embed a dentry lease release notification in a request, invalidate
our lease so we don't think we still have it. Otherwise we can get all
sorts of incorrect client behavior when multiple clients are interacting
with the same part of the namespace.
Signed-off-by: Sage Weil <sage@newdream.net>
Free the ceph_pg_mapping structs when they are removed from the pg_temp
rbtree. Also fix a leak in the __insert_pg_mapping() error path.
Signed-off-by: Sage Weil <sage@newdream.net>
We need to set the d_release dop for snapdir and snapped dentries so that
the ceph_dentry_info struct gets released. We also use the dcache to
cache readdir results when possible, which only works if we know when
dentries are dropped from the cache. Since we don't use the dcache for
readdir in the hidden snapdir, avoid that case in ceph_dentry_release.
Signed-off-by: Sage Weil <sage@newdream.net>
This is just cleanup--it's harmless to call nfsd_racache_init,
nfsd_init_socks, and nfsd_reset_versions more than once. But there's no
point to it.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Right now, nfsd keeps a lockd reference for each socket that it has
open. This is unnecessary and complicates the error handling on
startup and shutdown. Change it to just do a lockd_up when starting
the first nfsd thread and a single lockd_down when taking down the
last nfsd thread. Because of the strange way the sv_count is handled
this requires an extra flag to tell whether the nfsd_serv holds a
reference for lockd or not.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
There doesn't seem to be any need to reset the nfssvc_boot time if the
nfsd startup failed.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
__write_ports_addxprt calls nfsd_create_serv. That increases the
refcount of nfsd_serv (which is tracked in sv_nrthreads). The service
only decrements the thread count on error, not on success like
__write_ports_addfd does, so using this interface leaves the nfsd
thread count high.
Fix this by having this function call svc_destroy() on error to release
the reference (and possibly to tear down the service) and simply
decrement the refcount without tearing down the service on success.
This makes the sv_nrthreads handling work basically the same in both
__write_ports_addxprt and __write_ports_addfd.
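Schematically the reference handling ends up like this (a simplified
sketch; setup_the_transport() is just a placeholder for the actual
transport creation calls):

    err = nfsd_create_serv();       /* bumps nfsd_serv->sv_nrthreads */
    if (err)
            return err;

    err = setup_the_transport();    /* placeholder for the xprt creation */
    if (err < 0) {
            /* error: put the reference, destroying the serv if it was the last */
            svc_destroy(nfsd_serv);
            return err;
    }

    /* success: drop the count, but leave the service in place */
    nfsd_serv->sv_nrthreads--;
    return 0;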
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The refcounting for nfsd is a little goofy. What happens is that we
create the nfsd RPC service, attach sockets to it but don't actually
start the threads until someone writes to the "threads" procfile. To do
this, __write_ports_addfd will create the nfsd service and then will
decrement the refcount when exiting but won't actually destroy the
service.
This is fine when there aren't errors, but when there are this can
cause later attempts to start nfsd to fail. nfsd_serv will be set,
and that causes __write_versions to return EBUSY.
Fix this by calling svc_destroy on nfsd_serv when this function is
going to return error.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
If someone tries to shut down the laundry_wq while it isn't up, it'll
cause an oops.
This can happen because write_ports can create an nfsd_svc before we
really start the nfs server, and we may fail before the server is ever
started.
Also make sure state is shutdown on error paths in nfsd_svc().
Use a common global nfsd_up flag instead of nfs4_init, and create common
helper functions for nfsd start/shutdown, as there will be other work
that we want done only when the number of nfsd threads transitions
between zero and nonzero.
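A rough sketch of the shape of these helpers (what exactly gets started
and shut down is simplified here):

    static int nfsd_up;     /* nonzero once the server-side state is up */

    static int nfsd_startup(int nrservs)
    {
            int ret;

            if (nfsd_up)
                    return 0;
            ret = lockd_up();
            if (ret)
                    return ret;
            ret = nfs4_state_start();
            if (ret)
                    goto out_lockd;
            nfsd_up = 1;
            return 0;
    out_lockd:
            lockd_down();
            return ret;
    }

    static void nfsd_shutdown(void)
    {
            if (!nfsd_up)
                    return;
            nfs4_state_shutdown();
            lockd_down();
            nfsd_up = 0;
    }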
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Workqueue can now handle high concurrency. Convert gfs to use
workqueue instead of slow-work.
* Steven pointed out that recovery path might be run from allocation
path and thus requires forward progress guarantee without memory
allocation. Create and use gfs_recovery_wq with rescuer. Please
note that forward progress wasn't guaranteed with slow-work.
* Updated to use non-reentrant workqueue.
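The setup this implies looks roughly like the following (a sketch; on
the tree this was written against, the reclaim flag may still be spelled
WQ_RESCUER rather than WQ_MEM_RECLAIM):

    struct workqueue_struct *gfs_recovery_wq;

    static int __init gfs_recovery_wq_init(void)    /* illustrative helper */
    {
            /*
             * Non-reentrant, with a rescuer so recovery can make forward
             * progress even when memory is tight.
             */
            gfs_recovery_wq = alloc_workqueue("gfs_recovery",
                                              WQ_NON_REENTRANT | WQ_MEM_RECLAIM,
                                              0);
            return gfs_recovery_wq ? 0 : -ENOMEM;
    }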
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
data=writeback mode is dangerous as it leads to higher data loss and stale data
exposure when systems crash. It should not be the default, especially when all
major distros ensure their ext3 filesystems default to ordered mode. Change the
default mode to the safer data=ordered mode, because we should be caring far
more about avoiding stale data exposure than performance.
CC: linux-ext4@vger.kernel.org
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Acked-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Quota code never touches file data. It just modifies i_blocks + i_bytes
of inodes and inode flags of quota files. So use mark_inode_dirty_sync
instead of mark_inode_dirty.
Signed-off-by: Jan Kara <jack@suse.cz>
This forces nilfs to check the compatibility of feature flags so as to
reject a filesystem with unknown features when it is mounted or
remounted.
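A sketch of the intended check, assuming support masks named
NILFS_FEATURE_INCOMPAT_SUPP and NILFS_FEATURE_COMPAT_RO_SUPP:

    static int nilfs_check_feature_compatibility(struct super_block *sb,
                                                 struct nilfs_super_block *sbp)
    {
            __u64 features;

            features = le64_to_cpu(sbp->s_feature_incompat) &
                    ~NILFS_FEATURE_INCOMPAT_SUPP;
            if (features) {
                    printk(KERN_ERR "NILFS: couldn't mount because of unsupported "
                           "optional features (%llx)\n",
                           (unsigned long long)features);
                    return -EINVAL;
            }
            features = le64_to_cpu(sbp->s_feature_compat_ro) &
                    ~NILFS_FEATURE_COMPAT_RO_SUPP;
            if (!(sb->s_flags & MS_RDONLY) && features) {
                    printk(KERN_ERR "NILFS: couldn't mount RDWR because of "
                           "unsupported optional features (%llx)\n",
                           (unsigned long long)features);
                    return -EINVAL;
            }
            return 0;
    }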
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
This applies read-ahead to nilfs_btree_do_lookup and
nilfs_btree_lookup_contig functions and extends them to read ahead
siblings of level 1 btree nodes that hold data blocks.
At present, the read-ahead is not applied to most btree operations;
only the get_block() callback function, which is used during reads of
regular files or directories, receives the benefit.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
nilfs_btree_get_block() may now return an unverified buffer due to
read-ahead. This adds a new flag for buffer heads so that the btree
code can check whether the buffer is already verified or not.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
This adds __nilfs_btree_get_block() function that can issue a series
of read-ahead requests for sibling btree nodes.
This read-ahead needs the parent node block, so a
nilfs_btree_readahead_info structure is added to pass the information that
__nilfs_btree_get_block() needs.
This also replaces the previous nilfs_btree_get_block() implementation
with a wrapper function of __nilfs_btree_get_block().
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
This adds a mode argument to the nilfs_btnode_submit_block() function and
allows it to issue a read-ahead request.
An optional submit_ptr argument is also added to store the actual
block address for which bio is sent. submit_ptr is used for a series
of read-ahead requests, and helps to decide if each requested block is
contiguous with the previous one on disk.
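For illustration, the resulting interface would look roughly like this
(the exact prototype may differ slightly):

    int nilfs_btnode_submit_block(struct address_space *btnc, __u64 blocknr,
                                  sector_t pblocknr, int mode,
                                  struct buffer_head **pbh,
                                  sector_t *submit_ptr);

    /*
     * mode is READ for the target node and READA for sibling read-ahead;
     * *submit_ptr carries the last submitted block address so the callee
     * can tell whether the next request is contiguous on disk.
     */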
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
nilfs_btnode_submit_block() refers to the buffer head just before
returning from the function, but it releases the buffer head earlier
than that if nilfs_dat_translate() gets an error.
This can potentially cause an oops in the error case. This fixes the
issue.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
This removes all uses of the inline keyword from btree.c. Gcc now
aggressively applies inline expansion even to functions declared without
the keyword, so the explicit inline use in btree.c looks excessive.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
The patch "reduce repetitive calculation of max number of child nodes"
gathered up the calculation of maximum number of child nodes into
nilfs_btree_nchildren_per_block() function. This makes the function
get the resultant value from a private variable in the bmap object
instead of calculating it for each call.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
The current btree implementation repeats the same calculation on the
maximum number of child nodes. This is because a few low level
routines use the calculation for index addressing in a btree node
block.
This reduces the recalculation by explicitly passing the maximum number
of child nodes (ncmax) as a function argument.
This changes parameter passing of the following functions:
- nilfs_btree_node_dptrs
- nilfs_btree_node_get_ptr
- nilfs_btree_node_set_ptr
- nilfs_btree_node_init
- nilfs_btree_node_move_left
- nilfs_btree_node_move_right
- nilfs_btree_node_insert
- nilfs_btree_node_delete, and
- nilfs_btree_get_node
The following functions are removed:
- nilfs_btree_node_nchildren_min
- nilfs_btree_node_nchildren_max
Most middle level btree operations are rewritten to pass a proper
ncmax value depending on whether each occurrence of node is "root" or
not.
A constant NILFS_BTREE_ROOT_NCHILDREN_MAX is used for the root node,
whereas nilfs_btree_nchildren_per_block() function is used for
non-root nodes. If a node could be either root or a non-root node, an
output argument of nilfs_btree_get_node() is used to set up ncmax.
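Schematically, a caller changes along these lines (node_is_root is just
a placeholder for however the call site knows which case applies):

    /* before: the accessor recomputed the per-node limit internally */
    ptr = nilfs_btree_node_get_ptr(btree, node, index);

    /* after: the caller picks the applicable limit and passes it down */
    ncmax = node_is_root ? NILFS_BTREE_ROOT_NCHILDREN_MAX :
                           nilfs_btree_nchildren_per_block(btree);
    ptr = nilfs_btree_node_get_ptr(node, index, ncmax);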
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
nilfs_btree_node_nchildren_max() and nilfs_btree_node_nchildren_min()
functions switch their return value depending on whether the target node is the
root or a node block. In most uses of these functions, however, the
node type is fixed, and moreover the same calculation is repeatedly
performed in a loop.
This unfolds these functions depending on context and moves them outside
loops wherever possible.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
nilfs_bmap_lookup and its variants are supposed to take a valid
pointer argument to return a block address, thus pointer checks in
nilfs_btree_lookup and nilfs_direct_lookup are needless.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
This removes nilfs_bmap_union and finally unifies three structures and
the union in bmap/btree code into one.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
This unifies two similar functions nilfs_btree_set_target_v and
nilfs_direct_set_target_v into one, nilfs_bmap_set_target_v.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
This replaces all uses of nilfs_btree struct in implementation of
btree mapping with nilfs_bmap struct.
The name of the local variable "btree" is kept so as not to bloat the
amount of change, and some local variables named "bmap" are renamed to
"btree" for naming consistency.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
This replaces all uses of nilfs_direct struct in implementation of
direct mapping with nilfs_bmap struct.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
The first argument of the bops->bop_propagate operation takes a const
qualifier, which causes a compilation error when the cast to a pointer
of the nilfs_btree structure type is removed. This fixes the issue to
prepare for the subsequent removal of the nilfs_btree struct.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
This removes nilfs_bmap_key_to_dkey(), nilfs_bmap_dkey_to_key(),
nilfs_bmap_ptr_to_dptr(), and nilfs_bmap_dptr_to_ptr() for simplicity.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>