The aim of .flush fop is to hint file-system that flushing its state or caches
or any other important data to reliable storage would be desirable now.
fuse_flush() passes this hint by sending FUSE_FLUSH request to userspace.
However, dirty pages and pages under writeback may be not visible to userspace
yet if we won't ensure it explicitly.
Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
The .write_begin and .write_end are requiered to use generic routines
(generic_file_aio_write --> ... --> generic_perform_write) for buffered
writes.
Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Move the code filling and sending read request to a separate function. Future
patches will use it for .write_begin -- partial modification of a page
requires reading the page from the storage very similarly to what fuse_readpage
does.
Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Any write request requires a file handle to report to the userspace. Thus
when we close a file (and free the fuse_file with this info) we have to
flush all the outstanding dirty pages.
filemap_write_and_wait() is enough because every page under fuse writeback
is accounted in ff->count. This delays actual close until all fuse wb is
completed.
In case of "write cache" turned off, the flush is ensured by fuse_vma_close().
Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Let the kernel maintain i_mtime locally:
- clear S_NOCMTIME
- implement i_op->update_time()
- flush mtime on fsync and last close
- update i_mtime explicitly on truncate and fallocate
Fuse inode flag FUSE_I_MTIME_DIRTY serves as indication that local i_mtime
should be flushed to the server eventually.
Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Make fuse think that when writeback is on the inode's i_size is always
up-to-date and not update it with the value received from the userspace.
This is done because the page cache code may update i_size without letting
the FS know.
This assumption implies fixing the previously introduced short-read helper --
when a short read occurs the 'hole' is filled with zeroes.
fuse_file_fallocate() is also fixed because now we should keep i_size up to
date, so it must be updated if FUSE_FALLOCATE request succeeded.
Signed-off-by: Maxim V. Patlasov <MPatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Off (0) by default. Will be used in the next patches and will be turned
on at the very end.
Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
A helper which gets called when read reports less bytes than was requested.
See patch "trust kernel i_size only" for details.
Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
When writeback is ON every writeable file should be in per-inode write list,
not only mmap-ed ones. Thus introduce a helper for this linkage.
Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
... and don't skip on sanity checks. It's *not* a hot path, TYVM
(a couple of calls per a.out execve(), for pity sake) and headers of
random a.out binary are not to be trusted.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
... we are doing them on adjacent parts of file, so what happens is that
each subsequent call works to rebuild the iov_iter to exact state it
had been abandoned in by previous one. Just keep it through the entire
cifs_iovec_read(). And use copy_page_to_iter() instead of doing
kmap/copy_to_user/kunmap manually...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
I've switched the sanity checks on iovec to rw_copy_check_uvector();
we might need to do a local analog, if any behaviour differences are
not actually bugfixes here...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
... by that point the request we'd just resent is in the
head of the list anyway. Just return to the beginning of
the loop body...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Make delayed_free() call free_vfsmnt() so that we don't have two functions
doing the same job. This requires the calls to mnt_free_id() in free_vfsmnt()
to be moved into the callers of that function.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
the only thing it's doing these days is calculation of
upper limit for fs.nr_open sysctl and that can be done
statically
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
new flag in ->f_mode - FMODE_WRITER. Set by do_dentry_open() in case
when it has grabbed write access, checked by __fput() to decide whether
it wants to drop the sucker. Allows to stop bothering with mnt_clone_write()
in alloc_file(), along with fewer special_file() checks.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
it's pointless and actually leads to wrong behaviour in at least one
moderately convoluted case (pipe(), close one end, try to get to
another via /proc/*/fd and run into ETXTBUSY).
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The current mainline has copies propagated to *all* nodes, then
tears down the copies we made for nodes that do not contain
counterparts of the desired mountpoint. That sets the right
propagation graph for the copies (at teardown time we move
the slaves of removed node to a surviving peer or directly
to master), but we end up paying a fairly steep price in
useless allocations. It's fairly easy to create a situation
where N calls of mount(2) create exactly N bindings, with
O(N^2) vfsmounts allocated and freed in process.
Fortunately, it is possible to avoid those allocations/freeings.
The trick is to create copies in the right order and find which
one would've eventually become a master with the current algorithm.
It turns out to be possible in O(nodes getting propagation) time
and with no extra allocations at all.
One part is that we need to make sure that eventual master will be
created before its slaves, so we need to walk the propagation
tree in a different order - by peer groups. And iterate through
the peers before dealing with the next group.
Another thing is finding the (earlier) copy that will be a master
of one we are about to create; to do that we are (temporary) marking
the masters of mountpoints we are attaching the copies to.
Either we are in a peer of the last mountpoint we'd dealt with,
or we have the following situation: we are attaching to mountpoint M,
the last copy S_0 had been attached to M_0 and there are sequences
S_0...S_n, M_0...M_n such that S_{i+1} is a master of S_{i},
S_{i} mounted on M{i} and we need to create a slave of the first S_{k}
such that M is getting propagation from M_{k}. It means that the master
of M_{k} will be among the sequence of masters of M. On the
other hand, the nearest marked node in that sequence will either
be the master of M_{k} or the master of M_{k-1} (the latter -
in the case if M_{k-1} is a slave of something M gets propagation
from, but in a wrong peer group).
So we go through the sequence of masters of M until we find
a marked one (P). Let N be the one before it. Then we go through
the sequence of masters of S_0 until we find one (say, S) mounted
on a node D that has P as master and check if D is a peer of N.
If it is, S will be the master of new copy, if not - the master of S
will be.
That's it for the hard part; the rest is fairly simple. Iterator
is in next_group(), handling of one prospective mountpoint is
propagate_one().
It seems to survive all tests and gives a noticably better performance
than the current mainline for setups that are seriously using shared
subtrees.
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull core block layer updates from Jens Axboe:
"This is the pull request for the core block IO bits for the 3.15
kernel. It's a smaller round this time, it contains:
- Various little blk-mq fixes and additions from Christoph and
myself.
- Cleanup of the IPI usage from the block layer, and associated
helper code. From Frederic Weisbecker and Jan Kara.
- Duplicate code cleanup in bio-integrity from Gu Zheng. This will
give you a merge conflict, but that should be easy to resolve.
- blk-mq notify spinlock fix for RT from Mike Galbraith.
- A blktrace partial accounting bug fix from Roman Pen.
- Missing REQ_SYNC detection fix for blk-mq from Shaohua Li"
* 'for-3.15/core' of git://git.kernel.dk/linux-block: (25 commits)
blk-mq: add REQ_SYNC early
rt,blk,mq: Make blk_mq_cpu_notify_lock a raw spinlock
blk-mq: support partial I/O completions
blk-mq: merge blk_mq_insert_request and blk_mq_run_request
blk-mq: remove blk_mq_alloc_rq
blk-mq: don't dump CPU -> hw queue map on driver load
blk-mq: fix wrong usage of hctx->state vs hctx->flags
blk-mq: allow blk_mq_init_commands() to return failure
block: remove old blk_iopoll_enabled variable
blktrace: fix accounting of partially completed requests
smp: Rename __smp_call_function_single() to smp_call_function_single_async()
smp: Remove wait argument from __smp_call_function_single()
watchdog: Simplify a little the IPI call
smp: Move __smp_call_function_single() below its safe version
smp: Consolidate the various smp_call_function_single() declensions
smp: Teach __smp_call_function_single() to check for offline cpus
smp: Remove unused list_head from csd
smp: Iterate functions through llist_for_each_entry_safe()
block: Stop abusing rq->csd.list in blk-softirq
block: Remove useless IPI struct initialization
...
In the f2fs_wait_on_page_writeback, io->bio should be covered by io_rwsem.
Otherwise, the bio pointer can become a dangling pointer due to data races.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
We should unlock page in ->readpage() path and also should unlock & release page
in error path of ->write_begin() to avoid deadlock or memory leak.
So let's add release code to fix the problem when we fail to read inline data.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch use list_for_each_entry{_safe} instead of list_for_each{_safe} for
simplfying code.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Move kmem_cache_free out of spinlock protection region for better performance.
Change log from v1:
o remove spinlock protection for kmem_cache_free in destroy_node_manager
suggested by Jaegeuk Kim.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>