Commit graph

38263 commits

Author SHA1 Message Date
Dave Chinner
7656066986 xfs: reorganise transaction recovery item code
The code for managing transactions anf the items for recovery is
spread across 3 different locations in the file. Move them all
together so that it is easy to read the code without needing to jump
long distances in the file.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-29 09:45:42 +10:00
Dave Chinner
88b863db97 xfs: fix double free in xlog_recover_commit_trans
When an error occurs during buffer submission in
xlog_recover_commit_trans(), we free the trans structure twice. Fix
it by only freeing the structure in the caller regardless of the
success or failure of the function.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-29 09:45:32 +10:00
Dave Chinner
e9131e50f9 xfs: recovery of XLOG_UNMOUNT_TRANS leaks memory
The XLOG_UNMOUNT_TRANS case skips the transaction, despite the fact
an unmount record is always in a standalone transaction. Hence
whenever we come across one of these we need to free the transaction
structure associated with it as there is no commit record that
follows it.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-29 09:45:18 +10:00
Dave Chinner
eeb1168810 xfs: refactor xlog_recover_process_data()
Clean up xlog_recover_process_data() structure in preparation for
fixing the allocation and freeing context of the transaction being
recovered.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-29 09:45:03 +10:00
Trond Myklebust
df817ba357 NFSv4: fix open/lock state recovery error handling
The current open/lock state recovery unfortunately does not handle errors
such as NFS4ERR_CONN_NOT_BOUND_TO_SESSION correctly. Instead of looping,
just proceeds as if the state manager is finished recovering.
This patch ensures that we loop back, handle higher priority errors
and complete the open/lock state recovery.

Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-28 16:03:04 -04:00
Trond Myklebust
a4339b7b68 NFSv4: Fix lock recovery when CREATE_SESSION/SETCLIENTID_CONFIRM fails
If a NFSv4.x server returns NFS4ERR_STALE_CLIENTID in response to a
CREATE_SESSION or SETCLIENTID_CONFIRM in order to tell us that it rebooted
a second time, then the client will currently take this to mean that it must
declare all locks to be stale, and hence ineligible for reboot recovery.

RFC3530 and RFC5661 both suggest that the client should instead rely on the
server to respond to inelegible open share, lock and delegation reclaim
requests with NFS4ERR_NO_GRACE in this situation.

Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-28 16:03:03 -04:00
Linus Torvalds
1e3827bf8a Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro:
 "Assorted fixes + unifying __d_move() and __d_materialise_dentry() +
  minimal regression fix for d_path() of victims of overwriting rename()
  ported on top of that"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  vfs: Don't exchange "short" filenames unconditionally.
  fold swapping ->d_name.hash into switch_names()
  fold unlocking the children into dentry_unlock_parents_for_move()
  kill __d_materialise_dentry()
  __d_materialise_dentry(): flip the order of arguments
  __d_move(): fold manipulations with ->d_child/->d_subdirs
  don't open-code d_rehash() in d_materialise_unique()
  pull rehashing and unlocking the target dentry into __d_materialise_dentry()
  ufs: deal with nfsd/iget races
  fuse: honour max_read and max_write in direct_io mode
  shmem: fix nlink for rename overwrite directory
2014-09-27 17:05:14 -07:00
Mikhail Efremov
d2fa4a8476 vfs: Don't exchange "short" filenames unconditionally.
Only exchange source and destination filenames
if flags contain RENAME_EXCHANGE.
In case if executable file was running and replaced by
other file /proc/PID/exe should still show correct file name,
not the old name of the file by which it was replaced.

The scenario when this bug manifests itself was like this:
* ALT Linux uses rpm and start-stop-daemon;
* during a package upgrade rpm creates a temporary file
  for an executable to rename it upon successful unpacking;
* start-stop-daemon is run subsequently and it obtains
  the (nonexistant) temporary filename via /proc/PID/exe
  thus failing to identify the running process.

Note that "long" filenames (> DNAiME_INLINE_LEN) are still
exchanged without RENAME_EXCHANGE and this behaviour exists
long enough (should be fixed too apparently).
So this patch is just an interim workaround that restores
behavior for "short" names as it was before changes
introduced by commit da1ce0670c ("vfs: add cross-rename").

See https://lkml.org/lkml/2014/9/7/6 for details.

AV: the comments about being more careful with ->d_name.hash
than with ->d_name.name are from back in 2.3.40s; they
became obsolete by 2.3.60s, when we started to unhash the
target instead of swapping hash chain positions followed
by d_delete() as we used to do when dcache was first
introduced.

Acked-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: stable@vger.kernel.org
Fixes: da1ce0670c "vfs: add cross-rename"
Signed-off-by: Mikhail Efremov <sem@altlinux.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-09-27 15:59:39 -04:00
Linus Torvalds
a28ddb87cd fold swapping ->d_name.hash into switch_names()
and do it along with ->d_name.len there

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-09-27 15:59:11 -04:00
Al Viro
986c01942a fold unlocking the children into dentry_unlock_parents_for_move()
... renaming it into dentry_unlock_for_move() and making it more
symmetric with dentry_lock_for_move().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-09-26 23:11:15 -04:00
Al Viro
63cf427a57 kill __d_materialise_dentry()
it folds into __d_move() now

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-09-26 23:06:14 -04:00
Al Viro
4453641fe8 __d_materialise_dentry(): flip the order of arguments
... thus making it much closer to (now unreachable, BTW) IS_ROOT(dentry)
case in __d_move().  A bit more and it'll fold in.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-09-26 22:54:02 -04:00
Al Viro
9d8cd306a8 __d_move(): fold manipulations with ->d_child/->d_subdirs
list_del() + list_add() is a slightly pessimised list_move()
list_del() + INIT_LIST_HEAD() is a slightly pessimised list_del_init()

Interleaving those makes the resulting code even worse.  And harder to follow...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-09-26 21:34:01 -04:00
Al Viro
8527dd7187 don't open-code d_rehash() in d_materialise_unique()
... and get rid of duplicate BUG_ON() there

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-09-26 21:26:50 -04:00
Al Viro
5cc3821b57 pull rehashing and unlocking the target dentry into __d_materialise_dentry()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-09-26 21:25:35 -04:00
Al Viro
e4502c63f5 ufs: deal with nfsd/iget races
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-09-26 21:17:52 -04:00
Miklos Szeredi
2c80929c4c fuse: honour max_read and max_write in direct_io mode
The third argument of fuse_get_user_pages() "nbytesp" refers to the number of
bytes a caller asked to pack into fuse request. This value may be lesser
than capacity of fuse request or iov_iter.  So fuse_get_user_pages() must
ensure that *nbytesp won't grow.

Now, when helper iov_iter_get_pages() performs all hard work of extracting
pages from iov_iter, it can be done by passing properly calculated
"maxsize" to the helper.

The other caller of iov_iter_get_pages() (dio_refill_pages()) doesn't need
this capability, so pass LONG_MAX as the maxsize argument here.

Fixes: c9c37e2e63 ("fuse: switch to iov_iter_get_pages()")
Reported-by: Werner Baumann <werner.baumann@onlinehome.de>
Tested-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-09-26 21:16:51 -04:00
Christoph Hellwig
0162ac2b97 nfsd: introduce nfsd4_callback_ops
Add a higher level abstraction than the rpc_ops for callback operations.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-09-26 16:29:29 -04:00
Christoph Hellwig
f0b5de1b6b nfsd: split nfsd4_callback initialization and use
Split out initializing the nfs4_callback structure from using it.  For
the NULL callback this gets rid of tons of pointless re-initializations.

Note that I don't quite understand what protects us from running multiple
NULL callbacks at the same time, but at least this chance doesn't make
it worse..

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-09-26 16:29:28 -04:00
Christoph Hellwig
326129d02a nfsd: introduce a generic nfsd4_cb
Add a helper to queue up a callback.  CB_NULL has a bit of special casing
because it is special in the specification, but all other new callback
operations will be able to share code with this and a few more changes
to refactor the callback code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-09-26 16:29:27 -04:00
Christoph Hellwig
2faf3b4350 nfsd: remove nfsd4_callback.cb_op
We can always get at the private data by using container_of, no need for
a void pointer.  Also introduce a little to_delegation helper to avoid
opencoding the container_of everywhere.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-09-26 16:29:26 -04:00
Benny Halevy
341b51df1f nfsd: do not clear rpc_resp in nfsd4_cb_done_sequence
This is incorrect when a callback is has to be restarted, in which case
the XDR decoding of the second iteration will see a NULL cb argument.

[hch: updated description]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-09-26 16:29:25 -04:00
Christoph Hellwig
444b6e910d nfsd: fix nfsd4_cb_recall_done error handling
For any error that is not EBADHANDLE or NFS4ERR_BAD_STATEID,
nfsd4_cb_recall_done first marks the connection down, then
retries until dl_retries hits zero, then marks the connection down
again and sets cb_done.  This changes the code to only retry
for EBADHANDLE or NFS4ERR_BAD_STATEID, and factors setting
cb_done into a single point in the function.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-09-26 16:29:25 -04:00
Fabian Frederick
6ff66ac77a fs/cachefiles: add missing \n to kerror conversions
Commit 0227d6abb3 ("fs/cachefiles: replace kerror by pr_err") didn't
include newline featuring in original kerror definition

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Reported-by: David Howells <dhowells@redhat.com>
Acked-by: David Howells <dhowells@redhat.com>
Cc: <stable@vger.kernel.org>	[3.16.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-09-26 08:10:35 -07:00
Peter Feiner
87e6d49a00 mm: softdirty: addresses before VMAs in PTE holes aren't softdirty
In PTE holes that contain VM_SOFTDIRTY VMAs, unmapped addresses before
VM_SOFTDIRTY VMAs are reported as softdirty by /proc/pid/pagemap.  This
bug was introduced in commit 68b5a65248 ("mm: softdirty: respect
VM_SOFTDIRTY in PTE holes").  That commit made /proc/pid/pagemap look at
VM_SOFTDIRTY in PTE holes but neglected to observe the start of VMAs
returned by find_vma.

Tested:
  Wrote a selftest that creates a PMD-sized VMA then unmaps the first
  page and asserts that the page is not softdirty. I'm going to send the
  pagemap selftest in a later commit.

Signed-off-by: Peter Feiner <pfeiner@google.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Jamie Liu <jamieliu@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-09-26 08:10:35 -07:00
Joseph Qi
5760a97c71 ocfs2/dlm: do not get resource spinlock if lockres is new
There is a deadlock case which reported by Guozhonghua:
  https://oss.oracle.com/pipermail/ocfs2-devel/2014-September/010079.html

This case is caused by &res->spinlock and &dlm->master_lock
misordering in different threads.

It was introduced by commit 8d400b81cc ("ocfs2/dlm: Clean up refmap
helpers").  Since lockres is new, it doesn't not require the
&res->spinlock.  So remove it.

Fixes: 8d400b81cc ("ocfs2/dlm: Clean up refmap helpers")
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: joyce.xue <xuejiufei@huawei.com>
Reported-by: Guozhonghua <guozhonghua@h3c.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-09-26 08:10:34 -07:00
Andreas Rohner
56d7acc792 nilfs2: fix data loss with mmap()
This bug leads to reproducible silent data loss, despite the use of
msync(), sync() and a clean unmount of the file system.  It is easily
reproducible with the following script:

  ----------------[BEGIN SCRIPT]--------------------
  mkfs.nilfs2 -f /dev/sdb
  mount /dev/sdb /mnt

  dd if=/dev/zero bs=1M count=30 of=/mnt/testfile

  umount /mnt
  mount /dev/sdb /mnt
  CHECKSUM_BEFORE="$(md5sum /mnt/testfile)"

  /root/mmaptest/mmaptest /mnt/testfile 30 10 5

  sync
  CHECKSUM_AFTER="$(md5sum /mnt/testfile)"
  umount /mnt
  mount /dev/sdb /mnt
  CHECKSUM_AFTER_REMOUNT="$(md5sum /mnt/testfile)"
  umount /mnt

  echo "BEFORE MMAP:\t$CHECKSUM_BEFORE"
  echo "AFTER MMAP:\t$CHECKSUM_AFTER"
  echo "AFTER REMOUNT:\t$CHECKSUM_AFTER_REMOUNT"
  ----------------[END SCRIPT]--------------------

The mmaptest tool looks something like this (very simplified, with
error checking removed):

  ----------------[BEGIN mmaptest]--------------------
  data = mmap(NULL, file_size - file_offset, PROT_READ | PROT_WRITE,
              MAP_SHARED, fd, file_offset);

  for (i = 0; i < write_count; ++i) {
        memcpy(data + i * 4096, buf, sizeof(buf));
        msync(data, file_size - file_offset, MS_SYNC))
  }
  ----------------[END mmaptest]--------------------

The output of the script looks something like this:

  BEFORE MMAP:    281ed1d5ae50e8419f9b978aab16de83  /mnt/testfile
  AFTER MMAP:     6604a1c31f10780331a6850371b3a313  /mnt/testfile
  AFTER REMOUNT:  281ed1d5ae50e8419f9b978aab16de83  /mnt/testfile

So it is clear, that the changes done using mmap() do not survive a
remount.  This can be reproduced a 100% of the time.  The problem was
introduced in commit 136e8770cd ("nilfs2: fix issue of
nilfs_set_page_dirty() for page at EOF boundary").

If the page was read with mpage_readpage() or mpage_readpages() for
example, then it has no buffers attached to it.  In that case
page_has_buffers(page) in nilfs_set_page_dirty() will be false.
Therefore nilfs_set_file_dirty() is never called and the pages are never
collected and never written to disk.

This patch fixes the problem by also calling nilfs_set_file_dirty() if the
page has no buffers attached to it.

[akpm@linux-foundation.org: s/PAGE_SHIFT/PAGE_CACHE_SHIFT/]
Signed-off-by: Andreas Rohner <andreas.rohner@gmx.net>
Tested-by: Andreas Rohner <andreas.rohner@gmx.net>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-09-26 08:10:34 -07:00
Joseph Qi
f13a568e5a ocfs2: free vol_label in ocfs2_delete_osb()
osb->vol_label is malloced in ocfs2_initialize_super but not freed if
error occurs or during umount, thus causing a memory leak.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: joyce.xue <xuejiufei@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-09-26 08:10:34 -07:00
hujianyang
4b1a43eab1 UBIFS: Align the dump messages of SB_NODE
I found the dump messages of UBIFS_SB_NODE is not aligned. This
patch remove the extra space from the line which is retracted.

Signed-off-by: hujianyang <hujianyang@huawei.com>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
2014-09-26 13:33:41 +03:00
David Howells
f3f760314a NFS: Fabricate fscache server index key correctly
When fabricating a server index key for fscache, we should clear the index key
buffer before starting to fill it in, not in the middle.

Reported-by: James Pearson <james-p@moving-picture.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Steve Dickson <steved@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-25 21:25:18 -04:00
Trond Myklebust
3fc3edf141 NFSv3: Fix missing includes of nfs3_fs.h
Silence a few warnings about missing symbols that are due to missing
includes of nfs3_fs.h.

Fixes: 00a36a1090 (NFS: Move v3 declarations out of internal.h)
Fixes: cb8c20fa53 (NFS: Move NFS v3 acl functions to nfs3_fs.h)
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-25 16:28:53 -04:00
NeilBrown
1aff525629 NFS/SUNRPC: Remove other deadlock-avoidance mechanisms in nfs_release_page()
Now that nfs_release_page() doesn't block indefinitely, other deadlock
avoidance mechanisms aren't needed.
 - it doesn't hurt for kswapd to block occasionally.  If it doesn't
   want to block it would clear __GFP_WAIT.  The current_is_kswapd()
   was only added to avoid deadlocks and we have a new approach for
   that.
 - memory allocation in the SUNRPC layer can very rarely try to
   ->releasepage() a page it is trying to handle.  The deadlock
   is removed as nfs_release_page() doesn't block indefinitely.

So we don't need to set PF_FSTRANS for sunrpc network operations any
more.

Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-25 08:25:47 -04:00
NeilBrown
353db79662 NFS: avoid waiting at all in nfs_release_page when congested.
If nfs_release_page() is called on a sequence of pages which are all
in the same file which is blocked on COMMIT, each page could
contribute a 1 second delay which could be come excessive.  I have
seen delays of as much as 208 seconds.

To keep the delay to one second, mark the bdi as write-congested
if the commit didn't finished.  Once it does finish, the
write-congested flag will be cleared by nfs_commit_release_pages().

With this, the longest total delay in try_to_free_pages that I have
seen is under 3 seconds.  With no waiting in nfs_release_page at all
I have seen delays of nearly 1.5 seconds.

Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-25 08:25:38 -04:00
NeilBrown
9590544694 NFS: avoid deadlocks with loop-back mounted NFS filesystems.
Support for loop-back mounted NFS filesystems is useful when NFS is
used to access shared storage in a high-availability cluster.

If the node running the NFS server fails, some other node can mount the
filesystem and start providing NFS service.  If that node already had
the filesystem NFS mounted, it will now have it loop-back mounted.

nfsd can suffer a deadlock when allocating memory and entering direct
reclaim.
While direct reclaim does not write to the NFS filesystem it can send
and wait for a COMMIT through nfs_release_page().

This patch modifies nfs_release_page() to wait a limited time for the
commit to complete - one second.  If the commit doesn't complete
in this time, nfs_release_page() will fail.  This means it might now
fail in some cases where it wouldn't before.  These cases are only
when 'gfp' includes '__GFP_WAIT'.

nfs_release_page() is only called by try_to_release_page(), and that
can only be called on an NFS page with required 'gfp' flags from
 - page_cache_pipe_buf_steal() in splice.c
 - shrink_page_list() in vmscan.c
 - invalidate_inode_pages2_range() in truncate.c

The first two handle failure quite safely.  The last is only called
after ->launder_page() has been called, and that will have waited
for the commit to finish already.

So aborting if the commit takes longer than 1 second is perfectly safe.

Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-25 08:25:28 -04:00
NeilBrown
e87b4c7a7a NFS: don't use STABLE writes during writeback.
commit b31268ac79
  FS: Use stable writes when not doing a bulk flush

was a bit heavy handed.
The particular problem that lead to this patch was that
small writes to an O_SYNC file we being written as UNSTABLE writes
followed by a commit.
This is appropriate for large writes (which require multiple NFS
requests) but for small writes (single NFS request), using
NFS_FILE_SYNC is more efficient.

So that patch causes the code to select between the two methods
depending on how many nfs requests get generated.

Unfortunately this ends up applying to non O_SYNC writes as well.
In particular if you memory-map a file and update random pages, then
when they are eventually written out by writeback they will go as
NFS_FILE_SYNC.  This is inefficient and slows down the application.

So: only set FLUSH_COND_STABLE when wbc->sync_mode is WB_SYNC_ALL.
With this patch:
 O_SYNC writes are NFS_FILE_SYNC for single requests, and NFS_UNSTABLE
    followed by COMMIT for multiple requests
 Writing immediately before close of fsync follow the same pattern.
 Non-O_SYNC writes without an fsync of close eventually get flushed
 out as UNSTABLE and a commit follows eventually as appropriate.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-24 23:23:02 -04:00
NeilBrown
8478eaa16e NFSv4: use exponential retry on NFS4ERR_DELAY for async requests.
Currently asynchronous NFSv4 request will be retried with
exponential timeout (from 1/10 to 15 seconds), but async
requests will always use a 15second retry.

Some "async" requests are really synchronous though.  The
async mechanism is used to allow the request to continue if
the requesting process is killed.
In those cases, an exponential retry is appropriate.

For example, if two different clients both open a file and
get a READ delegation, and one client then unlinks the file
(while still holding an open file descriptor), that unlink
will used the "silly-rename" handling which is async.
The first rename will result in NFS4ERR_DELAY while the
delegation is reclaimed from the other client.  The rename
will not be retried for 15 seconds, causing an unlink to take
15 seconds rather than 100msec.

This patch only added exponential timeout for async unlink and
async rename.  Other async calls, such as 'close' are sometimes
waited for so they might benefit from exponential timeout too.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-24 23:22:47 -04:00
Benjamin Coddington
173b3afcee lockd: Try to reconnect if statd has moved
If rpc.statd is restarted, upcalls to monitor hosts can fail with
ECONNREFUSED.  In that case force a lookup of statd's new port and retry the
upcall.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-24 23:08:43 -04:00
Olga Kornievskaia
8faaa6d5d4 Fixing lease renewal
Commit c9fdeb28 removed a 'continue' after checking if the lease needs
to be renewed. However, if client hasn't moved, the code falls down to
starting reboot recovery erroneously (ie., sends open reclaim and gets
back stale_clientid error) before recovering from getting stale_clientid
on the renew operation.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Fixes: c9fdeb280b (NFS: Add basic migration support to state manager thread)
Cc: stable@vger.kernel.org # 3.13+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-24 23:03:15 -04:00
Fabian Frederick
2f3169fb18 nfs: fix duplicate proc entries
Commit 65b38851a1
("NFS: Fix /proc/fs/nfsfs/servers and /proc/fs/nfsfs/volumes")

updated the following function:
static int nfs_volume_list_open(struct inode *inode, struct file *file)

it used &nfs_server_list_ops instead of &nfs_volume_list_ops
which means cat /proc/fs/nfsfs/volumes = /proc/fs/nfsfs/servers

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Fixes: 65b38851a1 (NFS: Fix /proc/fs/nfsfs/servers and...)
Cc: stable@vger.kernel.org # 3.4.x+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-24 23:00:18 -04:00
Tejun Heo
2aad2a86f6 percpu_ref: add PERCPU_REF_INIT_* flags
With the recent addition of percpu_ref_reinit(), percpu_ref now can be
used as a persistent switch which can be turned on and off repeatedly
where turning off maps to killing the ref and waiting for it to drain;
however, there currently isn't a way to initialize a percpu_ref in its
off (killed and drained) state, which can be inconvenient for certain
persistent switch use cases.

Similarly, percpu_ref_switch_to_atomic/percpu() allow dynamic
selection of operation mode; however, currently a newly initialized
percpu_ref is always in percpu mode making it impossible to avoid the
latency overhead of switching to atomic mode.

This patch adds @flags to percpu_ref_init() and implements the
following flags.

* PERCPU_REF_INIT_ATOMIC	: start ref in atomic mode
* PERCPU_REF_INIT_DEAD		: start ref killed and drained

These flags should be able to serve the above two use cases.

v2: target_core_tpg.c conversion was missing.  Fixed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
2014-09-24 13:31:50 -04:00
Tejun Heo
d06efebf0c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block into for-3.18
This is to receive 0a30288da1 ("blk-mq, percpu_ref: implement a
kludge for SCSI blk-mq stall during probe") which implements
__percpu_ref_kill_expedited() to work around SCSI blk-mq stall.  The
commit reverted and patches to implement proper fix will be added.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@lst.de>
2014-09-24 13:00:21 -04:00
Jaegeuk Kim
95dd897301 f2fs: use more free segments until SSR is activated
Previously, f2fs activates SSR if the # of free segments reaches to the # of
overprovisioned segments.
In this case, SSR starts to use dirty segments only, so that the overprovisoned
space cannot be selected for new data.
This means that we have no chance to utilizae the overprovisioned space at all.

This patch fixes that by allowing LFS allocations until the # of free segments
reaches to the last threshold, reserved space.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-23 11:10:24 -07:00
Jaegeuk Kim
9b5f136fd4 f2fs: change the ipu_policy option to enable combinations
This patch changes the ipu_policy setting to use any combination of orthogonal policies.

Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-23 11:10:24 -07:00
Chao Yu
210f41bc04 f2fs: fix to search whole dirty segmap when get_victim
In ->get_victim we get max_search value from dirty_i->nr_dirty without
protection of seglist_lock, after that, nr_dirty can be increased/decreased
before we hold seglist_lock lock.
Then in main loop we attempt to traverse all dirty section one time to find
victim section, but it's not accurate to use max_search as the total loop count,
because we might lose checking several sections or check sections redundantly
for the case of nr_dirty are increased or decreased previously.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-23 11:10:23 -07:00
Chao Yu
26666c8a43 f2fs: fix to clean previous mount option when remount_fs
In manual of mount, we descript remount as below:

"mount -o remount,rw /dev/foo /dir
After  this call all old mount options are replaced and arbitrary stuff from
fstab is ignored, except the loop= option which is internally generated and
maintained by the mount command."

Previously f2fs do not clear up old mount options when remount_fs, so we have no
chance of disabling previous option (e.g. flush_merge). Fix it.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-23 11:10:22 -07:00
Chao Yu
14cecc5cd6 f2fs: skip punching hole in special condition
Now punching hole in directory is not supported in f2fs, so let's limit file
type in punch_hole().

In addition, in punch_hole if offset is exceed file size, we should skip
punching hole.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-23 11:10:21 -07:00
Chao Yu
55cf9cb63f f2fs: support large sector size
Block size in f2fs is 4096 bytes, so theoretically, f2fs can support 4096 bytes
sector device at maximum. But now f2fs only support 512 bytes size sector, so
block device such as zRAM which uses page cache as its block storage space will
not be mounted successfully as mismatch between sector size of zRAM and sector
size of f2fs supported.

In this patch we support large sector size in f2fs, so block device with sector
size of 512/1024/2048/4096 bytes can be supported in f2fs.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-23 11:10:20 -07:00
Chao Yu
09db6a2ef8 f2fs: fix to truncate blocks past EOF in ->setattr
By using FALLOC_FL_KEEP_SIZE in ->fallocate of f2fs, we can fallocate block past
EOF without changing i_size of inode. These blocks past EOF will not be
truncated in ->setattr as we truncate them only when change the file size.

We should give a chance to truncate blocks out of filesize in setattr().

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-23 11:10:20 -07:00
Jaegeuk Kim
976e4c50ae f2fs: update i_size when __allocate_data_block
The f2fs_direct_IO uses __allocate_data_block, but inside the allocation path,
we should update i_size at the changed time to update its inode page.
Otherwise, we can get wrong i_size after roll-forward recovery.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-23 11:10:19 -07:00
Jaegeuk Kim
90a893c749 f2fs: use MAX_BIO_BLOCKS(sbi)
This patch cleans up a simple macro.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-23 11:10:18 -07:00