Commit graph

31709 commits

Author SHA1 Message Date
Theodore Ts'o
d76a3a7711 ext4/jbd2: don't wait (forever) for stale tid caused by wraparound
In the case where an inode has a very stale transaction id (tid) in
i_datasync_tid or i_sync_tid, it's possible that after a very large
(2**31) number of transactions, that the tid number space might wrap,
causing tid_geq()'s calculations to fail.

Commit deeeaf13 "jbd2: fix fsync() tid wraparound bug", later modified
by commit e7b04ac0 "jbd2: don't wake kjournald unnecessarily",
attempted to fix this problem, but it only avoided kjournald spinning
forever by fixing the logic in jbd2_log_start_commit().

Unfortunately, in the codepaths in fs/ext4/fsync.c and fs/ext4/inode.c
that might call jbd2_log_start_commit() with a stale tid, those
functions will subsequently call jbd2_log_wait_commit() with the same
stale tid, and then wait for a very long time.  To fix this, we
replace the calls to jbd2_log_start_commit() and
jbd2_log_wait_commit() with a call to a new function,
jbd2_complete_transaction(), which will correctly handle stale tid's.

As a bonus, jbd2_complete_transaction() will avoid locking
j_state_lock for writing unless a commit needs to be started.  This
should have a small (but probably not measurable) improvement for
ext4's scalability.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reported-by: Ben Hutchings <ben@decadent.org.uk>
Reported-by: George Barnett <gbarnett@atlassian.com>
Cc: stable@vger.kernel.org
2013-04-03 22:02:52 -04:00
Theodore Ts'o
b10a44c369 ext4: add might_sleep() annotations
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
2013-04-03 22:00:52 -04:00
Theodore Ts'o
19b5ef6157 ext4: add mutex_is_locked() assertion to ext4_truncate()
[ Added fixup from Lukáš Czerner which only checks the assertion when
  the inode is not new and is not being freed. ]

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-03 21:58:52 -04:00
fanchaoting
ff7c4b3693 nfsd: remove /proc/fs/nfs when create /proc/fs/nfs/exports error
when create /proc/fs/nfs/exports error, we should remove /proc/fs/nfs,
if don't do it, it maybe cause Memory leak.

 Signed-off-by: fanchaoting <fanchaoting@cn.fujitsu.com>
 Reviewed-by: chendt.fnst <chendt.fnst@cn.fujitsu.com>

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-04-03 15:30:07 -04:00
fanchaoting
b022032e19 nfsd: don't run get_file if nfs4_preprocess_stateid_op return error
we should return error status directly when nfs4_preprocess_stateid_op
return error.

Signed-off-by: fanchaoting <fanchaoting@cn.fujitsu.com>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-04-03 15:19:06 -04:00
Jeff Layton
89876f8c0d nfsd: convert the file_hashtbl to a hlist
We only ever traverse the hash chains in the forward direction, so a
double pointer list head isn't really necessary.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-04-03 15:11:04 -04:00
Linus Torvalds
cbfa0e7204 Unfortunately, we introduced some big-endian bugs during the last
merge window.  Fortunately, Cai and Christian noticed before 3.9
 shipped.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABCAAGBQJRXG7FAAoJENNvdpvBGATw7GAQALbl5LxmVmGR6JRQzNoINP+H
 v29ulo1Kly4j2vt+3b0rXKv9axWl0C/dItFlC/9WqmwuB/0BptBKIVnkKH+6zu2v
 F+cO41gfpJo3ozcgsCrjvWfdkTWbjbPTQ4XiQDFILkwiB4R9KdpynKcVcjDY+gQE
 umwJpXwDDd+fdr4FNQiFFPqd8rCC8fEeClWTtOFx7UidKl8v18iZ0/OPiAr+jBOY
 rlcaZ9F8nmOJTwgriGbod4X827xEDj7Jwe7/C6oy/lKLOTLhaahgHPDW/l0O4KZA
 4eJLj/5nxmYling4Y+rQvglVhNJ4LNv+IAXu5IpqRxosPYFnxQq+JYn8D5BlXifd
 0/hG+BwTkhm4RLJ8uQvUxxglZNQEWeSuIma4dnZX3Xf9AzsvNW9x3Iilj3F7dhUS
 6h9aeoYKv9y7GY9Out1P/UZYVi4HmB3jHiOcdTNCK4plQ3Sn2NYMw6RK1z4cXvE+
 Pokc0a9KNyusNSI83tDtjRjan9NzsRbTggoGVf19RVoIVqIjkyXzUGasO/y+mKhp
 LENAjkABdbLB1Re8B/99KwgIloUTvxGcojLKzkEbgcobruvEwKvxIrTi+fgNOiu6
 GqJOh8TwZtx3SGJujsyOSBBrdPfjPHReBWrX0VRHl/Wsd4RWCaDT8H1EdNONQ+to
 lQ+JvTZgFwQB2GABjNB6
 =n1ir
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fixes from Ted Ts'o:
 "Unfortunately, we introduced some big-endian bugs during the last
  merge window.  Fortunately, Cai and Christian noticed before 3.9
  shipped."

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: fix big-endian bugs which could cause fs corruptions
2013-04-03 11:21:13 -07:00
Rich Johnston
3d6e036193 xfs: Add ratelimited printk for different alert levels
Ratelimited printk will be useful in printing xfs messages which are otherwise
not required to be printed always due to their high rate (to prevent kernel ring
buffer from overflowing), while at the same time required to be printed.

Signed-off-by: Raghavendra D Prabhu <rprabhu@wnohang.net>
Reviewed-by: Rich Johnston <rjohnston@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-04-03 13:20:39 -05:00
Ming Lei
f7db5e7660 sysfs: fix use after free in case of concurrent read/write and readdir
The inode->i_mutex isn't hold when updating filp->f_pos
in read()/write(), so the filp->f_pos might be read as
0 or 1 in readdir() when there is concurrent read()/write()
on this same file, then may cause use after free in readdir().

The bug can be reproduced with Li Zefan's test code on the
link:

	https://patchwork.kernel.org/patch/2160771/

This patch fixes the use after free under this situation.

Cc: stable <stable@vger.kernel.org>
Reported-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-04-03 11:09:02 -07:00
Linus Torvalds
cd0e4a9dd4 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull reiserfs fix from Jan Kara:
 "A fix for reiserfs xattr bug exposed by changes to lookup_one_len()"

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  reiserfs: Fix warning and inode leak when deleting inode with xattrs
2013-04-03 10:49:27 -07:00
Theodore Ts'o
819c4920b7 ext4: refactor truncate code
Move common code in ext4_ind_truncate() and ext4_ext_truncate() into
ext4_truncate().  This saves over 60 lines of code.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-03 12:47:17 -04:00
Theodore Ts'o
26a4c0c6cc ext4: refactor punch hole code
Move common code in ext4_ind_punch_hole() and ext4_ext_punch_hole()
into ext4_punch_hole().  This saves over 150 lines of code.

This also fixes a potential bug when the punch_hole() code is racing
against indirect-to-extents or extents-to-indirect migation.  We are
currently using i_mutex to protect against changes to the inode flag;
specifically, the append-only, immutable, and extents inode flags.  So
we need to take i_mutex before deciding whether to use the
extents-specific or indirect-specific punch_hole code.

Also, there was a missing call to ext4_inode_block_unlocked_dio() in
the indirect punch codepath.  This was added in commit 02d262dffc
to block DIO readers racing against the punch operation in the
codepath for extent-mapped inodes, but it was missing for
indirect-block mapped inodes.  One of the advantages of refactoring
the code is that it makes such oversights much less likely.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-03 12:45:17 -04:00
Theodore Ts'o
781f143ea0 ext4: fold ext4_alloc_blocks() in ext4_alloc_branch()
The older code was far more complicated than it needed to be because
of how we spliced in the ext4's new multiblock allocator into ext3's
indirect block code.  By folding ext4_alloc_blocks() into
ext4_alloc_branch(), we make the code far more understable, shave off
over 130 lines of code and half a kilobyte of compiled object code.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-03 12:43:17 -04:00
Zheng Liu
eed4333f08 ext4: fold ext4_generic_write_end() into ext4_write_end()
After collapsing the handling of data ordered and data writeback
codepath, ext4_generic_write_end() has only one caller,
ext4_write_end().  So we fold it into ext4_write_end().

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
2013-04-03 12:41:17 -04:00
Theodore Ts'o
74d553aad7 ext4: collapse handling of data=ordered and data=writeback codepaths
The only difference between how we handle data=ordered and
data=writeback is a single call to ext4_jbd2_file_inode().  Eliminate
code duplication by factoring out redundant the code paths.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
2013-04-03 12:39:17 -04:00
Zheng Liu
8cde7ad17e ext4: fix big-endian bugs which could cause fs corruptions
When an extent was zeroed out, we forgot to do convert from cpu to le16.
It could make us hit a BUG_ON when we try to write dirty pages out.  So
fix it.

[ Also fix a bug found by Dmitry Monakhov where we were missing
  le32_to_cpu() calls in the new indirect punch hole code.

  There are a number of other big endian warnings found by static code
  analyzers, but we'll wait for the next merge window to fix them all
  up.  These fixes are designed to be Obviously Correct by code
  inspection, and easy to demonstrate that it won't make any
  difference (and hence, won't introduce any bugs) on little endian
  architectures such as x86.  --tytso ]

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reported-by: CAI Qian <caiqian@redhat.com>
Reported-by: Christian Kujau <lists@nerdbynature.de>
Cc: Dmitry Monakhov <dmonakhov@openvz.org>
2013-04-03 12:37:17 -04:00
J. Bruce Fields
66b2b9b2b0 nfsd4: don't destroy in-use session
This changes session destruction to be similar to client destruction in
that attempts to destroy a session while in use (which should be rare
corner cases) result in DELAY.  This simplifies things somewhat and
helps meet a coming 4.2 requirement.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-04-03 11:48:40 -04:00
J. Bruce Fields
221a687669 nfsd4: don't destroy in-use clients
When a setclientid_confirm or create_session confirms a client after a
client reboot, it also destroys any previous state held by that client.

The shutdown of that previous state must be careful not to free the
client out from under threads processing other requests that refer to
the client.

This is a particular problem in the NFSv4.1 case when we hold a
reference to a session (hence a client) throughout compound processing.

The server attempts to handle this by unhashing the client at the time
it's destroyed, then delaying the final free to the end.  But this still
leaves some races in the current code.

I believe it's simpler just to fail the attempt to destroy the client by
returning NFS4ERR_DELAY.  This is a case that should never happen
anyway.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-04-03 11:48:39 -04:00
J. Bruce Fields
4f6e6c1773 nfsd4: simplify bind_conn_to_session locking
The locking here is very fiddly, and there's no reason for us to be
setting cstate->session, since this is the only op in the compound.
Let's just take the state lock and drop the reference counting.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-04-03 11:48:39 -04:00
J. Bruce Fields
abcdff09a0 nfsd4: fix destroy_session race
destroy_session uses the session and client without continuously holding
any reference or locks.

Put the whole thing under the state lock for now.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-04-03 11:48:38 -04:00
J. Bruce Fields
bfa85e83a8 nfsd4: clientid lookup cleanup
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-04-03 11:48:37 -04:00
J. Bruce Fields
c0293b0131 nfsd4: destroy_clientid simplification
I'm not sure what the check for clientid expiry was meant to do here.

The check for a matching session is redundant given the previous check
for state: a client without state is, in particular, a client without
sessions.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-04-03 11:48:36 -04:00
J. Bruce Fields
1ca507920d nfsd4: remove some dprintk's
E.g. printk's that just report the return value from an op are
uninteresting as we already do that in the main proc_compound loop.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-04-03 11:48:36 -04:00
J. Bruce Fields
0eb6f20aa5 nfsd4: STALE_STATEID cleanup
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-04-03 11:48:35 -04:00
J. Bruce Fields
78389046f7 nfsd4: warn on odd create_session state
This should never happen.

(Note: the comparable case in setclientid_confirm *can* happen, since
updating a client record can result in both confirmed and unconfirmed
records with the same clientid.)

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-04-03 11:48:34 -04:00
ycnian@gmail.com
491402a787 nfsd: fix bug on nfs4 stateid deallocation
NFS4_OO_PURGE_CLOSE is not handled properly. To avoid memory leak, nfs4
stateid which is pointed by oo_last_closed_stid is freed in nfsd4_close(),
but NFS4_OO_PURGE_CLOSE isn't cleared meanwhile. So the stateid released in
THIS close procedure may be freed immediately in the coming encoding function.
Sorry that Signed-off-by was forgotten in last version.

Signed-off-by: Yanchuan Nian <ycnian@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-04-03 11:48:34 -04:00
Yanchuan Nian
9c6bdbb8dd nfsd: remove unused macro in nfsv4
lk_rflags is never used anywhere, and rflags is not defined in struct
nfsd4_lock.

Signed-off-by: Yanchuan Nian <ycnian@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-04-03 11:48:33 -04:00
J. Bruce Fields
2e4b7239a6 nfsd4: fix use-after-free of 4.1 client on connection loss
Once we drop the lock here there's nothing keeping the client around:
the only lock still held is the xpt_lock on this socket, but this socket
no longer has any connection with the client so there's no way for other
code to know we're still using the client.

The solution is simple: all nfsd4_probe_callback does is set a few
variables and queue some work, so there's no reason we can't just keep
it under the lock.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-04-03 11:48:32 -04:00
J. Bruce Fields
b0a9d3ab57 nfsd4: fix race on client shutdown
Dropping the session's reference count after the client's means we leave
a window where the session's se_client pointer is NULL.  An xpt_user
callback that encounters such a session may then crash:

[  303.956011] BUG: unable to handle kernel NULL pointer dereference at 0000000000000318
[  303.959061] IP: [<ffffffff81481a8e>] _raw_spin_lock+0x1e/0x40
[  303.959061] PGD 37811067 PUD 3d498067 PMD 0
[  303.959061] Oops: 0002 [#8] PREEMPT SMP
[  303.959061] Modules linked in: md5 nfsd auth_rpcgss nfs_acl snd_hda_intel snd_hda_codec snd_hwdep snd_pcm snd_page_alloc microcode psmouse snd_timer serio_raw pcspkr evdev snd soundcore i2c_piix4 i2c_core intel_agp intel_gtt processor button nfs lockd sunrpc fscache ata_generic pata_acpi ata_piix uhci_hcd libata btrfs usbcore usb_common crc32c scsi_mod libcrc32c zlib_deflate floppy virtio_balloon virtio_net virtio_pci virtio_blk virtio_ring virtio
[  303.959061] CPU 0
[  303.959061] Pid: 264, comm: nfsd Tainted: G      D      3.8.0-ARCH+ #156 Bochs Bochs
[  303.959061] RIP: 0010:[<ffffffff81481a8e>]  [<ffffffff81481a8e>] _raw_spin_lock+0x1e/0x40
[  303.959061] RSP: 0018:ffff880037877dd8  EFLAGS: 00010202
[  303.959061] RAX: 0000000000000100 RBX: ffff880037a2b698 RCX: ffff88003d879278
[  303.959061] RDX: ffff88003d879278 RSI: dead000000100100 RDI: 0000000000000318
[  303.959061] RBP: ffff880037877dd8 R08: ffff88003c5a0f00 R09: 0000000000000002
[  303.959061] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
[  303.959061] R13: 0000000000000318 R14: ffff880037a2b680 R15: ffff88003c1cbe00
[  303.959061] FS:  0000000000000000(0000) GS:ffff88003fc00000(0000) knlGS:0000000000000000
[  303.959061] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  303.959061] CR2: 0000000000000318 CR3: 000000003d49c000 CR4: 00000000000006f0
[  303.959061] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  303.959061] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  303.959061] Process nfsd (pid: 264, threadinfo ffff880037876000, task ffff88003c1fd0a0)
[  303.959061] Stack:
[  303.959061]  ffff880037877e08 ffffffffa03772ec ffff88003d879000 ffff88003d879278
[  303.959061]  ffff88003d879080 0000000000000000 ffff880037877e38 ffffffffa0222a1f
[  303.959061]  0000000000107ac0 ffff88003c22e000 ffff88003d879000 ffff88003c1cbe00
[  303.959061] Call Trace:
[  303.959061]  [<ffffffffa03772ec>] nfsd4_conn_lost+0x3c/0xa0 [nfsd]
[  303.959061]  [<ffffffffa0222a1f>] svc_delete_xprt+0x10f/0x180 [sunrpc]
[  303.959061]  [<ffffffffa0223d96>] svc_recv+0xe6/0x580 [sunrpc]
[  303.959061]  [<ffffffffa03587c5>] nfsd+0xb5/0x140 [nfsd]
[  303.959061]  [<ffffffffa0358710>] ? nfsd_destroy+0x90/0x90 [nfsd]
[  303.959061]  [<ffffffff8107ae00>] kthread+0xc0/0xd0
[  303.959061]  [<ffffffff81010000>] ? perf_trace_xen_mmu_set_pte_at+0x50/0x100
[  303.959061]  [<ffffffff8107ad40>] ? kthread_freezable_should_stop+0x70/0x70
[  303.959061]  [<ffffffff814898ec>] ret_from_fork+0x7c/0xb0
[  303.959061]  [<ffffffff8107ad40>] ? kthread_freezable_should_stop+0x70/0x70
[  303.959061] Code: ff ff 5d c3 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 65 48 8b 04 25 f0 c6 00 00 48 89 e5 83 80 44 e0 ff ff 01 b8 00 01 00 00 <3e> 66 0f c1 07 0f b6 d4 38 c2 74 0f 66 0f 1f 44 00 00 f3 90 0f
[  303.959061] RIP  [<ffffffff81481a8e>] _raw_spin_lock+0x1e/0x40
[  303.959061]  RSP <ffff880037877dd8>
[  303.959061] CR2: 0000000000000318
[  304.001218] ---[ end trace 2d809cd4a7931f5a ]---
[  304.001903] note: nfsd[264] exited with preempt_count 2

Reported-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-04-03 11:48:31 -04:00
J. Bruce Fields
9d313b17db nfsd4: handle seqid-mutating open errors from xdr decoding
If a client sets an owner (or group_owner or acl) attribute on open for
create, and the mapping of that owner to an id fails, then we return
BAD_OWNER.  But BAD_OWNER is a seqid-mutating error, so we can't
shortcut the open processing that case: we have to at least look up the
owner so we can find the seqid to bump.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-04-03 11:47:53 -04:00
J. Bruce Fields
b600de7ab9 nfsd4: remove BUG_ON
This BUG_ON just crashes the thread a little earlier than it would
otherwise--it doesn't seem useful.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-04-03 11:47:45 -04:00
Jeff Layton
0733c7ba1e nfsd: scale up the number of DRC hash buckets with cache size
We've now increased the size of the duplicate reply cache by quite a
bit, but the number of hash buckets has not changed. So, we've gone from
an average hash chain length of 16 in the old code to 4096 when the
cache is its largest. Change the code to scale out the number of buckets
with the max size of the cache.

At the same time, we also need to fix the hash function since the
existing one isn't really suitable when there are more than 256 buckets.
Move instead to use the stock hash_32 function for this. Testing on a
machine that had 2048 buckets showed that this gave a smaller
longest:average ratio than the existing hash function:

The formula here is longest hash bucket searched divided by average
number of entries per bucket at the time that we saw that longest
bucket:

    old hash: 68/(39258/2048) == 3.547404
    hash_32:  45/(33773/2048) == 2.728807

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-04-03 11:47:25 -04:00
Jeff Layton
98d821bda1 nfsd: keep stats on worst hash balancing seen so far
The typical case with the DRC is a cache miss, so if we keep track of
the max number of entries that we've ever walked over in a search, then
we should have a reasonable estimate of the longest hash chain that
we've ever seen.

With that, we'll also keep track of the total size of the cache when we
see the longest chain. In the case of a tie, we prefer to track the
smallest total cache size in order to properly gauge the worst-case
ratio of max vs. avg chain length.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-04-03 11:47:25 -04:00
Jeff Layton
a2f999a37e nfsd: add new reply_cache_stats file in nfsdfs
For presenting statistics relating to duplicate reply cache.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-04-03 11:47:24 -04:00
Jeff Layton
6c6910cd4d nfsd: track memory utilization by the DRC
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-04-03 11:47:23 -04:00
Jeff Layton
9dc56143c2 nfsd: break out comparator into separate function
Break out the function that compares the rqstp and checksum against a
reply cache entry. While we're at it, track the efficacy of the checksum
over the NFS data by tracking the cases where we would have incorrectly
matched a DRC entry if we had not tracked it or the length.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-04-03 11:47:22 -04:00
Jeff Layton
0b9ea37f24 nfsd: eliminate one of the DRC cache searches
The most common case is to do a search of the cache, followed by an
insert. In the case where we have to allocate an entry off the slab,
then we end up having to redo the search, which is wasteful.

Better optimize the code for the common case by eliminating the initial
search of the cache and always preallocating an entry. In the case of a
cache hit, we'll end up just freeing that entry but that's preferable to
an extra search.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-04-03 11:47:22 -04:00
Jaegeuk Kim
49952fa182 f2fs: reduce redundant spin_lock operations
This patch reduces redundant spin_lock operations in alloc_nid_failed().
The alloc_nid_failed() does not need to delete entry and add one again
by triggering spin_lock and spin_unlock redundantly.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-04-03 22:19:03 +09:00
P J P
cfb185a148 f2fs: add NULL pointer check
Commit - fa9150a84c - replaces a call to generic_writepages() in
f2fs_write_data_pages() with write_cache_pages(), with a function pointer
argument pointing to routine: __f2fs_writepage.

  -> https://git.kernel.org/linus/fa9150a84ca333f68127097c4fa1eda4b3913a22

  This patch adds a NULL pointer check in f2fs_write_data_pages() to avoid
  a possible NULL pointer dereference, in case if - mapping->a_ops->writepage -
  is NULL.

Signed-off-by: P J P <ppandit@redhat.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-04-03 17:27:52 +09:00
Jaegeuk Kim
b2f2c390c5 f2fs: fix the bitmap consistency of dirty segments
Like below, there are 8 segment bitmaps for SSR victim candidates.

enum dirty_type {
	DIRTY_HOT_DATA,		/* dirty segments assigned as hot data logs */
	DIRTY_WARM_DATA,	/* dirty segments assigned as warm data logs */
	DIRTY_COLD_DATA,	/* dirty segments assigned as cold data logs */
	DIRTY_HOT_NODE,		/* dirty segments assigned as hot node logs */
	DIRTY_WARM_NODE,	/* dirty segments assigned as warm node logs */
	DIRTY_COLD_NODE,	/* dirty segments assigned as cold node logs */
	DIRTY,			/* to count # of dirty segments */
	PRE,			/* to count # of entirely obsolete segments */
	NR_DIRTY_TYPE
};

The upper 6 bitmaps indicates segments dirtied by active log areas respectively.
And, the DIRTY bitmap integrates all the 6 bitmaps.

For example,
 o DIRTY_HOT_DATA : 1010000
 o DIRTY_WARM_DATA: 0100000
 o DIRTY_COLD_DATA: 0001000
 o DIRTY_HOT_NODE : 0000010
 o DIRTY_WARM_NODE: 0000001
 o DIRTY_COLD_NODE: 0000000
In this case,
 o DIRTY          : 1111011,

 which means that we should guarantee the consistency between DIRTY and other
 bitmaps concreately.

However, the SSR mode selects victims freely from any log types, which can set
multiple bits across the various bitmap types.

So, this patch eliminates this inconsistency.

Reviewed-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-04-03 17:27:51 +09:00
Jaegeuk Kim
b74737541c f2fs: avoid race for summary information
In order to do GC more reliably, I'd like to lock the vicitm summary page
until its GC is completed, and also prevent any checkpoint process.

Reviewed-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-04-03 17:27:51 +09:00
Jaegeuk Kim
60374688a1 f2fs: allocate remained free segments in the LFS mode
This patch adds a new condition that allocates free segments in the current
active section even if SSR is needed.
Otherwise, f2fs cannot allocate remained free segments in the section since
SSR finds dirty segments only.

Reviewed-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-04-03 17:27:50 +09:00
Jaegeuk Kim
4ebefc4443 f2fs: check completion of foreground GC
The foreground GCs are triggered under not enough free sections.
So, we should not skip moving valid blocks in the victim segments.

Reviewed-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-04-03 17:27:50 +09:00
Jaegeuk Kim
5ec4e49f9b f2fs: change GC bitmaps to apply the section granularity
This patch removes a bitmap for victim segments selected by foreground GC, and
modifies the other bitmap for victim segments selected by background GC.

1) foreground GC bitmap
 : We don't need to manage this, since we just only one previous victim section
   number instead of the whole victim history.
   The f2fs uses the victim section number in order not to allocate currently
   GC'ed section to current active logs.

2) background GC bitmap
 : This bitmap is used to avoid selecting victims repeatedly by background GCs.
   In addition, the victims are able to be selected by foreground GCs, since
   there is no need to read victim blocks during foreground GCs.

   By the fact that the foreground GC reclaims segments in a section unit, it'd
   be better to manage this bitmap based on the section granularity.

Reviewed-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-04-03 17:27:49 +09:00
Jaegeuk Kim
33afa7fde0 f2fs: allocate new segment aligned with sections
When allocating a new segment under the LFS mode, we should keep the section
boundary.

Reviewed-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-04-03 17:27:49 +09:00
Jaegeuk Kim
56ae674cc2 f2fs: remove redundant lock_page calls
In get_node_page, we do not need to call lock_page all the time.

If the node page is cached as uptodate,

1. grab_cache_page locks the page,
2. read_node_page unlocks the page, and
3. lock_page is called for further process.

Let's avoid this.

Reviewed-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-04-03 17:27:42 +09:00
Jaegeuk Kim
53cf95222f f2fs: introduce TOTAL_SECS macro
Let's use a macro to get the total number of sections.

Reviewed-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-04-03 16:23:10 +09:00
Jaegeuk Kim
5c773ba33a f2fs: do not use duplicate names in a macro
A macro should not use duplicate parameter names.

Reviewed-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-04-03 16:22:44 +09:00
Linus Torvalds
f8e9248dbb Merge branch 'for-3.9' of git://linux-nfs.org/~bfields/linux
Pull nfsd bugfix from J Bruce Fields:
 "An xdr decoding error--thanks, Toralf Förster, and Trinity!"

* 'for-3.9' of git://linux-nfs.org/~bfields/linux:
  nfsd4: reject "negative" acl lengths
2013-04-02 07:56:20 -07:00
Jens Axboe
64f8de4da7 Merge branch 'writeback-workqueue' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq into for-3.10/core
Tejun writes:

-----

This is the pull request for the earlier patchset[1] with the same
name.  It's only three patches (the first one was committed to
workqueue tree) but the merge strategy is a bit involved due to the
dependencies.

* Because the conversion needs features from wq/for-3.10,
  block/for-3.10/core is based on rc3, and wq/for-3.10 has conflicts
  with rc3, I pulled mainline (rc5) into wq/for-3.10 to prevent those
  workqueue conflicts from flaring up in block tree.

* Resolving the issue that Jan and Dave raised about debugging
  requires arch-wide changes.  The patchset is being worked on[2] but
  it'll have to go through -mm after these changes show up in -next,
  and not included in this pull request.

The three commits are located in the following git branch.

  git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git writeback-workqueue

Pulling it into block/for-3.10/core produces a conflict in
drivers/md/raid5.c between the following two commits.

  e3620a3ad5 ("MD RAID5: Avoid accessing gendisk or queue structs when not available")
  2f6db2a707 ("raid5: use bio_reset()")

The conflict is trivial - one removes an "if ()" conditional while the
other removes "rbi->bi_next = NULL" right above it.  We just need to
remove both.  The merged branch is available at

  git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git block-test-merge

so that you can use it for verification.  The test merge commit has
proper merge description.

While these changes are a bit of pain to route, they make code simpler
and even have, while minute, measureable performance gain[3] even on a
workload which isn't particularly favorable to showing the benefits of
this conversion.

----

Fixed up the conflict.

Conflicts:
	drivers/md/raid5.c

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-04-02 10:04:39 +02:00