In the case where an inode has a very stale transaction id (tid) in
i_datasync_tid or i_sync_tid, it's possible that after a very large
(2**31) number of transactions, that the tid number space might wrap,
causing tid_geq()'s calculations to fail.
Commit deeeaf13 "jbd2: fix fsync() tid wraparound bug", later modified
by commit e7b04ac0 "jbd2: don't wake kjournald unnecessarily",
attempted to fix this problem, but it only avoided kjournald spinning
forever by fixing the logic in jbd2_log_start_commit().
Unfortunately, in the codepaths in fs/ext4/fsync.c and fs/ext4/inode.c
that might call jbd2_log_start_commit() with a stale tid, those
functions will subsequently call jbd2_log_wait_commit() with the same
stale tid, and then wait for a very long time. To fix this, we
replace the calls to jbd2_log_start_commit() and
jbd2_log_wait_commit() with a call to a new function,
jbd2_complete_transaction(), which will correctly handle stale tid's.
As a bonus, jbd2_complete_transaction() will avoid locking
j_state_lock for writing unless a commit needs to be started. This
should have a small (but probably not measurable) improvement for
ext4's scalability.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reported-by: Ben Hutchings <ben@decadent.org.uk>
Reported-by: George Barnett <gbarnett@atlassian.com>
Cc: stable@vger.kernel.org
[ Added fixup from Lukáš Czerner which only checks the assertion when
the inode is not new and is not being freed. ]
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
when create /proc/fs/nfs/exports error, we should remove /proc/fs/nfs,
if don't do it, it maybe cause Memory leak.
Signed-off-by: fanchaoting <fanchaoting@cn.fujitsu.com>
Reviewed-by: chendt.fnst <chendt.fnst@cn.fujitsu.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
we should return error status directly when nfs4_preprocess_stateid_op
return error.
Signed-off-by: fanchaoting <fanchaoting@cn.fujitsu.com>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
We only ever traverse the hash chains in the forward direction, so a
double pointer list head isn't really necessary.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
merge window. Fortunately, Cai and Christian noticed before 3.9
shipped.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIcBAABCAAGBQJRXG7FAAoJENNvdpvBGATw7GAQALbl5LxmVmGR6JRQzNoINP+H
v29ulo1Kly4j2vt+3b0rXKv9axWl0C/dItFlC/9WqmwuB/0BptBKIVnkKH+6zu2v
F+cO41gfpJo3ozcgsCrjvWfdkTWbjbPTQ4XiQDFILkwiB4R9KdpynKcVcjDY+gQE
umwJpXwDDd+fdr4FNQiFFPqd8rCC8fEeClWTtOFx7UidKl8v18iZ0/OPiAr+jBOY
rlcaZ9F8nmOJTwgriGbod4X827xEDj7Jwe7/C6oy/lKLOTLhaahgHPDW/l0O4KZA
4eJLj/5nxmYling4Y+rQvglVhNJ4LNv+IAXu5IpqRxosPYFnxQq+JYn8D5BlXifd
0/hG+BwTkhm4RLJ8uQvUxxglZNQEWeSuIma4dnZX3Xf9AzsvNW9x3Iilj3F7dhUS
6h9aeoYKv9y7GY9Out1P/UZYVi4HmB3jHiOcdTNCK4plQ3Sn2NYMw6RK1z4cXvE+
Pokc0a9KNyusNSI83tDtjRjan9NzsRbTggoGVf19RVoIVqIjkyXzUGasO/y+mKhp
LENAjkABdbLB1Re8B/99KwgIloUTvxGcojLKzkEbgcobruvEwKvxIrTi+fgNOiu6
GqJOh8TwZtx3SGJujsyOSBBrdPfjPHReBWrX0VRHl/Wsd4RWCaDT8H1EdNONQ+to
lQ+JvTZgFwQB2GABjNB6
=n1ir
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 fixes from Ted Ts'o:
"Unfortunately, we introduced some big-endian bugs during the last
merge window. Fortunately, Cai and Christian noticed before 3.9
shipped."
* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: fix big-endian bugs which could cause fs corruptions
Ratelimited printk will be useful in printing xfs messages which are otherwise
not required to be printed always due to their high rate (to prevent kernel ring
buffer from overflowing), while at the same time required to be printed.
Signed-off-by: Raghavendra D Prabhu <rprabhu@wnohang.net>
Reviewed-by: Rich Johnston <rjohnston@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
The inode->i_mutex isn't hold when updating filp->f_pos
in read()/write(), so the filp->f_pos might be read as
0 or 1 in readdir() when there is concurrent read()/write()
on this same file, then may cause use after free in readdir().
The bug can be reproduced with Li Zefan's test code on the
link:
https://patchwork.kernel.org/patch/2160771/
This patch fixes the use after free under this situation.
Cc: stable <stable@vger.kernel.org>
Reported-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Pull reiserfs fix from Jan Kara:
"A fix for reiserfs xattr bug exposed by changes to lookup_one_len()"
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
reiserfs: Fix warning and inode leak when deleting inode with xattrs
Move common code in ext4_ind_truncate() and ext4_ext_truncate() into
ext4_truncate(). This saves over 60 lines of code.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Move common code in ext4_ind_punch_hole() and ext4_ext_punch_hole()
into ext4_punch_hole(). This saves over 150 lines of code.
This also fixes a potential bug when the punch_hole() code is racing
against indirect-to-extents or extents-to-indirect migation. We are
currently using i_mutex to protect against changes to the inode flag;
specifically, the append-only, immutable, and extents inode flags. So
we need to take i_mutex before deciding whether to use the
extents-specific or indirect-specific punch_hole code.
Also, there was a missing call to ext4_inode_block_unlocked_dio() in
the indirect punch codepath. This was added in commit 02d262dffc
to block DIO readers racing against the punch operation in the
codepath for extent-mapped inodes, but it was missing for
indirect-block mapped inodes. One of the advantages of refactoring
the code is that it makes such oversights much less likely.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The older code was far more complicated than it needed to be because
of how we spliced in the ext4's new multiblock allocator into ext3's
indirect block code. By folding ext4_alloc_blocks() into
ext4_alloc_branch(), we make the code far more understable, shave off
over 130 lines of code and half a kilobyte of compiled object code.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
After collapsing the handling of data ordered and data writeback
codepath, ext4_generic_write_end() has only one caller,
ext4_write_end(). So we fold it into ext4_write_end().
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
The only difference between how we handle data=ordered and
data=writeback is a single call to ext4_jbd2_file_inode(). Eliminate
code duplication by factoring out redundant the code paths.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
When an extent was zeroed out, we forgot to do convert from cpu to le16.
It could make us hit a BUG_ON when we try to write dirty pages out. So
fix it.
[ Also fix a bug found by Dmitry Monakhov where we were missing
le32_to_cpu() calls in the new indirect punch hole code.
There are a number of other big endian warnings found by static code
analyzers, but we'll wait for the next merge window to fix them all
up. These fixes are designed to be Obviously Correct by code
inspection, and easy to demonstrate that it won't make any
difference (and hence, won't introduce any bugs) on little endian
architectures such as x86. --tytso ]
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reported-by: CAI Qian <caiqian@redhat.com>
Reported-by: Christian Kujau <lists@nerdbynature.de>
Cc: Dmitry Monakhov <dmonakhov@openvz.org>
This changes session destruction to be similar to client destruction in
that attempts to destroy a session while in use (which should be rare
corner cases) result in DELAY. This simplifies things somewhat and
helps meet a coming 4.2 requirement.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
When a setclientid_confirm or create_session confirms a client after a
client reboot, it also destroys any previous state held by that client.
The shutdown of that previous state must be careful not to free the
client out from under threads processing other requests that refer to
the client.
This is a particular problem in the NFSv4.1 case when we hold a
reference to a session (hence a client) throughout compound processing.
The server attempts to handle this by unhashing the client at the time
it's destroyed, then delaying the final free to the end. But this still
leaves some races in the current code.
I believe it's simpler just to fail the attempt to destroy the client by
returning NFS4ERR_DELAY. This is a case that should never happen
anyway.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The locking here is very fiddly, and there's no reason for us to be
setting cstate->session, since this is the only op in the compound.
Let's just take the state lock and drop the reference counting.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
destroy_session uses the session and client without continuously holding
any reference or locks.
Put the whole thing under the state lock for now.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
I'm not sure what the check for clientid expiry was meant to do here.
The check for a matching session is redundant given the previous check
for state: a client without state is, in particular, a client without
sessions.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
E.g. printk's that just report the return value from an op are
uninteresting as we already do that in the main proc_compound loop.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
This should never happen.
(Note: the comparable case in setclientid_confirm *can* happen, since
updating a client record can result in both confirmed and unconfirmed
records with the same clientid.)
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
NFS4_OO_PURGE_CLOSE is not handled properly. To avoid memory leak, nfs4
stateid which is pointed by oo_last_closed_stid is freed in nfsd4_close(),
but NFS4_OO_PURGE_CLOSE isn't cleared meanwhile. So the stateid released in
THIS close procedure may be freed immediately in the coming encoding function.
Sorry that Signed-off-by was forgotten in last version.
Signed-off-by: Yanchuan Nian <ycnian@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
lk_rflags is never used anywhere, and rflags is not defined in struct
nfsd4_lock.
Signed-off-by: Yanchuan Nian <ycnian@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Once we drop the lock here there's nothing keeping the client around:
the only lock still held is the xpt_lock on this socket, but this socket
no longer has any connection with the client so there's no way for other
code to know we're still using the client.
The solution is simple: all nfsd4_probe_callback does is set a few
variables and queue some work, so there's no reason we can't just keep
it under the lock.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
If a client sets an owner (or group_owner or acl) attribute on open for
create, and the mapping of that owner to an id fails, then we return
BAD_OWNER. But BAD_OWNER is a seqid-mutating error, so we can't
shortcut the open processing that case: we have to at least look up the
owner so we can find the seqid to bump.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
This BUG_ON just crashes the thread a little earlier than it would
otherwise--it doesn't seem useful.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
We've now increased the size of the duplicate reply cache by quite a
bit, but the number of hash buckets has not changed. So, we've gone from
an average hash chain length of 16 in the old code to 4096 when the
cache is its largest. Change the code to scale out the number of buckets
with the max size of the cache.
At the same time, we also need to fix the hash function since the
existing one isn't really suitable when there are more than 256 buckets.
Move instead to use the stock hash_32 function for this. Testing on a
machine that had 2048 buckets showed that this gave a smaller
longest:average ratio than the existing hash function:
The formula here is longest hash bucket searched divided by average
number of entries per bucket at the time that we saw that longest
bucket:
old hash: 68/(39258/2048) == 3.547404
hash_32: 45/(33773/2048) == 2.728807
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The typical case with the DRC is a cache miss, so if we keep track of
the max number of entries that we've ever walked over in a search, then
we should have a reasonable estimate of the longest hash chain that
we've ever seen.
With that, we'll also keep track of the total size of the cache when we
see the longest chain. In the case of a tie, we prefer to track the
smallest total cache size in order to properly gauge the worst-case
ratio of max vs. avg chain length.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
For presenting statistics relating to duplicate reply cache.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Break out the function that compares the rqstp and checksum against a
reply cache entry. While we're at it, track the efficacy of the checksum
over the NFS data by tracking the cases where we would have incorrectly
matched a DRC entry if we had not tracked it or the length.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The most common case is to do a search of the cache, followed by an
insert. In the case where we have to allocate an entry off the slab,
then we end up having to redo the search, which is wasteful.
Better optimize the code for the common case by eliminating the initial
search of the cache and always preallocating an entry. In the case of a
cache hit, we'll end up just freeing that entry but that's preferable to
an extra search.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
This patch reduces redundant spin_lock operations in alloc_nid_failed().
The alloc_nid_failed() does not need to delete entry and add one again
by triggering spin_lock and spin_unlock redundantly.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Commit - fa9150a84c - replaces a call to generic_writepages() in
f2fs_write_data_pages() with write_cache_pages(), with a function pointer
argument pointing to routine: __f2fs_writepage.
-> https://git.kernel.org/linus/fa9150a84ca333f68127097c4fa1eda4b3913a22
This patch adds a NULL pointer check in f2fs_write_data_pages() to avoid
a possible NULL pointer dereference, in case if - mapping->a_ops->writepage -
is NULL.
Signed-off-by: P J P <ppandit@redhat.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Like below, there are 8 segment bitmaps for SSR victim candidates.
enum dirty_type {
DIRTY_HOT_DATA, /* dirty segments assigned as hot data logs */
DIRTY_WARM_DATA, /* dirty segments assigned as warm data logs */
DIRTY_COLD_DATA, /* dirty segments assigned as cold data logs */
DIRTY_HOT_NODE, /* dirty segments assigned as hot node logs */
DIRTY_WARM_NODE, /* dirty segments assigned as warm node logs */
DIRTY_COLD_NODE, /* dirty segments assigned as cold node logs */
DIRTY, /* to count # of dirty segments */
PRE, /* to count # of entirely obsolete segments */
NR_DIRTY_TYPE
};
The upper 6 bitmaps indicates segments dirtied by active log areas respectively.
And, the DIRTY bitmap integrates all the 6 bitmaps.
For example,
o DIRTY_HOT_DATA : 1010000
o DIRTY_WARM_DATA: 0100000
o DIRTY_COLD_DATA: 0001000
o DIRTY_HOT_NODE : 0000010
o DIRTY_WARM_NODE: 0000001
o DIRTY_COLD_NODE: 0000000
In this case,
o DIRTY : 1111011,
which means that we should guarantee the consistency between DIRTY and other
bitmaps concreately.
However, the SSR mode selects victims freely from any log types, which can set
multiple bits across the various bitmap types.
So, this patch eliminates this inconsistency.
Reviewed-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
In order to do GC more reliably, I'd like to lock the vicitm summary page
until its GC is completed, and also prevent any checkpoint process.
Reviewed-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch adds a new condition that allocates free segments in the current
active section even if SSR is needed.
Otherwise, f2fs cannot allocate remained free segments in the section since
SSR finds dirty segments only.
Reviewed-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
The foreground GCs are triggered under not enough free sections.
So, we should not skip moving valid blocks in the victim segments.
Reviewed-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch removes a bitmap for victim segments selected by foreground GC, and
modifies the other bitmap for victim segments selected by background GC.
1) foreground GC bitmap
: We don't need to manage this, since we just only one previous victim section
number instead of the whole victim history.
The f2fs uses the victim section number in order not to allocate currently
GC'ed section to current active logs.
2) background GC bitmap
: This bitmap is used to avoid selecting victims repeatedly by background GCs.
In addition, the victims are able to be selected by foreground GCs, since
there is no need to read victim blocks during foreground GCs.
By the fact that the foreground GC reclaims segments in a section unit, it'd
be better to manage this bitmap based on the section granularity.
Reviewed-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
When allocating a new segment under the LFS mode, we should keep the section
boundary.
Reviewed-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
In get_node_page, we do not need to call lock_page all the time.
If the node page is cached as uptodate,
1. grab_cache_page locks the page,
2. read_node_page unlocks the page, and
3. lock_page is called for further process.
Let's avoid this.
Reviewed-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Let's use a macro to get the total number of sections.
Reviewed-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
A macro should not use duplicate parameter names.
Reviewed-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Tejun writes:
-----
This is the pull request for the earlier patchset[1] with the same
name. It's only three patches (the first one was committed to
workqueue tree) but the merge strategy is a bit involved due to the
dependencies.
* Because the conversion needs features from wq/for-3.10,
block/for-3.10/core is based on rc3, and wq/for-3.10 has conflicts
with rc3, I pulled mainline (rc5) into wq/for-3.10 to prevent those
workqueue conflicts from flaring up in block tree.
* Resolving the issue that Jan and Dave raised about debugging
requires arch-wide changes. The patchset is being worked on[2] but
it'll have to go through -mm after these changes show up in -next,
and not included in this pull request.
The three commits are located in the following git branch.
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git writeback-workqueue
Pulling it into block/for-3.10/core produces a conflict in
drivers/md/raid5.c between the following two commits.
e3620a3ad5 ("MD RAID5: Avoid accessing gendisk or queue structs when not available")
2f6db2a707 ("raid5: use bio_reset()")
The conflict is trivial - one removes an "if ()" conditional while the
other removes "rbi->bi_next = NULL" right above it. We just need to
remove both. The merged branch is available at
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git block-test-merge
so that you can use it for verification. The test merge commit has
proper merge description.
While these changes are a bit of pain to route, they make code simpler
and even have, while minute, measureable performance gain[3] even on a
workload which isn't particularly favorable to showing the benefits of
this conversion.
----
Fixed up the conflict.
Conflicts:
drivers/md/raid5.c
Signed-off-by: Jens Axboe <axboe@kernel.dk>