This is based on Josef's "Btrfs: turbo charge fsync".
The current btrfs checks if an inode is in log by comparing
root's last_log_commit to inode's last_sub_trans[2].
But the problem is that this root->last_log_commit is shared among
inodes.
Say we have N inodes to be logged, after the first inode,
root's last_log_commit is updated and the N-1 remained files will
be skipped.
This fixes the bug by keeping a local copy of root's last_log_commit
inside each inode and this local copy will be maintained itself.
[1]: we regard each log transaction as a subset of btrfs's transaction,
i.e. sub_trans
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
This is based on Josef's "Btrfs: turbo charge fsync".
The above Josef's patch performs very good in random sync write test,
because we won't have too much extents to merge.
However, it does not performs good on the test:
dd if=/dev/zero of=foobar bs=4k count=12500 oflag=sync
The reason is when we do sequencial sync write, we need to merge the
current extent just with the previous one, so that we can get accumulated
extents to log:
A(4k) --> AA(8k) --> AAA(12k) --> AAAA(16k) ...
So we'll have to flush more and more checksum into log tree, which is the
bottleneck according to my tests.
But we can avoid this by telling fsync the real extents that are needed
to be logged.
With this, I did the above dd sync write test (size=50m),
w/o (orig) w/ (josef's) w/ (this)
SATA 104KB/s 109KB/s 121KB/s
ramdisk 1.5MB/s 1.5MB/s 10.7MB/s (613%)
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
We will stop and restart a transaction every time we move to a different leaf
when truncating a file. This is for enospc reasons, but really we could
probably get away with doing this a little better by actually working until we
hit an ENOSPC. So add a ->failfast flag to the block_rsv and set it when we do
truncates which will fail as soon as the block rsv runs out of space, and then
at that point we can stop and restart the transaction and refill the block rsv
and carry on. This will make rm'ing of a file with lots of extents a bit
faster. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
This is based on Josef's "Btrfs: turbo charge fsync".
We should cleanup those extents after we've finished logging inode,
otherwise we may do redundant work on them.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
I hit this a couple times while working on my fsync patch (all my bugs, not
normal operation), but with my new stuff we could have new errors from cases
I have not encountered, so instead of BUG()'ing we should be WARN()'ing so
that we are notified there is a problem but the user doesn't lose their
data. We can easily commit the transaction in the case that the tree
logging fails and still be fine, so let's try and be as nice to the user as
possible. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
At least for the vm workload. Currently on fsync we will
1) Truncate all items in the log tree for the given inode if they exist
and
2) Copy all items for a given inode into the log
The problem with this is that for things like VMs you can have lots of
extents from the fragmented writing behavior, and worst yet you may have
only modified a few extents, not the entire thing. This patch fixes this
problem by tracking which transid modified our extent, and then when we do
the tree logging we find all of the extents we've modified in our current
transaction, sort them and commit them. We also only truncate up to the
xattrs of the inode and copy that stuff in normally, and then just drop any
extents in the range we have that exist in the log already. Here are some
numbers of a 50 meg fio job that does random writes and fsync()s after every
write
Original Patched
SATA drive 82KB/s 140KB/s
Fusion drive 431KB/s 2532KB/s
So around 2-6 times faster depending on your hardware. There are a few
corner cases, for example if you truncate at all we have to do it the old
way since there is no way to be sure what is in the log is ok. This
probably could be done smarter, but if you write-fsync-truncate-write-fsync
you deserve what you get. All this work is in RAM of course so if your
inode gets evicted from cache and you read it in and fsync it we'll do it
the slow way if we are still in the same transaction that we last modified
the inode in.
The biggest cool part of this is that it requires no changes to the recovery
code, so if you fsync with this patch and crash and load an old kernel, it
will run the recovery and be a-ok. I have tested this pretty thoroughly
with an fsync tester and everything comes back fine, as well as xfstests.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
While working on my fsync patch my fsync tester kept hitting mismatching
md5sums when I would randomly write to a prealloc'ed region, syncfs() and
then write to the prealloced region some more and then fsync() and then
immediately reboot. This is because the tree logging code will skip writing
csums for file extents who's generation is less than the current running
transaction. When we mark extents as written we haven't been updating their
generation so they were always being skipped. This wouldn't happen if you
were to preallocate and then write in the same transaction, but if you for
example prealloced a VM you could definitely run into this problem. This
patch makes my fsync tester happy again. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Swinging this pendulum back the other way. We've been allocating chunks up
to 2% of the disk no matter how much we actually have allocated. So instead
fix this calculation to only allocate chunks if we have more than 80% of the
space available allocated. Please test this as it will likely cause all
sorts of ENOSPC problems to pop up suddenly. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
There is a completely impossible situation to hit where you can preallocate
a file, fsync it, write into the preallocated region, have the transaction
commit twice and then fsync and then immediately lose power and lose all of
the contents of the write. This patch fixes this just so I feel better
about the situation and because it is lightweight, we just update the
last_trans when we finish an ordered IO and we don't update the inode
itself. This way we are completely safe and I feel better. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The btrfs send code was assuming the offset of the file item into the
extent translated to bytes on disk. If we're compressed, this isn't
true, and so it was off into extents owned by other files.
It was also improperly handling inline extents. This solves a crash
where we may have gone past the end of the file extent item by not
testing early enough for an inline extent. It also solves problems
where we have a whole between the end of the inline item and the start
of the full extent.
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We can't do the deleted/reused logic for top/root inodes as it would
create a stream that tries to delete and recreate the root dir.
Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com>
Signed-off-by: Alexander Block <ablock84@googlemail.com>
We have to ignore inode/space cache objects in send/receive.
Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com>
Signed-off-by: Alexander Block <ablock84@googlemail.com>
We need to pass the root that we determined earlier to iterate_inode_ref.
Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com>
Signed-off-by: Alexander Block <ablock84@googlemail.com>
The previous check was working fine, but this check should be
easier to read. Also, we could theoritically have some exotic
bugs with the previous checks.
Signed-off-by: Alexander Block <ablock84@googlemail.com>
A leftover from older code and unused now.
Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com>
Signed-off-by: Alexander Block <ablock84@googlemail.com>
Updating send_progress in process_recorded_refs was not correct.
It got updated too early in the cur_inode_new_gen case.
Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com>
Reported-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Alexander Block <ablock84@googlemail.com>
Btrfs send/receive uses the aux field to store inode numbers. On
32 bit machines this may become a problem.
Also fix all users of ulist_add and ulist_add_merged.
Reported-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Alexander Block <ablock84@googlemail.com>
We can't easily use the index of the radix tree for inums as the
radix tree uses 32bit indexes on 32bit kernels. For 32bit kernels,
we now use the lower 32bit of the inum as index and an additional
list to store multiple entries per radix tree entry.
Reported-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Alexander Block <ablock84@googlemail.com>
When everything is done, name_cache_free is called which however
forgot to call kfree on the cache entries.
Signed-off-by: Alexander Block <ablock84@googlemail.com>
If we break, we may miss the clone from send_root which we prefer
over all other clones.
Commit is a result of Arne's review.
Reported-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Alexander Block <ablock84@googlemail.com>
Don't have a seperate return path for the mentioned case. Now
we do the same "take lowest inode/offset" logic for all found clones.
Commit is a result of Arne's review.
Signed-off-by: Alexander Block <ablock84@googlemail.com>
Make sure to never get in trouble due to the backref_ctx
which was on the stack before.
Commit is a result of Arne's review.
Signed-off-by: Alexander Block <ablock84@googlemail.com>
We only added the parent for the new position of a moved dir.
We also need to add the old parent of the moved dir.
Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com>
Signed-off-by: Alexander Block <ablock84@googlemail.com>
fs_path_remove is not used at the moment due to a previous patch.
Remove it for now (with #if 0) to avoid compile warnings.
Signed-off-by: Alexander Block <ablock84@googlemail.com>
We missed that check which resultet in all refs with the same name
being reported as first_ref.
Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com>
Signed-off-by: Alexander Block <ablock84@googlemail.com>
When the current inodes inum is smaller then the inum of the
parent directory strange things were happending due to wrong
path resolution and other bugs. Fix this with a new approach
for the problem.
Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com>
Signed-off-by: Alexander Block <ablock84@googlemail.com>
Here is the big driver core update for 3.7-rc1.
A number of firmware_class.c updates (as you saw a month or so ago), and
some hyper-v updates and some printk fixes as well. All patches that
are outside of the drivers/base area have been acked by the respective
maintainers, and have all been in the linux-next tree for a while.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)
iEYEABECAAYFAlBp3vkACgkQMUfUDdst+ylQoACgldktGFgkCLzH+rGYthrXOC5P
9hUAnjmOhdoHlMTL81vWTlH+BrGernym
=khrr
-----END PGP SIGNATURE-----
Merge tag 'driver-core-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core merge from Greg Kroah-Hartman:
"Here is the big driver core update for 3.7-rc1.
A number of firmware_class.c updates (as you saw a month or so ago),
and some hyper-v updates and some printk fixes as well. All patches
that are outside of the drivers/base area have been acked by the
respective maintainers, and have all been in the linux-next tree for a
while.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>"
* tag 'driver-core-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (95 commits)
memory: tegra{20,30}-mc: Fix reading incorrect register in mc_readl()
device.h: Add missing inline to #ifndef CONFIG_PRINTK dev_vprintk_emit
memory: emif: Add ifdef CONFIG_DEBUG_FS guard for emif_debugfs_[init|exit]
Documentation: Fixes some translation error in Documentation/zh_CN/gpio.txt
Documentation: Remove 3 byte redundant code at the head of the Documentation/zh_CN/arm/booting
Documentation: Chinese translation of Documentation/video4linux/omap3isp.txt
device and dynamic_debug: Use dev_vprintk_emit and dev_printk_emit
dev: Add dev_vprintk_emit and dev_printk_emit
netdev_printk/netif_printk: Remove a superfluous logging colon
netdev_printk/dynamic_netdev_dbg: Directly call printk_emit
dev_dbg/dynamic_debug: Update to use printk_emit, optimize stack
driver-core: Shut up dev_dbg_reatelimited() without DEBUG
tools/hv: Parse /etc/os-release
tools/hv: Check for read/write errors
tools/hv: Fix exit() error code
tools/hv: Fix file handle leak
Tools: hv: Implement the KVP verb - KVP_OP_GET_IP_INFO
Tools: hv: Rename the function kvp_get_ip_address()
Tools: hv: Implement the KVP verb - KVP_OP_SET_IP_INFO
Tools: hv: Add an example script to configure an interface
...
Features currently supported:
- 39-bit address space for user and kernel (each)
- 4KB and 64KB page configurations
- Compat (32-bit) user applications (ARMv7, EABI only)
- Flattened Device Tree (mandated for all AArch64 platforms)
- ARM generic timers
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
iQIcBAABAgAGBQJQabRiAAoJEGvWsS0AyF7xXgcQAK+FTXt0ikdQYMkV5AIZXb9i
xHRhuiZWx2vKyk0mCqpyGLY58GSmSb6uTBg/2P2Ej7vXdH/RB2goPzjlspfjkDL4
o8RJp7eQ07Uz3KRDYEJgMP8xKZid6KFG93RJ6TjjpKZLuDBdwiG1GP1vb0jVcWfo
ttZrj/aI8lMcqrh3Vq5qefP7GWP1OVATqeaGTiT7oo38pXwF3t237xfBr2iDGFBp
ZgIRddrxpa7JYUesfJDDDdGHvLq7Vh2jJV+io9qasBZDrtppGJIhZ0vUni2DgIi7
r4i1LcynDN4JaG0maZ4U/YQm74TCD4BqxV8GJ7zwLPTWeN+of+skjhPSLOkA+0fp
I+sWjXlv200gDfJZ9qnUld2kFpoDfJi2b7fNDouSDd2OhmVOVWG3jnVP4Z7meVSb
O8BYzWDdsAiabuwciUY3OsmW6424lT93b2v86Vncs4unKMvEjOPxYZbUxhqX8f2j
gsmWwwD/yS4THx2B6OyW9VT3I5J6miqs2Glt/GG6vPWT5AKQJn9jCxKaBGhPMPIs
xe5/GycBYjdk/Y8qRjegxFbEqzQuiRzmkeFn5jwjmBLqpGNbZDpvMaL6adhAKM5/
v6UIKa91ra4fC9N0h6G61pOc9N9DbT8wPbCbdYY0RMTMRuLDZDgAM3Bvz0r2APdD
96leNy6vx684hbkCSLJs
=buJB
-----END PGP SIGNATURE-----
Merge tag 'arm64-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux-aarch64
Pull arm64 support from Catalin Marinas:
"Linux support for the 64-bit ARM architecture (AArch64)
Features currently supported:
- 39-bit address space for user and kernel (each)
- 4KB and 64KB page configurations
- Compat (32-bit) user applications (ARMv7, EABI only)
- Flattened Device Tree (mandated for all AArch64 platforms)
- ARM generic timers"
* tag 'arm64-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux-aarch64: (35 commits)
arm64: ptrace: remove obsolete ptrace request numbers from user headers
arm64: Do not set the SMP/nAMP processor bit
arm64: MAINTAINERS update
arm64: Build infrastructure
arm64: Miscellaneous header files
arm64: Generic timers support
arm64: Loadable modules
arm64: Miscellaneous library functions
arm64: Performance counters support
arm64: Add support for /proc/sys/debug/exception-trace
arm64: Debugging support
arm64: Floating point and SIMD
arm64: 32-bit (compat) applications support
arm64: User access library functions
arm64: Signal handling support
arm64: VDSO support
arm64: System calls handling
arm64: ELF definitions
arm64: SMP support
arm64: DMA mapping API
...
make menuconfig for cifs shows multiple entries toward
the end of the list with the incorrect indentation
(probably a bug in Kconfig parsing of items
that are dependant on the module (cifs=m instead of
just CONFIG_CIFS). This patch fixes the indentation
of all but the last entry (CIFS_ACL) which I don't
know how to fix. It also clarifies wording in
two places
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Based on whether the user (on mount command) chooses:
vers=3.0 (for smb3.0 support)
vers=2.1 (for smb2.1 support)
or (with subsequent patch, which will allow SMB2 support)
vers=2.0 (for original smb2.02 dialect support)
send only one dialect at a time during negotiate (we
had been sending a list).
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Pull the trivial tree from Jiri Kosina:
"Tiny usual fixes all over the place"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (34 commits)
doc: fix old config name of kprobetrace
fs/fs-writeback.c: cleanup riteback_sb_inodes kerneldoc
btrfs: fix the commment for the action flags in delayed-ref.h
btrfs: fix trivial typo for the comment of BTRFS_FREE_INO_OBJECTID
vfs: fix kerneldoc for generic_fh_to_parent()
treewide: fix comment/printk/variable typos
ipr: fix small coding style issues
doc: fix broken utf8 encoding
nfs: comment fix
platform/x86: fix asus_laptop.wled_type module parameter
mfd: printk/comment fixes
doc: getdelays.c: remember to close() socket on error in create_nl_socket()
doc: aliasing-test: close fd on write error
mmc: fix comment typos
dma: fix comments
spi: fix comment/printk typos in spi
Coccinelle: fix typo in memdup_user.cocci
tmiofb: missing NULL pointer checks
tools: perf: Fix typo in tools/perf
tools/testing: fix comment / output typos
...
Pull GFS2 updates from Steven Whitehouse:
"The major feature this time is the "rbm" conversion in the resource
group code. The new struct gfs2_rbm specifies the location of an
allocatable block in (resource group, bitmap, offset) form. There are
a number of added helper functions, and later patches then rewrite
some of the resource group code in terms of this new structure. Not
only does this give us a nice code clean up, but it also removes some
of the previous restrictions where extents could not cross bitmap
boundaries, for example.
In addition to that, there are a few bug fixes and clean ups, but the
rbm work is by far the majority of this patch set in terms of number
of changed lines."
* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw: (27 commits)
GFS2: Write out dirty inode metadata in delayed deletes
GFS2: fix s_writers.counter imbalance in gfs2_ail_empty_gl
GFS2: Fix infinite loop in rbm_find
GFS2: Consolidate free block searching functions
GFS2: Get rid of I_MUTEX_QUOTA usage
GFS2: Stop block extents at the end of bitmaps
GFS2: Fix unclaimed_blocks() wrapping bug and clean up
GFS2: Improve block reservation tracing
GFS2: Fall back to ignoring reservations, if there are no other blocks left
GFS2: Fix ->show_options() for statfs slow
GFS2: Use rbm for gfs2_setbit()
GFS2: Use rbm for gfs2_testbit()
GFS2: Eliminate unnecessary check for state > 3 in bitfit
GFS2: Eliminate redundant calls to may_grant
GFS2: Combine functions gfs2_glock_dq_wait and wait_on_demote
GFS2: Combine functions gfs2_glock_wait and wait_on_holder
GFS2: inline __gfs2_glock_schedule_for_reclaim
GFS2: change function gfs2_direct_IO to use a normal gfs2_glock_dq
GFS2: rbm code cleanup
GFS2: Fix case where reservation finished at end of rgrp
...
Commits 5e8830dc85 and 41c4d25f78 introduced a regression into
v3.6-rc1 for ext4 in nodealloc mode, such that mtime updates would not
take place for files modified via mmap if the page was already in the
page cache. This would also affect ext3 file systems mounted using
the ext4 file system driver.
The problem was that ext4_page_mkwrite() had a shortcut which would
avoid calling __block_page_mkwrite() under some circumstances, and the
above two commit transferred the responsibility of calling
file_update_time() to __block_page_mkwrite --- which woudln't get
called in some circumstances.
Since __block_page_mkwrite() only has three callers,
block_page_mkwrite(), ext4_page_mkwrite, and nilfs_page_mkwrite(), the
best way to solve this is to move the responsibility for calling
file_update_time() to its caller.
This problem was found via xfstests #215 with a file system mounted
with -o nodelalloc.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
Cc: stable@vger.kernel.org
Inode is allowed to have empty leaf only if it this is blockless inode.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
punch_hole is the place where we have to wait for all existing writers
(writeback, aio, dio), but currently we simply flush pended end_io request
which is not sufficient. Other issue is that punch_hole performed w/o i_mutex
held which obviously result in dangerous data corruption due to
write-after-free.
This patch performs following changes:
- Guard punch_hole with i_mutex
- Recheck inode flags under i_mutex
- Block all new dio readers in order to prevent information leak caused by
read-after-free pattern.
- punch_hole now wait for all writers in flight
NOTE: XXX write-after-free race is still possible because new dirty pages
may appear due to mmap(), and currently there is no easy way to stop
writeback while punch_hole is in progress.
[ Fixed error return from ext4_ext_punch_hole() to make sure that we
release i_mutex before returning EPERM or ETXTBUSY -- Ted ]
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Selected by __ARCH_WANT_SYS_EXECVE in unistd.h. Requires
* working current_pt_regs()
* *NOT* doing a syscall-in-kernel kind of kernel_execve()
implementation. Using generic kernel_execve() is fine.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
based mostly on arm and alpha versions. Architectures can define
__ARCH_WANT_KERNEL_EXECVE and use it, provided that
* they have working current_pt_regs(), even for kernel threads.
* kernel_thread-spawned threads do have space for pt_regs
in the normal location. Normally that's as simple as switching to
generic kernel_thread() and making sure that kernel threads do *not*
go through return from syscall path; call the payload from equivalent
of ret_from_fork if we are in a kernel thread (or just have separate
ret_from_kernel_thread and make copy_thread() use it instead of
ret_from_fork in kernel thread case).
* they have ret_from_kernel_execve(); it is called after
successful do_execve() done by kernel_execve() and gets normal
pt_regs location passed to it as argument. It's essentially
a longjmp() analog - it should set sp, etc. to the situation
expected at the return for syscall and go there. Eventually
the need for that sucker will disappear, but that'll take some
surgery on kernel_thread() payloads.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
IBM reported a deadlock in select_parent(). This was found to be caused
by taking rename_lock when already locked when restarting the tree
traversal.
There are two cases when the traversal needs to be restarted:
1) concurrent d_move(); this can only happen when not already locked,
since taking rename_lock protects against concurrent d_move().
2) racing with final d_put() on child just at the moment of ascending
to parent; rename_lock doesn't protect against this rare race, so it
can happen when already locked.
Because of case 2, we need to be able to handle restarting the traversal
when rename_lock is already held. This patch fixes all three callers of
try_to_ascend().
IBM reported that the deadlock is gone with this patch.
[ I rewrote the patch to be smaller and just do the "goto again" if the
lock was already held, but credit goes to Miklos for the real work.
- Linus ]
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
JFFS2 was designed without thought for OOB bitflips, it seems, but they
can occur and will be reported to JFFS2 via mtd_read_oob()[1]. We don't
want to fail on these transactions, since the data was corrected.
[1] Few drivers report bitflips for OOB-only transactions. With such
drivers, this patch should have no effect.
Signed-off-by: Brian Norris <computersforpeace@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
This patch fixes regression introduced by
"8bdc81c jffs2: get rid of jffs2_sync_super". We submit a delayed work in order
to make sure the write-buffer is synchronized at some point. But we do not
flush it when we unmount, which causes an oops when we unmount the file-system
and then the delayed work is executed.
This patch fixes the issue by adding a "cancel_delayed_work_sync()" infocation
in the '->sync_fs()' handler. This will make sure the delayed work is canceled
on sync, unmount and re-mount. And because VFS always callse 'sync_fs()' before
unmounting or remounting, this fixes the issue.
Reported-by: Ludovic Desroches <ludovic.desroches@atmel.com>
Cc: stable@vger.kernel.org [3.5+]
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Tested-by: Ludovic Desroches <ludovic.desroches@atmel.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Jan Kara have spotted interesting issue:
There are potential data corruption issue with direct IO overwrites
racing with truncate:
Like:
dio write truncate_task
->ext4_ext_direct_IO
->overwrite == 1
->down_read(&EXT4_I(inode)->i_data_sem);
->mutex_unlock(&inode->i_mutex);
->ext4_setattr()
->inode_dio_wait()
->truncate_setsize()
->ext4_truncate()
->down_write(&EXT4_I(inode)->i_data_sem);
->__blockdev_direct_IO
->ext4_get_block
->submit_io()
->up_read(&EXT4_I(inode)->i_data_sem);
# truncate data blocks, allocate them to
# other inode - bad stuff happens because
# dio is still in flight.
In order to serialize with truncate dio worker should grab extra i_dio_count
reference before drop i_mutex.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>