When mounting a btrfs file system btrfs_test_super() may attempt to
use sb->s_fs_info, the btrfs root, of a super block that is going away
and that has had the btrfs root set to NULL in its ->put_super(). But
if the super block is going away it cannot be an existing super block
so we can return false in this case.
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Currently we fail xfstest 236 because we're not updating the inode ctime on
link. This is a simple fix, and makes it so we pass 236 now.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We have been failing xfstest 228 forever, because we don't check to make sure
the new inode size is acceptable as far as RLIMIT is concerned. Just check to
make sure it's ok to create a inode with this new size and error out if not.
With this patch we now pass 228.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
There is a typo in __btrfs_prealloc_file_range() where we set the i_size to
actual_len/cur_offset, and then just set it to cur_offset again, and do the same
with btrfs_ordered_update_i_size(). This fixes it back to keeping i_size in a
local variable and then updating i_size properly. Tested this with
xfs_io -F -f -c "falloc 0 1" -c "pwrite 0 1" foo
stat'ing foo gives us a size of 1 instead of 4096 like it was. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
NFS: Ensure we return the dirent->d_type when it is known
NFS: Correct the array bound calculation in nfs_readdir_add_to_array
NFS: Don't ignore errors from nfs_do_filldir()
NFS: Fix the error handling in "uncached_readdir()"
NFS: Fix a page leak in uncached_readdir()
NFS: Fix a page leak in nfs_do_filldir()
NFS: Assume eof if the server returns no readdir records
NFS: Buffer overflow in ->decode_dirent() should not be fatal
Pure nfs client performance using odirect.
SUNRPC: Fix an infinite loop in call_refresh/call_refreshresult
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
cciss: fix build for PROC_FS disabled
block: fix amiga and atari floppy driver compile warning
blk-throttle: Fix calculation of max number of WRITES to be dispatched
ioprio: grab rcu_read_lock in sys_ioprio_{set,get}()
xen/blkfront: cope with backend that fail empty BLKIF_OP_WRITE_BARRIER requests
xen/blkfront: Implement FUA with BLKIF_OP_WRITE_BARRIER
xen/blkfront: change blk_shadow.request to proper pointer
xen/blkfront: map REQ_FLUSH into a full barrier
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2:
nilfs2: fix typo in comment of nilfs_dat_move function
nilfs2: nilfs_iget_for_gc() returns ERR_PTR
reiserfs_unpack() locks the inode mutex with reiserfs_mutex_lock_safe()
to protect against reiserfs lock dependency. However this protection
requires to have the reiserfs lock to be locked.
This is the case if reiserfs_unpack() is called by reiserfs_ioctl but
not from reiserfs_quota_on() when it tries to unpack tails of quota
files.
Fix the ordering of the two locks in reiserfs_unpack() to fix this
issue.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reported-by: Markus Gapp <markus.gapp@gmx.net>
Reported-by: Jan Kara <jack@suse.cz>
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: <stable@kernel.org> [2.6.36.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently one pagemap_read() call walks in PAGEMAP_WALK_SIZE bytes (== 512
pages.) But there is a corner case where walk_pmd_range() accidentally
runs over a VMA associated with a hugetlbfs file.
For example, when a process has mappings to VMAs as shown below:
# cat /proc/<pid>/maps
...
3a58f6d000-3a58f72000 rw-p 00000000 00:00 0
7fbd51853000-7fbd51855000 rw-p 00000000 00:00 0
7fbd5186c000-7fbd5186e000 rw-p 00000000 00:00 0
7fbd51a00000-7fbd51c00000 rw-s 00000000 00:12 8614 /hugepages/test
then pagemap_read() goes into walk_pmd_range() path and walks in the range
0x7fbd51853000-0x7fbd51a53000, but the hugetlbfs VMA should be handled by
walk_hugetlb_range(). Otherwise PMD for the hugepage is considered bad
and cleared, which causes undesirable results.
This patch fixes it by separating pagemap walk range into one PMD.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The attribute cache for a file was not being cleared when a file is opened
with O_TRUNC.
If the filesystem's open operation truncates the file ("atomic_o_trunc"
feature flag is set) then the kernel should invalidate the cached st_mtime
and st_ctime attributes.
Also i_size should be explicitly be set to zero as it is used sometimes
without refreshing the cache.
Signed-off-by: Ken Sumrall <ksumrall@android.com>
Cc: Anfei <anfei.zhou@gmail.com>
Cc: "Anand V. Avati" <avati@gluster.com>
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Flush the disk cache in fsync and sync to make sure data actually is
on disk on completion of these system calls. There is a nobarrier
mount option to disable this behaviour. It's slightly misnamed now
that barrier actually are gone, but it matches the name used by all
major filesystems.
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
Avoid doing unessecary work in fsync. Do nothing unless the inode
was marked dirty, and only write the various metadata inodes out if
they contain any dirty state from this inode. This is archived by
adding three new dirty bits to the hfsplus-specific inode which are
set in the correct places.
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
Split the flags field in the hfsplus inode into an extent_state
flag that is locked by the extent_lock, and a new flags field
that uses atomic bitops. The second will grow more flags in the
next patch.
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
fsync is supposed to not just work on regular files, but also on
directories. Fortunately enough hfsplus_file_fsync works just fine
for directories, so we can just wire it up.
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
Remove lots of code we don't need from fsync, we just need to call
->write_inode on the inode if it's dirty, for which sync_inode_metadata
is a lot more efficient than write_inode_now, and we need to write
out the various metadata inodes, which we now do explicitly instead
of by calling ->sync_fs.
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
There is no reason to write out the metadata inodes or volume headers
during a non-blocking sync, as we are almost guaranteed to dirty them
again during the inode writeouts.
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
hfsplus stores all metadata except for the volume headers in special
inodes. While these are marked hashed and periodically written out
by the flusher threads, we can't rely on that for sync. For the case
of a data integrity sync the VM has life-lock avoidance code that
avoids writing inodes again that are redirtied during the sync,
which is something that can happen easily for hfsplus. So make sure
we explicitly write out the metadata inodes at the beginning of
hfsplus_sync_fs.
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
Switch the hfsplus partition table reding for cdroms to use our bio
helpers. Again we don't rely on any caching in the buffer_heads, and
this gets rid of the last buffer_head use in hfsplus.
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
The hfsplus backup volume header is located two blocks from the end of
the device. In case of device sizes that are not 4k aligned this means
we can't access it using buffer_heads when using the default 4k block
size.
Switch to using raw bios to read/write all buffer headers. We were not
relying on any caching behaviour of the buffer heads anyway. Additionally
always read in the backup volume header during mount to verify that we
can actually read it.
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
Remove opencoded writing of the volume header in hfsplus_fill_super
and hfsplus_put_super and offload it to hfsplus_sync_fs. In the
put_super case this means we only write the superblock once instead
of twice.
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
Turn a few noisy debug printks that show up during xfstests into
complied out debug print statements.
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
nilfs_iget_for_gc() returns an ERR_PTR() on failure and doesn't return
NULL.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Store the dirent->d_type in the struct nfs_cache_array_entry so that we
can use it in getdents() calls.
This fixes a regression with the new readdir code.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
It looks as if the array size calculation in MAX_READDIR_ARRAY does not
take the alignment of struct nfs_cache_array_entry into account.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We should ignore the errors from the filldir callback, and just interpret
them as meaning we should exit, however we should definitely pass back
ENOMEM errors.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Currently, uncached_readdir() is broken because if fails to handle
the results from nfs_readdir_xdr_to_array() correctly.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
nfs_do_filldir() must always free desc->page when it is done, otherwise
we end up leaking the page.
Also remove unused variable 'dentry'.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Overflowing the buffer in the readdir ->decode_dirent() should not lead to
a fatal error, but rather to an attempt to reread the record in question.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
When an application opens a file with O_DIRECT flag, if the size of
the data that is written is equal to wsize, the client sends a
WRITE RPC with stable flag set to UNSTABLE followed by a single
COMMIT RPC rather than sending a single WRITE RPC with the stable
flag set to FILE_SYNC. This a bug.
Patch to fix this.
Signed-off-by: Arun R Bharadwaj <arun@linux.vnet.ibm.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Everybody who calls btrfs_add_nondir just passes in the dentry of the new file
and then dereference dentry->d_parent->d_inode, but everybody who calls
btrfs_add_nondir() are already passed the parent's inode. So instead of
dereferencing dentry->d_parent, just make btrfs_add_nondir take the dir inode as
an argument and pass that along so we don't have to worry about d_parent.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Since we walk up the path logging all of the parts of the inode's path, we need
to hold i_mutex to make sure that the inode is not renamed while we're logging
everything. btrfs_log_dentry_safe does dget_parent and all of that jazz, but we
may get unexpected results if the rename changes the inode's location while
we're higher up the path logging those dentries, so do this for safety reasons.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
There are lots of places where we do dentry->d_parent->d_inode without holding
the dentry->d_lock. This could cause problems with rename. So instead we need
to use dget_parent() and hold the reference to the parent as long as we are
going to use it's inode and then dput it at the end.
Signed-off-by: Josef Bacik <josef@redhat.com>
Cc: raven@themaw.net
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When creating new inodes we don't setup inode->i_generation. So if we generate
an fh with a newly created inode we save the generation of 0, but if we flush
the inode to disk and have to read it back when getting the inode on the server
we'll have the right i_generation, so gens wont match and we get ESTALE. This
patch properly sets inode->i_generation when we create the new inode and now I'm
no longer getting ESTALE. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
People kept reporting NFS issues, specifically getting ESTALE alot. I figured
out how to reproduce the problem
SERVER
mkfs.btrfs /dev/sda1
mount /dev/sda1 /mnt/btrfs-test
<add /mnt/btrfs-test to /etc/exports>
btrfs subvol create /mnt/btrfs-test/foo
service nfs start
CLIENT
mount server:/mnt/btrfs /mnt/test
cd /mnt/test/foo
ls
SERVER
echo 3 > /proc/sys/vm/drop_caches
CLIENT
ls <-- get an ESTALE here
This is because the standard way to lookup a name in nfsd is to use readdir, and
what it does is do a readdir on the parent directory looking for the inode of
the child. So in this case the parent being / and the child being foo. Well
subvols all have the same inode number, so doing a readdir of / looking for
inode 256 will return '.', which obviously doesn't match foo. So instead we
need to have our own .get_name so that we can find the right name.
Our .get_name will either lookup the inode backref or the root backref,
whichever we're looking for, and return the name we find. Running the above
reproducer with this patch results in everything acting the way its supposed to.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Set src_offset = 0, src_length = 20K, dest_offset = 20K. And the
original filesize of the dest file 'file2' is 30K:
# ls -l /mnt/file2
-rw-r--r-- 1 root root 30720 Nov 18 16:42 /mnt/file2
Now clone file1 to file2, the dest file should be 40K, but it
still shows 30K:
# ls -l /mnt/file2
-rw-r--r-- 1 root root 30720 Nov 18 16:42 /mnt/file2
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We've done the check for src_offset and src_length, and We should
also check dest_offset, otherwise we'll corrupt the destination
file:
(After cloning file1 to file2 with unaligned dest_offset)
# cat /mnt/file2
cat: /mnt/file2: Input/output error
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When I added the clear_cache option I screwed up and took the break out of
the space_cache case statement, so whenever you mount with space_cache you also
get clear_cache, which does you no good if you say set space_cache in fstab so
it always gets set. This patch adds the break back in properly.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
'unused' calculated with wrong sign in reserve_metadata_bytes().
This might have lead to unwanted over-reservations.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
extent_bio_alloc() and compressed_bio_alloc() are similar, cleanup
similar source code.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
bio_endio() will free dip and dip->csums, so dip and dip->csums twice will
be freed twice. Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Migrate page will directly call the btrfs btree writepage function,
which isn't actually allowed.
Our writepage assumes that you have locked the extent_buffer and
flagged the block as written. Without doing these steps, we can
corrupt metadata blocks.
A later commit will remove the btree writepage function since
it is really only safely used internally by btrfs. We
use writepages for everything else.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: Add EXT4_IOC_TRIM ioctl to handle batched discard
fs: Do not dispatch FITRIM through separate super_operation
ext4: ext4_fill_super shouldn't return 0 on corruption
jbd2: fix /proc/fs/jbd2/<dev> when using an external journal
ext4: missing unlock in ext4_clear_request_list()
ext4: fix setting random pages PageUptodate
Filesystem independent ioctl was rejected as not common enough to be in
core vfs ioctl. Since we still need to access to this functionality this
commit adds ext4 specific ioctl EXT4_IOC_TRIM to dispatch
ext4_trim_fs().
It takes fstrim_range structure as an argument. fstrim_range is definec in
the include/linux/fs.h and its definition is as follows.
struct fstrim_range {
__u64 start;
__u64 len;
__u64 minlen;
}
start - first Byte to trim
len - number of Bytes to trim from start
minlen - minimum extent length to trim, free extents shorter than this
number of Bytes will be ignored. This will be rounded up to fs
block size.
After the FITRIM is done, the number of actually discarded Bytes is stored
in fstrim_range.len to give the user better insight on how much storage
space has been really released for wear-leveling.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>