If a commit is triggered by fsync(), set a flag indicating the journal
blocks associated with the transaction should be flushed out using
WRITE_SYNC.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
ext2_quota_read() doesn't initialize tmp_bh.b_size before calling
ext2_get_block() where we access it. Since it is a local variable it
might contain some garbage. Make sure it is filled with reasonable
value before passing.
Signed-off-by: Manish Katiyar <mkatiyar@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Matt LaPlante <kernel1@cyberdogtech.com>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Jan Kara <jack@suse.cz>
Use lowercase names of quota functions instead of old uppercase ones.
CC: bfields@fieldses.org
CC: neilb@suse.de
Signed-off-by: Jan Kara <jack@suse.cz>
Use lowercase names of quota functions instead of old uppercase ones.
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Dave Kleikamp <shaggy@austin.ibm.com>
Use lowercase names of quota functions instead of old uppercase ones.
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Mingming Cao <cmm@us.ibm.com>
CC: linux-ext4@vger.kernel.org
Use lowercase names of quota functions instead of old uppercase ones.
Signed-off-by: Jan Kara <jack@suse.cz>
CC: Alexander Viro <viro@zeniv.linux.org.uk>
Remove bogus typedef which is just a definition of char *.
Remove unnecessary type casts.
Substitute freedqbuf() with kfree.
Signed-off-by: Jan Kara <jack@suse.cz>
Andrew Morton has suggested that three global quota locks can end up in the
same cacheline which can result in bad cacheline ping-pong on SMP machines.
Make locks cacheline aligned so that we avoid this problem (thanks goes to
Andrew for the idea).
Signed-off-by: Jan Kara <jack@suse.cz>
CC: Andrew Morton <akpm@linux-foundation.org>
Uses quota reservation/claim/release to handle quota properly for delayed
allocation in the three steps: 1) quotas are reserved when data being copied
to cache when block allocation is defered 2) when new blocks are allocated.
reserved quotas are converted to the real allocated quota, 2) over-booked
quotas for metadata blocks are released back.
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
According to checkpatch: EXPORT_SYMBOL(foo); should immediately follow its
function/variable
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reserved quota will be claimed at the block allocation time. Over-booked
quota could be returned back with the release callback function.
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Delayed allocation defers the block allocation at the dirty pages
flush-out time, doing quota charge/check at that time is too late.
But we can't charge the quota blocks until blocks are really allocated,
otherwise users could get overcharged after reboot from system crash.
This patch adds quota reservation for delayed allocation. Quota blocks
are reserved in memory, inode and quota won't gets dirtied until later
block allocation time.
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Jan Kara <jack@suse.cz>
btrfs_update_delayed_ref is optimized to add and remove different
references in one pass through the delayed ref tree. It is a zero
sum on the total number of refs on a given extent.
But, the code was recording an extra ref in the head node. This
never made it down to the disk but was used when deciding if it was
safe to free the extent while dropping snapshots.
The fix used here is to make sure the ref_mod count is unchanged
on the head ref when btrfs_update_delayed_ref is called.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Commit 86c9508eb1c0ce5aa07b5cf1d36b60c54efc3d7a
"sysfs: don't block indefinitely for unmapped files" in linux-next
crashes the PowerMac G5 when X starts up. It's caught out by the way
powerpc's pci_mmap of legacy_mem uses shmem_zero_setup(), substituting
a new vma->vm_file whose private_data no longer points to the bin_buffer
(substitution done because some versions of X crash if that mmap fails).
The fix to this is straightforward: the original vm_file is fput() in
that case, so this mmap won't block sysfs at all, so just don't switch
over to bin_vm_ops if vm_file has changed.
But more fixes made before realizing that was the problem:-
It should not be an error if bin_page_mkwrite() finds no underlying
page_mkwrite().
Check that a file already mmap'ed has the same underlying vm_ops
_before_ pointing vma->vm_ops at bin_vm_ops.
If the file being mmap'ed is a shmem/tmpfs file, don't fail the mmap
on CONFIG_NUMA=y, just because that has a set_policy and get_policy:
provide bin_set_policy, bin_get_policy and bin_migrate.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Acked-by: Eric Biederman <ebiederm@aristanetworks.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The only way for a sysfs attribute to remove itself (without
deadlock) is to use the sysfs_schedule_callback() interface.
Vegard Nossum discovered that a poorly written sysfs ->store
callback can repeatedly schedule remove callbacks on the same
device over and over, e.g.
$ while true ; do echo 1 > /sys/devices/.../remove ; done
If the 'remove' attribute uses the sysfs_schedule_callback API
and also does not protect itself from concurrent accesses, its
callback handler will be called multiple times, and will
eventually attempt to perform operations on a freed kobject,
leading to many problems.
Instead of requiring all callers of sysfs_schedule_callback to
implement their own synchronization, provide the protection in
the infrastructure.
Now, sysfs_schedule_callback will only allow one scheduled
callback per kobject. On subsequent calls with the same kobject,
return -EAGAIN.
This is a short term fix. The long term fix is to allow sysfs
attributes to remove themselves directly, without any of this
callback hokey pokey.
[cornelia.huck@de.ibm.com: s390 ccwgroup bits]
Reported-by: vegard.nossum@gmail.com
Signed-off-by: Alex Chiang <achiang@hp.com>
Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This patch implements uevent suppress in kobject and removes it
from struct device, based on the following ideas:
1,Uevent sending should be one attribute of kobject, so suppressing it
in kobject layer is more natural than in device layer. By this way,
we can do it for other objects embedded with kobject.
2,It may save several bytes for each instance of struct device.(On my
omap3(32bit ARM) based box, can save 8bytes per device object)
This patch also introduces dev_set|get_uevent_suppress() helpers to
set and query uevent_suppress attribute in case to help kobject
as private part of struct device in future.
[This version is against the latest driver-core patch set of Greg,please
ignore the last version.]
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Modify sysfs bin files so that we can remove the bin file while they are
still mapped. When the kobject is removed we unmap the bin file and
arrange for future accesses to the mapping to receive SIGBUS.
Implementing this prevents a nasty DOS when pci devices are hot plugged
and unplugged. Where if any of their resources were mmaped the kernel
could not free up their pci resources or release their pci data
structures.
[akpm@linux-foundation.org: remove unused var]
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The sysfs_dirent serves as both an inode and a directory entry
for sysfs. To prevent the sysfs inode numbers from being freed
prematurely hold a reference to sysfs_dirent from the sysfs inode.
[akpm@linux-foundation.org: add comment]
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
sysfs: sysfs_add_one WARNs with full path to duplicate filename
As a debugging aid, it can be useful to know the full path to a
duplicate file being created in sysfs.
We now will display warnings such as:
sysfs: cannot create duplicate filename '/foo'
when attempting to create multiple files named 'foo' in the sysfs
root, or:
sysfs: cannot create duplicate filename '/bus/pci/slots/5/foo'
when attempting to create multiple files named 'foo' under a
given directory in sysfs.
The path displayed is always a relative path to sysfs_root. The
leading '/' in the path name refers to the sysfs_root mount
point, and should not be confused with the "real" '/'.
Thanks to Alex Williamson for essentially writing sysfs_pathname.
Cc: Alex Williamson <alex.williamson@hp.com>
Signed-off-by: Alex Chiang <achiang@hp.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
sysfs_get_inode ultimately calls sysfs_count_nlink when the a
directory inode is fectched. sysfs_count_nlink needs to be
called under the sysfs_mutex to guard against the unlikely
but possible scenario that the root directory is changing
as we are counting the number entries in it, and just in
general to be consistent.
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
SYSFS_MAGIC has been added into magic.h, so only use that definition
in magic.h to avoid potential consistency problem.
Signed-off-by: Qinghuang Feng <qhfeng.kernel@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The fsync log has code to make sure all of the parents of a file are in the
log along with the file. It uses a minimal log of the parent directory
inodes, just enough to get the parent directory on disk.
If the transaction that originally created a file is fully on disk,
and the file hasn't been renamed or linked into other directories, we
can safely skip the parent directory walk. We know the file is on disk
somewhere and we can go ahead and just log that single file.
This is more important now because unrelated unlinks in the parent directory
might make us force a commit if we try to log the parent.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The tree logging code allows individual files or directories to be logged
without including operations on other files and directories in the FS.
It tries to commit the minimal set of changes to disk in order to
fsync the single file or directory that was sent to fsync or O_SYNC.
The tree logging code was allowing files and directories to be unlinked
if they were part of a rename operation where only one directory
in the rename was in the fsync log. This patch adds a few new rules
to the tree logging.
1) on rename or unlink, if the inode being unlinked isn't in the fsync
log, we must force a full commit before doing an fsync of the directory
where the unlink was done. The commit isn't done during the unlink,
but it is forced the next time we try to log the parent directory.
Solution: record transid of last unlink/rename per directory when the
directory wasn't already logged. For renames this is only done when
renaming to a different directory.
mkdir foo/some_dir
normal commit
rename foo/some_dir foo2/some_dir
mkdir foo/some_dir
fsync foo/some_dir/some_file
The fsync above will unlink the original some_dir without recording
it in its new location (foo2). After a crash, some_dir will be gone
unless the fsync of some_file forces a full commit
2) we must log any new names for any file or dir that is in the fsync
log. This way we make sure not to lose files that are unlinked during
the same transaction.
2a) we must log any new names for any file or dir during rename
when the directory they are being removed from was logged.
2a is actually the more important variant. Without the extra logging
a crash might unlink the old name without recreating the new one
3) after a crash, we must go through any directories with a link count
of zero and redo the rm -rf
mkdir f1/foo
normal commit
rm -rf f1/foo
fsync(f1)
The directory f1 was fully removed from the FS, but fsync was never
called on f1, only its parent dir. After a crash the rm -rf must
be replayed. This must be able to recurse down the entire
directory tree. The inode link count fixup code takes care of the
ugly details.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
During log replay, inodes are copied from the log to the main filesystem
btrees. Sometimes they have a zero link count in the log but they actually
gain links during the replay or have some in the main btree.
This patch updates the link count to be at least one after copying the
inode out of the log. This makes sure the inode is deleted during an
iput while the rest of the replay code is still working on it.
The log replay has fixup code to make sure that link counts are correct
at the end of the replay, so we could use any non-zero number here and
it would work fine.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The delayed reference mechanism is responsible for all updates to the
extent allocation trees, including those updates created while processing
the delayed references.
This commit tries to limit the amount of work that gets created during
the final run of delayed refs before a commit. It avoids cowing new blocks
unless it is required to finish the commit, and so it avoids new allocations
that were not really required.
The goal is to avoid infinite loops where we are always making more work
on the final run of delayed refs. Over the long term we'll make a
special log for the last delayed ref updates as well.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This reads in blocks in the checksum btree before starting the
transaction in btrfs_finish_ordered_io. It makes it much more likely
we'll be able to do operations inside the transaction without
needing any btree reads, which limits transaction latencies overall.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_mark_buffer dirty would set dirty bits in the extent_io tree
for the buffers it was dirtying. This may require a kmalloc and it
was not atomic. So, anyone who called btrfs_mark_buffer_dirty had to
set any btree locks they were holding to blocking first.
This commit changes dirty tracking for extent buffers to just use a flag
in the extent buffer. Now that we have one and only one extent buffer
per page, this can be safely done without losing dirty bits along the way.
This also introduces a path->leave_spinning flag that callers of
btrfs_search_slot can use to indicate they will properly deal with a
path returned where all the locks are spinning instead of blocking.
Many of the btree search callers now expect spinning paths,
resulting in better btree concurrency overall.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Commits are fairly expensive, and so btrfs has code to sit around for a while
during the commit and let new writers come in.
But, while we're sitting there, new delayed refs might be added, and those
can be expensive to process as well. Unless the transaction is very very
young, it makes sense to go ahead and let the commit finish without hanging
around.
The commit grow loop isn't as important as it used to be, the fsync logging
code handles most performance critical syncs now.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This reduces contention on the extent buffer spin locks by testing for a
blocking lock before trying to take the spinlock.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The fs/btrfs/inode.c code to run delayed allocation during writout
needed some stack usage optimization. This is the first pass, it does
the check for compression earlier on, which allows us to do the common
(no compression) case higher up in the call chain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
To avoid deadlocks and reduce latencies during some critical operations, some
transaction writers are allowed to jump into the running transaction and make
it run a little longer, while others sit around and wait for the commit to
finish.
This is a bit unfair, especially when the callers that jump in do a bunch
of IO that makes all the others procs on the box wait. This commit
reduces the stalls this produces by pre-reading file extent pointers
during btrfs_finish_ordered_io before the transaction is joined.
It also tunes the drop_snapshot code to politely wait for transactions
that have started writing out their delayed refs to finish. This avoids
new delayed refs being flooded into the queue while we're trying to
close off the transaction.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The delayed reference queue maintains pending operations that need to
be done to the extent allocation tree. These are processed by
finding records in the tree that are not currently being processed one at
a time.
This is slow because it uses lots of time searching through the rbtree
and because it creates lock contention on the extent allocation tree
when lots of different procs are running delayed refs at the same time.
This commit changes things to grab a cluster of refs for processing,
using a cursor into the rbtree as the starting point of the next search.
This way we walk smoothly through the rbtree.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When extents are freed, it is likely that we've removed the last
delayed reference update for the extent. This checks the delayed
ref tree when things are freed, and if no ref updates area left it
immediately processes the delayed ref.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Many of the tree balancing functions follow the same pattern.
1) cow a block
2) do something to the result
This commit breaks them up into two functions so the variables and
code required for part two don't suck down stack during part one.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The extent allocation tree maintains a reference count and full
back reference information for every extent allocated in the
filesystem. For subvolume and snapshot trees, every time
a block goes through COW, the new copy of the block adds a reference
on every block it points to.
If a btree node points to 150 leaves, then the COW code needs to go
and add backrefs on 150 different extents, which might be spread all
over the extent allocation tree.
These updates currently happen during btrfs_cow_block, and most COWs
happen during btrfs_search_slot. btrfs_search_slot has locks held
on both the parent and the node we are COWing, and so we really want
to avoid IO during the COW if we can.
This commit adds an rbtree of pending reference count updates and extent
allocations. The tree is ordered by byte number of the extent and byte number
of the parent for the back reference. The tree allows us to:
1) Modify back references in something close to disk order, reducing seeks
2) Significantly reduce the number of modifications made as block pointers
are balanced around
3) Do all of the extent insertion and back reference modifications outside
of the performance critical btrfs_search_slot code.
#3 has the added benefit of greatly reducing the btrfs stack footprint.
The extent allocation tree modifications are done without the deep
(and somewhat recursive) call chains used in the past.
These delayed back reference updates must be done before the transaction
commits, and so the rbtree is tied to the transaction. Throttling is
implemented to help keep the queue of backrefs at a reasonable size.
Since there was a similar mechanism in place for the extent tree
extents, that is removed and replaced by the delayed reference tree.
Yan Zheng <yan.zheng@oracle.com> helped review and fixup this code.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
In order to avoid doing expensive extent management with tree locks held,
btrfs_search_slot will preallocate tree blocks for use by COW without
any tree locks held.
A later commit moves all of the extent allocation work for COW into
a delayed update mechanism, and this preallocation will no longer be
required.
Signed-off-by: Chris Mason <chris.mason@oracle.com>