It will be used in suspend code and serves as an easy wrap around
copy_from_user. Similar to simple_read_from_buffer, it takes care
of transfers with proper lengths depending on available and count
parameters and advances ppos appropriately.
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Once file or link creation gets going, it can't be interrupted by a
signal. They're not idempotent.
This blocks signals in ocfs2_mknod(), ocfs2_link(), and ocfs2_symlink()
once we start actually changing things. ocfs2_mknod() covers mknod(),
creat(), mkdir(), and open(O_CREAT).
Signed-off-by: Joel Becker <joel.becker@oracle.com>
ocfs2 sometimes needs to block signals around dlm operations, but it
currently does it with sigprocmask(). Even worse, it's checking the
error code of sigprocmask(). The in-kernel sigprocmask() can only error
if you get the SIG_* argument wrong. We don't.
Wrap the sigprocmask() calls with ocfs2_[un]block_signals(). These
functions are void, but they will BUG() if somehow sigprocmask() returns
an error.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
The comments make it clear the otherwise subtle behavior of cifs_new_fileinfo().
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Reviewed-by: Shirish Pargaonkar <shirishp@us.ibm.com>
--
fs/cifs/dir.c | 18 ++++++++++++++++--
1 files changed, 16 insertions(+), 2 deletions(-)
Signed-off-by: Steve French <sfrench@us.ibm.com>
After commit 1f36f774b2 ("Switch !O_CREAT case to use of do_last()") in
2.6.34-rc1 autofs direct mounts stopped working. This is caused by
current->link_count being 0 when ->follow_link() is called from
do_filp_open().
I can't work out why this hasn't been seen before Als patch series.
This patch removes the autofs dependence on current->link_count.
Signed-off-by: Ian Kent <raven@themaw.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
..a left over from the commit 3321b791b2.
Cc: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
This is the upstream fix for this bug. This patch differs
from the RHEL5 fix (Red Hat bz #555754) which simply writes to the 8-byte
value field of the quota. In upstream quota code, we're
required to write the entire quota (88 bytes) which can be split
across a page boundary. We check for such quotas, and read/write
the two parts from/to the corresponding pages holding these parts.
With this patch, I don't see the bug anymore using the reproducer
in Red Hat bz 555754. I successfully ran a couple of simple tests/mounts/
umounts and it doesn't seem like this patch breaks anything else.
Signed-off-by: Abhi Das <adas@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Snapshots and regular ro/rw mounts are essentially-different within
the meaning whether the checkpoint is static or not and is marked with
a snapshot flag or not.
The current implemenation, however, allows to remount a snapshot to a
regular rw-mount if the checkpoint number equals the latest one.
This transition is actually impossible since changing a checkpoint to
a snapshot makes another checkpoint, thus the condition is never
satisfied.
This fixes the weird state of affairs, and specifically separates
snapshots and regular rw/ro-mounts.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
This replaces uses of new_encode_dev/new_decode_dev with their 64-bit
counterparts, huge_encode_dev/huge_decode_dev respectively.
This is just for clarification and has no impact on the disk format.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
deactivate_super was replaced with deactivate_locked_super, but the
comment of nilfs_get_sb remain unchanged. This renews the comment.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
MS_VERBOSE is deprecated. This replaces it with MS_SILENT in
reference to get_sb_bdev function.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
An fmode_t argument is passed to kill_block_super() through s_mode
member of the super_block structure. This is used to release the
block device with the same mode, however, nilfs does not set s_mode
anywhere.
This modifies nilfs_get_sb function to properly initialize the s_mode
member.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
The second argument of open_bdev_exclusive/close_bdev_exclusive takes
fmode_t flags instead of mount flags. This fixes the misuse.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Nilfs maintains two super blocks, and selects the new one on mount
time if they both have valid checksums and their timestamps differ.
However, this has potential for mis-selection since the system clock
may be rewinded and the resolution of the timestamps is not high.
Usually this doesn't become an issue because both super blocks are
updated at the same time when the file system is unmounted. Even if
the file system wasn't unmounted cleanly, the roll-forward recovery
will find the proper log which stores the latest super root. Thus,
the issue can appear only if update of one super block fails and the
clock happens to be rewinded.
This fixes the issue by using checkpoint numbers instead of timestamps
to pick the super block storing the location of the latest log.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
This adds missing endian conversions in comparision of the magic
number of super blocks. It was coincidence that prior versions didn't
incur problems; the upper byte of the magic number happened to be
equal to the lower byte. But, semantically it's wrong to depend on
this.
This won't change anything else nor suffer any compatibility issues.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
This kills the following sparse warnings:
fs/nilfs2/segment.c:567:28: warning: symbol 'nilfs_sc_file_ops' was not declared. Should it be static?
fs/nilfs2/segment.c:617:28: warning: symbol 'nilfs_sc_dat_ops' was not declared. Should it be static?
fs/nilfs2/segment.c:625:28: warning: symbol 'nilfs_sc_dsync_ops' was not declared. Should it be static?
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
The implementation of persistent object allocator (alloc.c) is poorly
documented. This adds kernel doc style comments on that functions.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
In nilfs_segctor_thread(), timer is a local variable allocated on stack. Its
address can't be set to sci->sc_timer and passed in several procedures.
It works now by chance, just because other procedures are called by
nilfs_segctor_thread() directly or indirectly and the stack hasn't been
deallocated yet.
Signed-off-by: Li Hong <lihong.hi@gmail.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
There are only two lines of code in nilfs_segctor_init(). From a logic
design view, the first line 'sci->sc_seq_done = sci->sc_seq_request;'
should be put in nilfs_segctor_new(). Even in nilfs_segctor_new(),
this initialization is needless because sci is kzalloc-ed. So
nilfs_segctor_init() is only a wrap call to
nilfs_segctor_start_thread().
Signed-off-by: Li Hong <lihong.hi@gmail.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
This adds a field to record the latest checkpoint number in the
nilfs_segment_summary structure. This will help to recover the latest
checkpoint number from logs on disk. This field is intended for
crucial cases in which super blocks have lost pointer to the latest
log.
Even though this will change the disk format, both backward and
forward compatibility is preserved by a size field prepared in the
segment summary header.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Printing a message after loading a file system is a practice. Add this to
provide a better user-friendly experience.
Signed-off-by: Li Hong <lihong.hi@gmail.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
This cleanup patch gives several improvements:
- Moving all kmem_cache_{create_destroy} calls into one place, which removes
some small function calls, cleans up error check code and clarify the logic.
- Mark all initial code in __init section.
- Remove some very obvious comments.
- Adjust some declarations.
- Fix some space-tab issues.
Signed-off-by: Li Hong <lihong.hi@gmail.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
This moves a pointer to buffer storing super root block to each log
buffer from nilfs_sc_info struct for simplicity.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Like ext3, nilfs has 'errors' mount option to allow specifying desired
behavior on severe errors.
Currently, the default action is 'errors=continue' and has potential
to advance filesystem corruption for severe errors.
This will change the action to 'errors=remount-ro' to avoid the issue.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
nilfs_btree_release_path() and nilfs_btree_free_path() are bound into each other
tightly. Make them into one procedure to clearify the logic and avoid some
misusages.
Signed-off-by: Li Hong <lihong.hi@gmail.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
nilfs_btree_alloc_path() and nilfs_btree_init_path() are bound into each other
tightly. Make them into one procedure to clearify the logic and avoid some
misusages.
Signed-off-by: Li Hong <lihong.hi@gmail.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
It's legal to send a DESTROY_SESSION outside any session (as the only
operation in a compound), in which case cstate->session will be NULL;
check for that case.
While we're at it, move these checks into a separate helper function.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
NFS: Fix RCU issues in the NFSv4 delegation code
NFSv4: Fix the locking in nfs_inode_reclaim_delegation()
The write buffer may not have been written and may no longer be written
due to an interrupted write in the affected page.
Signed-off-by: Joern Engel <joern@logfs.org>
The get_mtd_device() function returns error pointers on failure and if we
don't handle it, it leads to a crash.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Joern Engel <joern@logfs.org>
The following patch adds a message to indicate when barriers have been
disabled due to a block device which doesn't support them. You could
already tell this via the mount options in /proc/mounts, but all the
other filesystems also log a message at the same time.
Also, the same mechanisms are used to indicate when the lock
demote interface has been used (only ever used for debugging)
which is a request from our support team.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The ->writepages writeback_control is not still valid in the writepages
completion. We were touching it solely to adjust pages_skipped when there
was a writeback error (EIO, ENOSPC, EPERM due to bad osd credentials),
causing an oops in the writeback code shortly thereafter. Updating
pages_skipped on error isn't correct anyway, so let's just rip out this
(clearly broken) code to pass the wbc to the completion.
Signed-off-by: Sage Weil <sage@newdream.net>
Lockres hash size of 16KB is far too small for large filesystems (where we
have hundreds of thousands of lock resources stored in the table).
This patch increases it to 128KB.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
In ocfs2, we use ocfs2_extend_trans() to extend a journal handle's
blocks. But if jbd2_journal_extend() fails, it will only restart
with the the new number of blocks. This tends to be awkward since
in most cases we want additional reserved blocks. It makes our code
harder to mantain since the caller can't be sure all the original
blocks will not be accessed and dirtied again. There are 15 callers
of ocfs2_extend_trans() in fs/ocfs2, and 12 of them have to add
h_buffer_credits before they call ocfs2_extend_trans(). This makes
ocfs2_extend_trans() really extend atop the original block count.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Two tiny cleanup for allocation reservation.
1. Remove some extra codes in ocfs2_local_alloc_find_clear_bits.
2. Remove an unuseful variables in ocfs2_find_resv_lhs.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Acked-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
When we allocate some bits from the reservation, we always
allocate from the r_start(see ocfs2_resmap_resv_bits).
So there should be no reason to check between r_start
and start. And I don't think we will change this behaviour
later by allocating from some bits after r_start. Why not make
ocfs2_adjust_resv_from_alloc simple for now?
The only chance we have to adjust the reservation is when we haven't
reached the end. With this patch, the function is more readable.
Note:
btw, this patch also fixes an original bug in the function
which I haven't found before.
if (end < ocfs2_resv_end(resv))
rhs = end - ocfs2_resv_end(resv);
This code is of course buggy. ;)
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Acked-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
OCFS2 has never really supported intr. This patch acknowledges this reality
and makes nointr the default mount option. In a later patch, we intend to
support intr.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
o2dlm join and leave messages are more than informational as they are
required for debugging locking issues. This patch changes them from
KERN_INFO to KERN_NOTICE.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
This patch logs socket state changes that lead to socket shutdown.
Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Print the node number of a peer node if sending it a message failed.
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
The default behavior for directory reservations stays the same, but we add a
mount option so people can tweak the size of directory reservations
according to their workloads.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
The default reservation size of 4 (32-bit windows) is a bit too ambitious.
Scale it back to 16 bits (resv_level=2). I have been testing various sizes
on a 4-node cluster which runs a mixed workload that is heavily threaded.
With a 256MB local alloc, I get *roughly* the following levels of average file
fragmentation:
resv_level=0 70%
resv_level=1 21%
resv_level=2 23%
resv_level=3 24%
resv_level=4 60%
resv_level=5 did not test
resv_level=6 60%
resv_level=2 seemed like a good compromise between not letting windows be
too small, but not so big that heavier workloads will immediately suffer
without tuning.
This patch also change the behavior of directory reservations - they now
track file reservations. The previous compromise of giving directory
windows only 8 bits wound up fragmenting more at some window sizes because
file allocations had smaller unused windows to poach from.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
I have observed that the current size of 8M gives us pretty poor
fragmentation on multi-threaded workloads which do lots of writes.
Generally, I can increase the size of local alloc windows and observe a
marked decrease in fragmentation, even up and beyond window sizes of 512
megabytes. This makes sense for a couple reasons - larger local alloc means
more room for reservation windows. On multi-node workloads the larger local
alloc helps as well because we don't have to do window slides as often.
Also, I removed the OCFS2_DEFAULT_LOCAL_ALLOC_SIZE constant as it is no
longer used and the comment above it was out of date.
To test fragmentation, I used a workload which launched 4 threads that did
4k writes into a series of about 140 alternating files.
With resv_level=2, and a 4k/4k file system I observed the following average
fragmentation for various localalloc= parameters:
localalloc= avg. fragmentation
8 48
32 16
64 10
120 7
On larger cluster sizes, the difference is more dramatic.
The new default size top out at 256M, which we'll only get for cluster
sizes of 32K and above.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
This patch pulls the local alloc sizing code into localalloc.c and provides
a callout to it from ocfs2_fill_super(). Behavior is essentially unchanged
except that I correctly calculate the maximum local alloc size. The old code
in ocfs2_parse_options() calculated the max size as:
ocfs2_local_alloc_size(sb) * 8
which is correct, in bits. Unfortunately though the option passed in is in
megabytes. Ultimately, this bug made no real difference - the shrink code
would catch a too-large size and bring it down to something reasonable.
Still, it's less than efficient as-is.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Inodes are always allocated from the global bitmap now so we don't need this
any more. Also, the existing implementation bounces reservations around
needlessly.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Otherwise, the need for a very large contiguous allocation tends to
wreak havoc on many inode allocation reservations on the local alloc, thus
ruining any chances for contiguousness.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>