Commit graph

17944 commits

Author SHA1 Message Date
Ryusuke Konishi
cdce214e39 nilfs2: use huge_encode_dev/huge_decode_dev
This replaces uses of new_encode_dev/new_decode_dev with their 64-bit
counterparts, huge_encode_dev/huge_decode_dev respectively.

This is just for clarification and has no impact on the disk format.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-05-10 11:32:34 +09:00
Ryusuke Konishi
b87ca91948 nilfs2: update comment on deactivate_super at nilfs_get_sb
deactivate_super was replaced with deactivate_locked_super, but the
comment of nilfs_get_sb remained unchanged.  This updates the comment.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-05-10 11:32:34 +09:00
Ryusuke Konishi
e2d1591a13 nilfs2: replace MS_VERBOSE with MS_SILENT
MS_VERBOSE is deprecated.  This replaces it with MS_SILENT in
reference to get_sb_bdev function.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-05-10 11:32:34 +09:00
Ryusuke Konishi
4571b82cdc nilfs2: add missing initialization of s_mode
An fmode_t argument is passed to kill_block_super() through the s_mode
member of the super_block structure.  This is used to release the
block device with the same mode; however, nilfs does not set s_mode
anywhere.

This modifies nilfs_get_sb function to properly initialize the s_mode
member.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-05-10 11:32:33 +09:00
Ryusuke Konishi
13e905592b nilfs2: fix misuse of open_bdev_exclusive/close_bdev_exclusive
The second argument of open_bdev_exclusive/close_bdev_exclusive takes
fmode_t flags instead of mount flags.  This fixes the misuse.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-05-10 11:32:33 +09:00
Ryusuke Konishi
25294d8c37 nilfs2: use checkpoint number instead of timestamp to select super block
Nilfs maintains two super blocks, and selects the newer one at mount
time if they both have valid checksums and their timestamps differ.

However, this has potential for mis-selection since the system clock
may be set back and the resolution of the timestamps is not high.

Usually this doesn't become an issue because both super blocks are
updated at the same time when the file system is unmounted.  Even if
the file system wasn't unmounted cleanly, the roll-forward recovery
will find the proper log which stores the latest super root.  Thus,
the issue can appear only if the update of one super block fails and
the clock happens to be set back.

This fixes the issue by using checkpoint numbers instead of timestamps
to pick the super block storing the location of the latest log.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-05-10 11:32:32 +09:00
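The selection rule described in the checkpoint-number commit above can be sketched roughly as follows. This is only an illustration under stated assumptions: `struct demo_sb_copy`, its fields, and `demo_select_super_block()` are hypothetical names, not the nilfs2 on-disk layout or kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of one super block copy. */
struct demo_sb_copy {
	bool     valid;     /* checksum verified */
	uint64_t last_cno;  /* latest checkpoint number recorded in this copy */
	uint64_t wtime;     /* write timestamp (seconds); not used for selection */
};

/*
 * Return the index (0 or 1) of the copy to mount from, or -1 if neither
 * copy is valid.  When both are valid, the copy with the larger checkpoint
 * number wins, so a clock that was set back cannot cause a mis-selection.
 */
static int demo_select_super_block(const struct demo_sb_copy sb[2])
{
	assert(sb != NULL);

	if (!sb[0].valid && !sb[1].valid)
		return -1;
	if (!sb[0].valid)
		return 1;
	if (!sb[1].valid)
		return 0;
	return sb[1].last_cno > sb[0].last_cno ? 1 : 0;
}
```

Because checkpoint numbers only increase, comparing them gives a total order that does not depend on wall-clock time or timestamp resolution.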
Ryusuke Konishi
34cb9b5c97 nilfs2: add missing endian conversion on super block magic number
This adds missing endian conversions in comparison of the magic
number of super blocks.  It was coincidence that prior versions didn't
incur problems; the upper byte of the magic number happened to be
equal to the lower byte.  But, semantically it's wrong to depend on
this.

This won't change anything else nor suffer any compatibility issues.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-05-10 11:32:32 +09:00
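A small userspace sketch of why the missing conversion described in the magic-number commit above went unnoticed. The helper names are hypothetical, and the value 0x3434 is assumed here as an example of a magic number whose two bytes are equal; the kernel itself would use its own byte-order helpers rather than these functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed example magic whose upper and lower bytes are identical. */
#define DEMO_MAGIC 0x3434u

/* Decode a 16-bit little-endian on-disk field, independent of host order. */
static uint16_t demo_le16_to_cpu(const unsigned char raw[2])
{
	assert(raw != NULL);
	return (uint16_t)(raw[0] | ((unsigned int)raw[1] << 8));
}

/* Swap the two bytes of a 16-bit value. */
static uint16_t demo_swab16(uint16_t v)
{
	return (uint16_t)((v << 8) | (v >> 8));
}

/* Semantically correct check: convert first, then compare. */
static int demo_magic_matches(const unsigned char raw[2])
{
	return demo_le16_to_cpu(raw) == DEMO_MAGIC;
}
```

A byte-symmetric value is unchanged by a byte swap, so a comparison without conversion gives the same answer on either endianness; any other magic value would have exposed the bug on big-endian hosts.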
Ryusuke Konishi
4e819509cb nilfs2: make nilfs_sc_*_ops static
This kills the following sparse warnings:

fs/nilfs2/segment.c:567:28: warning: symbol 'nilfs_sc_file_ops' was not declared. Should it be static?
fs/nilfs2/segment.c:617:28: warning: symbol 'nilfs_sc_dat_ops' was not declared. Should it be static?
fs/nilfs2/segment.c:625:28: warning: symbol 'nilfs_sc_dsync_ops' was not declared. Should it be static?

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-05-10 11:32:31 +09:00
Ryusuke Konishi
db55d92252 nilfs2: add kernel doc comments to persistent object allocator functions
The implementation of the persistent object allocator (alloc.c) is poorly
documented.  This adds kernel doc style comments to its functions.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-05-10 11:32:31 +09:00
Li Hong
fdce895ea5 nilfs2: change sc_timer from a pointer to an embedded one in struct nilfs_sc_info
In nilfs_segctor_thread(), timer is a local variable allocated on the stack.
Its address shouldn't be stored in sci->sc_timer and passed to other procedures.

It works now by chance, just because other procedures are called by
nilfs_segctor_thread() directly or indirectly and the stack hasn't been
deallocated yet.

Signed-off-by: Li Hong <lihong.hi@gmail.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-05-10 11:32:31 +09:00
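The idea behind the sc_timer commit above, embedding the timer in the owning structure instead of pointing at a stack variable, might look like this in outline. The types and helpers are hypothetical stand-ins, not struct nilfs_sc_info or the kernel timer API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a kernel timer object. */
struct demo_timer {
	unsigned long expires;
};

/*
 * Embedded member: the timer lives exactly as long as the enclosing
 * object, so its address stays valid after any one function returns.
 */
struct demo_sc_info {
	struct demo_timer sc_timer;
};

static void demo_arm_timer(struct demo_sc_info *sci, unsigned long expires)
{
	assert(sci != NULL);
	sci->sc_timer.expires = expires;
}

static unsigned long demo_timer_expires(const struct demo_sc_info *sci)
{
	assert(sci != NULL);
	return sci->sc_timer.expires;
}
```

With a pointer member aimed at a local variable, the same access would only be safe while the function owning that local is still on the call stack.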
Li Hong
154ac5a830 nilfs2: remove nilfs_segctor_init() in segment.c
There are only two lines of code in nilfs_segctor_init(). From a logic
design view, the first line 'sci->sc_seq_done = sci->sc_seq_request;'
should be put in nilfs_segctor_new(). Even in nilfs_segctor_new(),
this initialization is needless because sci is kzalloc-ed. So
nilfs_segctor_init() is only a wrapper around
nilfs_segctor_start_thread().

Signed-off-by: Li Hong <lihong.hi@gmail.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-05-10 11:32:31 +09:00
Ryusuke Konishi
50614bcf29 nilfs2: insert checkpoint number in segment summary header
This adds a field to record the latest checkpoint number in the
nilfs_segment_summary structure.  This will help to recover the latest
checkpoint number from logs on disk.  This field is intended for
crucial cases in which super blocks have lost the pointer to the latest
log.

Even though this will change the disk format, both backward and
forward compatibility is preserved by a size field prepared in the
segment summary header.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-05-10 11:32:31 +09:00
Li Hong
9f130263f3 nilfs2: add a print message after loading nilfs2
Printing a message after loading a file system is common practice. Add this
to provide a more user-friendly experience.

Signed-off-by: Li Hong <lihong.hi@gmail.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-05-10 11:32:30 +09:00
Li Hong
41c88bd74d nilfs2: cleanup multi kmem_cache_{create,destroy} code
This cleanup patch gives several improvements:

 - Moving all kmem_cache_{create,destroy} calls into one place, which removes
 some small function calls, cleans up error check code and clarifies the logic.

 - Mark all initial code in __init section.

 - Remove some very obvious comments.

 - Adjust some declarations.

 - Fix some space-tab issues.

Signed-off-by: Li Hong <lihong.hi@gmail.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-05-10 11:32:30 +09:00
Ryusuke Konishi
aaed1d5bfa nilfs2: move out checksum routines to segment buffer code
This moves out checksum routines in log writer to segbuf.c for
cleanup.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-05-10 11:32:30 +09:00
Ryusuke Konishi
1e2b68bf28 nilfs2: move pointer to super root block into logs
This moves a pointer to buffer storing super root block to each log
buffer from nilfs_sc_info struct for simplicity.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-05-10 11:32:30 +09:00
Ryusuke Konishi
277a6a3417 nilfs2: change default of 'errors' mount option to 'remount-ro' mode
Like ext3, nilfs has 'errors' mount option to allow specifying desired
behavior on severe errors.

Currently, the default action is 'errors=continue' and has potential
to advance filesystem corruption for severe errors.

This will change the action to 'errors=remount-ro' to avoid the issue.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-05-10 11:32:30 +09:00
Li Hong
73bb48869b nilfs2: Combine nilfs_btree_release_path() and nilfs_btree_free_path()
nilfs_btree_release_path() and nilfs_btree_free_path() are tightly coupled
to each other. Make them into one procedure to clarify the logic and avoid
some misuse.

Signed-off-by: Li Hong <lihong.hi@gmail.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-05-10 11:32:29 +09:00
Li Hong
f905440f5e nilfs2: Combine nilfs_btree_alloc_path() and nilfs_btree_init_path()
nilfs_btree_alloc_path() and nilfs_btree_init_path() are tightly coupled
to each other. Make them into one procedure to clarify the logic and avoid
some misuse.

Signed-off-by: Li Hong <lihong.hi@gmail.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-05-10 11:32:29 +09:00
J. Bruce Fields
5d4cec2f2f nfsd4: fix bare destroy_session null dereference
It's legal to send a DESTROY_SESSION outside any session (as the only
operation in a compound), in which case cstate->session will be NULL;
check for that case.

While we're at it, move these checks into a separate helper function.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2010-05-07 19:08:47 -04:00
Linus Torvalds
9167746716 Merge branch 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6
* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
  NFS: Fix RCU issues in the NFSv4 delegation code
  NFSv4: Fix the locking in nfs_inode_reclaim_delegation()
2010-05-07 13:59:48 -07:00
Joern Engel
6f485b4187 logfs: handle powerfail on NAND flash
The write buffer may not have been written and may no longer be written
due to an interrupted write in the affected page.

Signed-off-by: Joern Engel <joern@logfs.org>
2010-05-07 19:38:40 +02:00
Dan Carpenter
ccf31c10f1 logfs: handle errors from get_mtd_device()
The get_mtd_device() function returns error pointers on failure and if we
don't handle it, it leads to a crash.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Joern Engel <joern@logfs.org>
2010-05-07 14:16:09 +02:00
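The error-pointer convention mentioned in the get_mtd_device() commit above can be illustrated in userspace roughly as below. These are hypothetical re-creations written for this example (the kernel provides its own ERR_PTR, IS_ERR and PTR_ERR helpers), and `demo_get_device()` is an invented lookup, not the MTD API.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_ERRNO 4095

/* Encode a negative errno value in a pointer. */
static void *demo_err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

/* Recover the errno value from an error pointer. */
static long demo_ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

/* True if the pointer falls in the top DEMO_MAX_ERRNO addresses. */
static int demo_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-DEMO_MAX_ERRNO;
}

struct demo_device {
	int index;
};

static struct demo_device demo_table[2] = { { 0 }, { 1 } };

/* Hypothetical lookup that returns an error pointer, never NULL. */
static struct demo_device *demo_get_device(int n)
{
	if (n < 0 || n >= 2)
		return demo_err_ptr(-ENODEV);
	return &demo_table[n];
}

/* Caller checks for an error pointer before dereferencing. */
static int demo_open_device(int n, struct demo_device **out)
{
	struct demo_device *dev = demo_get_device(n);

	assert(out != NULL);
	if (demo_is_err(dev))
		return (int)demo_ptr_err(dev);
	*out = dev;
	return 0;
}
```

A plain NULL check would not catch such a return value, which is why an unhandled error pointer ends in a crash on first dereference.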
Joern Engel
58e323cf5e logfs: remove unused variable
Signed-off-by: Joern Engel <joern@logfs.org>
2010-05-07 14:15:04 +02:00
Steven Whitehouse
913a71d250 GFS2: Add some useful messages
The following patch adds a message to indicate when barriers have been
disabled due to a block device which doesn't support them. You could
already tell this via the mount options in /proc/mounts, but all the
other filesystems also log a message at the same time.

Also, the same mechanisms are used to indicate when the lock
demote interface has been used (only ever used for debugging)
which is a request from our support team.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-05-06 11:03:29 +01:00
Sage Weil
54ad023ba8 ceph: don't use writeback_control in writepages completion
The ->writepages writeback_control is no longer valid in the writepages
completion.  We were touching it solely to adjust pages_skipped when there
was a writeback error (EIO, ENOSPC, EPERM due to bad osd credentials),
causing an oops in the writeback code shortly thereafter.  Updating
pages_skipped on error isn't correct anyway, so let's just rip out this
(clearly broken) code to pass the wbc to the completion.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-05 21:31:40 -07:00
Sunil Mushran
0467ae954d ocfs2/dlm: Increase o2dlm lockres hash size
Lockres hash size of 16KB is far too small for large filesystems (where we
have hundreds of thousands of lock resources stored in the table).
This patch increases it to 128KB.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-05-05 18:20:01 -07:00
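As a rough worked example of the hash size change above: if each bucket is a single pointer-sized list head (an assumption, taken as 8 bytes on a 64-bit machine), the bucket counts and average chain lengths come out as sketched here. The function names and the figure of 200,000 lock resources are illustrative only.

```c
#include <assert.h>
#include <stddef.h>

/* Number of buckets that fit in a table of the given size. */
static size_t demo_hash_buckets(size_t table_bytes, size_t bucket_bytes)
{
	assert(bucket_bytes > 0);
	return table_bytes / bucket_bytes;
}

/* Average entries per bucket, rounded up, assuming a uniform hash. */
static size_t demo_avg_chain(size_t entries, size_t buckets)
{
	assert(buckets > 0);
	return (entries + buckets - 1) / buckets;
}
```

Under these assumptions, 16KB gives 2048 buckets and 128KB gives 16384, so 200,000 entries would average about 98 per bucket before the change and about 13 after.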
Tao Ma
c901fb0073 ocfs2: Make ocfs2_extend_trans() really extend.
In ocfs2, we use ocfs2_extend_trans() to extend a journal handle's
blocks. But if jbd2_journal_extend() fails, it will only restart
with the new number of blocks.  This tends to be awkward since
in most cases we want additional reserved blocks. It makes our code
harder to maintain since the caller can't be sure all the original
blocks will not be accessed and dirtied again.  There are 15 callers
of ocfs2_extend_trans() in fs/ocfs2, and 12 of them have to add
h_buffer_credits before they call ocfs2_extend_trans().  This makes
ocfs2_extend_trans() really extend atop the original block count.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-05-05 18:18:09 -07:00
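The behavior described in the ocfs2_extend_trans() commit above can be modeled with a toy journal handle. Everything here is a hypothetical mock (the `demo_` functions and the `room` field are invented for illustration); it only shows the "restart with old plus new credits" idea, not the real jbd2 interfaces.

```c
#include <assert.h>
#include <stddef.h>

/* Toy journal handle. */
struct demo_handle {
	int credits;  /* blocks currently reserved by the handle */
	int room;     /* blocks the mock journal can still grant in place */
};

/* Mock in-place extension: returns 0 on success, 1 if there is no room. */
static int demo_journal_extend(struct demo_handle *h, int nblocks)
{
	if (nblocks > h->room)
		return 1;
	h->room -= nblocks;
	h->credits += nblocks;
	return 0;
}

/* Mock restart: the handle comes back with exactly nblocks credits. */
static int demo_journal_restart(struct demo_handle *h, int nblocks)
{
	h->credits = nblocks;
	return 0;
}

/*
 * Extend atop the original count: if the in-place extension fails,
 * restart with the old credits plus the requested ones, so callers
 * no longer need to add the old credits themselves.
 */
static int demo_extend_trans(struct demo_handle *h, int nblocks)
{
	int old_credits;

	assert(h != NULL && nblocks >= 0);
	old_credits = h->credits;
	if (demo_journal_extend(h, nblocks) == 0)
		return 0;
	return demo_journal_restart(h, old_credits + nblocks);
}
```

Either path leaves the handle with the original credits plus the extension, which is the property callers previously had to arrange by hand.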
Tao Ma
3e4218df31 ocfs2/trivial: Code cleanup for allocation reservation.
Two tiny cleanups for allocation reservation.
1. Remove some extra code in ocfs2_local_alloc_find_clear_bits.
2. Remove an unused variable in ocfs2_find_resv_lhs.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Acked-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-05-05 18:18:09 -07:00
Tao Ma
b065556a7d ocfs2: make ocfs2_adjust_resv_from_alloc simple.
When we allocate some bits from the reservation, we always
allocate from r_start (see ocfs2_resmap_resv_bits).
So there should be no reason to check between r_start
and start. And I don't think we will change this behaviour
later by allocating from some bits after r_start.  Why not make
ocfs2_adjust_resv_from_alloc simple for now?

The only chance we have to adjust the reservation is when we haven't
reached the end. With this patch, the function is more readable.

Note:
btw, this patch also fixes an original bug in the function
which I haven't found before.
	if (end < ocfs2_resv_end(resv))
		rhs = end - ocfs2_resv_end(resv);
This code is of course buggy. ;)

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Acked-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-05-05 18:18:09 -07:00
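To see why the snippet quoted in the note above is suspicious, consider what happens if the operands are unsigned (an assumption made for this sketch). The condition guarantees that `end` is the smaller value, so the subtraction as written wraps around. Both functions below are hypothetical illustrations; the second merely swaps the operands and is not claimed to be the patch's actual replacement.

```c
#include <assert.h>
#include <limits.h>

/* Subtraction in the order quoted in the commit message. */
static unsigned int demo_rhs_as_quoted(unsigned int end, unsigned int resv_end)
{
	unsigned int rhs = 0;

	if (end < resv_end)
		rhs = end - resv_end;  /* end is smaller here, so this wraps */
	return rhs;
}

/* Same guard with the operands swapped, giving the remaining distance. */
static unsigned int demo_rhs_swapped(unsigned int end, unsigned int resv_end)
{
	unsigned int rhs = 0;

	if (end < resv_end)
		rhs = resv_end - end;
	return rhs;
}
```

For end = 10 and a reservation end of 15, the quoted order yields a value near UINT_MAX, while the swapped order yields 5.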
Sunil Mushran
4b37fcb7d4 ocfs2: Make nointr a default mount option
OCFS2 has never really supported intr. This patch acknowledges this reality
and makes nointr the default mount option. In a later patch, we intend to
support intr.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-05-05 18:18:08 -07:00
Sunil Mushran
5c80d4c9e5 ocfs2/dlm: Make o2dlm domain join/leave messages KERN_NOTICE
o2dlm join and leave messages are more than informational as they are
required for debugging locking issues. This patch changes them from
KERN_INFO to KERN_NOTICE.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-05-05 18:18:08 -07:00
Srinivas Eeda
23fd9abdc8 o2net: log socket state changes
This patch logs socket state changes that lead to socket shutdown.

Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-05-05 18:18:08 -07:00
Wengang Wang
a5196ec5ef ocfs2: print node # when tcp fails
Print the node number of a peer node if sending it a message failed.

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-05-05 18:18:08 -07:00
Mark Fasheh
83f92318fa ocfs2: Add dir_resv_level mount option
The default behavior for directory reservations stays the same, but we add a
mount option so people can tweak the size of directory reservations
according to their workloads.

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-05-05 18:18:07 -07:00
Mark Fasheh
b07f8f24df ocfs2: change default reservation window sizes
The default reservation size of 4 (32-bit windows) is a bit too ambitious.
Scale it back to 16 bits (resv_level=2). I have been testing various sizes
on a 4-node cluster which runs a mixed workload that is heavily threaded.
With a 256MB local alloc, I get *roughly* the following levels of average file
fragmentation:

resv_level=0	70%
resv_level=1	21%
resv_level=2	23%
resv_level=3	24%
resv_level=4	60%
resv_level=5	did not test
resv_level=6	60%

resv_level=2 seemed like a good compromise between not letting windows be
too small, but not so big that heavier workloads will immediately suffer
without tuning.

This patch also changes the behavior of directory reservations - they now
track file reservations.  The previous compromise of giving directory
windows only 8 bits wound up fragmenting more at some window sizes because
file allocations had smaller unused windows to poach from.

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-05-05 18:18:07 -07:00
Mark Fasheh
6b82021b9e ocfs2: increase the default size of local alloc windows
I have observed that the current size of 8M gives us pretty poor
fragmentation on multi-threaded workloads which do lots of writes.

Generally, I can increase the size of local alloc windows and observe a
marked decrease in fragmentation, even up and beyond window sizes of 512
megabytes. This makes sense for a couple reasons - larger local alloc means
more room for reservation windows. On multi-node workloads the larger local
alloc helps as well because we don't have to do window slides as often.

Also, I removed the OCFS2_DEFAULT_LOCAL_ALLOC_SIZE constant as it is no
longer used and the comment above it was out of date.

To test fragmentation, I used a workload which launched 4 threads that did
4k writes into a series of about 140 alternating files.

With resv_level=2, and a 4k/4k file system I observed the following average
fragmentation for various localalloc= parameters:

localalloc=	avg. fragmentation
	8		48
	32		16
	64		10
	120		7

On larger cluster sizes, the difference is more dramatic.

The new default size tops out at 256M, which we'll only get for cluster
sizes of 32K and above.

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-05-05 18:18:07 -07:00
Mark Fasheh
73c8a80003 ocfs2: clean up localalloc mount option size parsing
This patch pulls the local alloc sizing code into localalloc.c and provides
a callout to it from ocfs2_fill_super(). Behavior is essentially unchanged
except that I correctly calculate the maximum local alloc size. The old code
in ocfs2_parse_options() calculated the max size as:

ocfs2_local_alloc_size(sb) * 8

which is correct, in bits. Unfortunately though the option passed in is in
megabytes. Ultimately, this bug made no real difference - the shrink code
would catch a too-large size and bring it down to something reasonable.
Still, it's less than efficient as-is.

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-05-05 18:18:06 -07:00
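The unit mismatch described in the localalloc commit above (a limit computed in bits compared against an option given in megabytes) can be made concrete with a small conversion sketch. It assumes each bitmap bit tracks one cluster and that the cluster size is a power of two no larger than 1MB; the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Convert a count of bitmap bits (one bit per cluster, by assumption)
 * into megabytes, given log2 of the cluster size in bytes.
 */
static uint64_t demo_bits_to_megabytes(uint64_t bits,
				       unsigned int cluster_size_bits)
{
	assert(cluster_size_bits <= 20);
	return bits >> (20 - cluster_size_bits);
}
```

For example, 8192 bits of 4K clusters cover 32MB, while the same 8192 bits of 32K clusters cover 256MB, so comparing a megabyte option directly against a bit count overstates the limit.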
Mark Fasheh
a57c8fd2ad ocfs2: remove ocfs2_local_alloc_in_range()
Inodes are always allocated from the global bitmap now so we don't need this
any more. Also, the existing implementation bounces reservations around
needlessly.

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2010-05-05 18:17:31 -07:00
Mark Fasheh
33d5d380d6 ocfs2: allocate btree internal block groups from the global bitmap
Otherwise, the need for a very large contiguous allocation tends to
wreak havoc on many inode allocation reservations on the local alloc, thus
ruining any chances for contiguousness.

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2010-05-05 18:17:31 -07:00
Mark Fasheh
e3b4a97dbe ocfs2: use allocation reservations for directory data
Use the reservations system for unindexed dir tree allocations. We don't
bother with the indexed tree as reads from it are mostly random anyway.
Directory reservations are marked separately, to allow the reservations code
a chance to optimize their window sizes. This patch allocates only 8 bits
for directory windows as they generally are not expected to grow as quickly
as file data. Future improvements to dir window sizing can trivially be
made.

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2010-05-05 18:17:30 -07:00
Mark Fasheh
4fe370afaa ocfs2: use allocation reservations during file write
Add a per-inode reservations structure and pass it through to the
reservations code.

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2010-05-05 18:17:30 -07:00
Mark Fasheh
d02f00cc05 ocfs2: allocation reservations
This patch improves Ocfs2 allocation policy by allowing an inode to
reserve a portion of the local alloc bitmap for itself. The reserved
portion (allocation window) is advisory in that other allocation
windows might steal it if the local alloc bitmap becomes
full. Otherwise, the reservations are honored and guaranteed to be
free. When the local alloc window is moved to a different portion of
the bitmap, existing reservations are discarded.

Reservation windows are represented internally by a red-black
tree. Within that tree, each node represents the reservation window of
one inode. An LRU of active reservations is also maintained. When new
data is written, we allocate it from the inode's window. When all bits
in a window are exhausted, we allocate a new one as close to the
previous one as possible. Should we not find free space, an existing
reservation is pulled off the LRU and cannibalized.

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2010-05-05 18:17:30 -07:00
Joel Becker
ec20cec7a3 ocfs2: Make ocfs2_journal_dirty() void.
jbd[2]_journal_dirty_metadata() only returns 0.  It's been returning 0
since before the kernel moved to git.  There is no point in checking
this error.

ocfs2_journal_dirty() has been faithfully returning the status since the
beginning.  All over ocfs2, we have blocks of code checking this
can't-fail status.  In the past few years, we've tried to avoid adding these
checks, because they are pointless.  But anyone who looks at our code
assumes they are needed.

Finally, ocfs2_journal_dirty() is made a void function.  All error
checking is removed from other files.  We'll BUG_ON() the status of
jbd2_journal_dirty_metadata() just in case they change it someday.  They
won't.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-05-05 18:17:29 -07:00
James Morris
0ffbe2699c Merge branch 'master' into next
2010-05-06 10:56:07 +10:00
Steve French
bdfae149c5 [CIFS] Remove unused cifs_oplock_cachep
CC: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-05-06 00:38:16 +00:00
Jeff Layton
26efa0bac9 cifs: have decode_negTokenInit set flags in server struct
...rather than the secType. This allows us to get rid of the MSKerberos
securityEnum. The client just makes a decision at upcall time.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-05-05 23:24:11 +00:00
Jeff Layton
198b568278 cifs: break negotiate protocol calls out of cifs_setup_session
So that we can reasonably set up the secType based on both the
NegotiateProtocol response and the parsed mount options.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-05-05 23:18:27 +00:00
Joern Engel
c0c79c31c9 logfs: fix sync
Rather self-explanatory.

Signed-off-by: Joern Engel <joern@logfs.org>
2010-05-05 22:33:36 +02:00
Joern Engel
bba0b5c2c2 logfs: fix compile failure
When CONFIG_BLOCK is not enabled:

fs/logfs/super.c:142: error: implicit declaration of function 'bdev_get_queue'
fs/logfs/super.c:142: error: invalid type argument of '->' (have 'int')

Found by Randy Dunlap <randy.dunlap@oracle.com>

Signed-off-by: Joern Engel <joern@logfs.org>
2010-05-05 22:32:52 +02:00