Commit graph

15797 commits

Author SHA1 Message Date
Christoph Hellwig
f0fad8a530 writeback: merely wakeup flusher thread if work allocation fails for WB_SYNC_NONE
Since it's an opportunistic writeback and not a data integrity action,
don't punt to blocking writeback. Just wakeup the thread and it will
flush old data.

Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-09-16 15:16:18 +02:00
Jens Axboe
1fe06ad892 writeback: get rid of wbc->for_writepages
It's only set, it's never checked. Kill it.

Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-09-16 15:16:18 +02:00
Jens Axboe
2c96ce9f20 fs: remove bdev->bd_inode_backing_dev_info
It has been unused since it was introduced in:

commit 520808bf20e90fdbdb320264ba7dd5cf9d47dcac
Author: Andrew Morton <akpm@osdl.org>
Date:   Fri May 21 00:46:17 2004 -0700

    [PATCH] block device layer: separate backing_dev_info infrastructure

So lets just kill it.

Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-09-16 15:16:18 +02:00
Csaba Henk
79a9d99434 fuse: add fusectl interface to max_background
Make the max_background and congestion_threshold parameters of a FUSE
mount tunable at runtime by adding the respective knobs to its directory
within the fusectl filesystem.

Signed-off-by: Csaba Henk <csaba@gluster.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2009-09-16 14:15:29 +02:00
Csaba Henk
487ea5af63 fuse: limit user-specified values of max background requests
An untrusted user could DoS the system if s/he were allowed to accumulate an
arbitrary number of pending background requests by setting the above limits
to extremely high values in INIT. This patch excludes this possibility by
imposing global upper limits on the possible values of per-mount "max
background requests" and "congestion threshold" parameters for unprivileged
FUSE filesystems.

These global limits are implemented as module parameters.

Signed-off-by: Csaba Henk <csaba@gluster.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2009-09-16 14:15:29 +02:00
Csaba Henk
d6db07ded5 fuse: use drop_nlink() instead of direct nlink manipulation
drop_nlink() is the API function to decrease the link count of an inode.
However, at a place the control filesystem used the decrement operator
on i_nlink directly. Fix this.

Cc: Anand Avati <avati@gluster.com>
Signed-off-by: Csaba Henk <csaba@gluster.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2009-09-16 14:15:28 +02:00
Andi Kleen
465fdd97cb HWPOISON: Enable error_remove_page on btrfs
Cc: chris.mason@oracle.com

Signed-off-by: Andi Kleen <ak@linux.intel.com>
2009-09-16 11:50:18 +02:00
Andi Kleen
f590f333fb HWPOISON: Enable error_remove_page for NFS
Enable hardware memory error handling for NFS

Truncation of data pages at runtime should be safe in NFS,
even when it doesn't support migration so far.

Trond tells me migration is also queued up for 2.6.32.

Acked-by: Trond.Myklebust@netapp.com
Signed-off-by: Andi Kleen <ak@linux.intel.com>
2009-09-16 11:50:17 +02:00
Andi Kleen
aa261f549d HWPOISON: Enable .remove_error_page for migration aware file systems
Enable removing of corrupted pages through truncation
for a bunch of file systems: ext*, xfs, gfs2, ocfs2, ntfs
These should cover most server needs.

I chose the set of migration aware file systems for this
for now, assuming they have been especially audited.
But in general it should be safe for all file systems
on the data area that support read/write and truncate.

Caveat: the hardware error handler does not take i_mutex
for now before calling the truncate function. Is that ok?

Cc: tytso@mit.edu
Cc: hch@infradead.org
Cc: mfasheh@suse.com
Cc: aia21@cantab.net
Cc: hugh.dickins@tiscali.co.uk
Cc: swhiteho@redhat.com
Signed-off-by: Andi Kleen <ak@linux.intel.com>
2009-09-16 11:50:16 +02:00
Andi Kleen
6a46079cf5 HWPOISON: The high level memory error handler in the VM v7
Add the high level memory handler that poisons pages
that got corrupted by hardware (typically by a two bit flip in a DIMM
or a cache) on the Linux level. The goal is to prevent everyone
from accessing these pages in the future.

This done at the VM level by marking a page hwpoisoned
and doing the appropriate action based on the type of page
it is.

The code that does this is portable and lives in mm/memory-failure.c

To quote the overview comment:

High level machine check handler. Handles pages reported by the
hardware as being corrupted usually due to a 2bit ECC memory or cache
failure.

This focuses on pages detected as corrupted in the background.
When the current CPU tries to consume corruption the currently
running process can just be killed directly instead. This implies
that if the error cannot be handled for some reason it's safe to
just ignore it because no corruption has been consumed yet. Instead
when that happens another machine check will happen.

Handles page cache pages in various states. The tricky part
here is that we can access any page asynchronous to other VM
users, because memory failures could happen anytime and anywhere,
possibly violating some of their assumptions. This is why this code
has to be extremely careful. Generally it tries to use normal locking
rules, as in get the standard locks, even if that means the
error handling takes potentially a long time.

Some of the operations here are somewhat inefficient and have non
linear algorithmic complexity, because the data structures have not
been optimized for this case. This is in particular the case
for the mapping from a vma to a process. Since this case is expected
to be rare we hope we can get away with this.

There are in principle two strategies to kill processes on poison:
- just unmap the data and wait for an actual reference before
killing
- kill as soon as corruption is detected.
Both have advantages and disadvantages and should be used
in different situations. Right now both are implemented and can
be switched with a new sysctl vm.memory_failure_early_kill
The default is early kill.

The patch does some rmap data structure walking on its own to collect
processes to kill. This is unusual because normally all rmap data structure
knowledge is in rmap.c only. I put it here for now to keep
everything together and rmap knowledge has been seeping out anyways

Includes contributions from Johannes Weiner, Chris Mason, Fengguang Wu,
Nick Piggin (who did a lot of great work) and others.

Cc: npiggin@suse.de
Cc: riel@redhat.com
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
2009-09-16 11:50:15 +02:00
Alex Elder
fdec29c5fc Merge branch 'master' of git://oss.sgi.com/xfs/xfs into for-linus
Conflicts:
	fs/xfs/linux-2.6/xfs_lrw.c
2009-09-15 21:37:47 -05:00
Ricardo Labiaga
b09333c464 nfsd41: Refactor create_client()
Move common initialization of 'struct nfs4_client' inside create_client().

Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>

[nfsd41: Remember the auth flavor to use for callbacks]
Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-09-15 20:52:13 -04:00
Alexandros Batsakis
3ddc8bf5f3 nfsd41: modify nfsd4.1 backchannel to use new xprt class
This patch enables the use of the nfsv4.1 backchannel.

Signed-off-by: Alexandros Batsakis <batsakis@netapp.com>
[initialize rpc_create_args.bc_xprt too]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-09-15 20:52:13 -04:00
Ricardo Labiaga
0421b5c55a nfsd41: Backchannel: Implement cb_recall over NFSv4.1
Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
[nfsd41: cb_recall callback]
[Share v4.0 and v4.1 back channel xdr]
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Ricardo Labiaga <ricardo.labiaga@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[Share v4.0 and v4.1 back channel xdr]
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfsd41: use nfsd4_cb_sequence for callback minorversion]
[nfsd41: conditionally decode_sequence in nfs4_xdr_dec_cb_recall]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfsd41: Backchannel: Add sequence arguments to callback RPC arguments]
Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
[pulled-in definition of nfsd4_cb_done]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-09-15 20:52:12 -04:00
Benny Halevy
2af73580b7 nfsd41: Backchannel: cb_sequence callback
Implement the cb_sequence callback conforming to draft-ietf-nfsv4-minorversion1

Note: highest slot id and target highest slot id do not have to be 0
as was previously implemented.  They can be greater than what the
nfs server sent if the client supports a larger slot table on the
backchannel.  At this point we just ignore that.

Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
[Rework the back channel xdr using the shared v4.0 and v4.1 framework.]
Signed-off-by: Andy Adamson <andros@netapp.com>
[fixed indentation]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfsd41: use nfsd4_cb_sequence for callback minorversion]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfsd41: fix verification of CB_SEQUENCE highest slot id[
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfsd41: Backchannel: Remove old backchannel serialization]
[nfsd41: Backchannel: First callback sequence ID should be 1]
Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfsd41: decode_cb_sequence does not need to actually decode ignored fields]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-09-15 20:49:56 -04:00
Ricardo Labiaga
2a1d1b5938 nfsd41: Backchannel: Setup sequence information
Follows the model used by the NFS client.  Setup the RPC prepare and done
function pointers so that we can populate the sequence information if
minorversion == 1.  rpc_run_task() is then invoked directly just like
existing NFS client operations do.

nfsd4_cb_prepare() determines if the sequence information needs to be setup.
If the slot is in use, it adds itself to the wait queue.

nfsd4_cb_done() wakes anyone sleeping on the callback channel wait queue
after our RPC reply has been received.  It also sets the task message
result pointer to NULL to clearly indicate we're done using it.

Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
[define and initialize cl_cb_seq_nr here]
[pulled out unused defintion of nfsd4_cb_done]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-09-15 20:49:56 -04:00
Ricardo Labiaga
199ff35e1c nfsd41: Backchannel: Server backchannel RPC wait queue
RPC callback requests will wait on this wait queue if the backchannel
is out of slots.

Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-09-15 20:49:55 -04:00
Ricardo Labiaga
132f97715c nfsd41: Backchannel: Add sequence arguments to callback RPC arguments
Follow the model we use in the client. Make the sequence arguments
part of the regular RPC arguments.  None of the callbacks that are
soon to be implemented expect results that need to be passed back
to the caller, so we don't define a separate RPC results structure.
For session validation, the cb_sequence decoding will use a pointer
to the sequence arguments that are part of the RPC argument.

Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
[define struct nfsd4_cb_sequence here]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-09-15 20:49:55 -04:00
Andy Adamson
38524ab38f nfsd41: Backchannel: callback infrastructure
Keep the xprt used for create_session in cl_cb_xprt.
Mark cl_callback.cb_minorversion = 1 and remember
the client provided cl_callback.cb_prog rpc program number.
Use it to probe the callback path.

Use the client's network address to initialize as the
callback's address as expected by the xprt creation
routines.

Define xdr sizes and code nfs4_cb_compound header to be able
to send a null callback rpc.

Signed-off-by: Andy Adamson<andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
[get callback minorversion from fore channel's]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfsd41: change bc_sock to bc_xprt]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[pulled definition for cl_cb_xprt]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfsd41: set up backchannel's cb_addr]
[moved rpc_create_args init to "nfsd: modify nfsd4.1 backchannel to use new xprt class"]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-09-15 20:49:55 -04:00
J. Bruce Fields
80fc015bdf nfsd4: use common rpc_cred for all callbacks
Callbacks are always made using the machine's identity, so we can use a
single auth_generic credential shared among callbacks to all clients and
let the rpc code take care of the rest.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-09-15 20:49:34 -04:00
J. Bruce Fields
29ab23cc5d nfsd4: allow nfs4 state startup to fail
The failure here is pretty unlikely, but we should handle it anyway.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-09-15 20:49:33 -04:00
J. Bruce Fields
886e3b7fe6 nfsd4: fix null dereference creating nfsv4 callback client
On setting up the callback to the client, we attempt to use the same
authentication flavor the client did.  We find an rpc cred to use by
calling rpcauth_lookup_credcache(), which assumes that the given
authentication flavor has a credentials cache.  However, this is not
required to be true--in particular, auth_null does not use one.
Instead, we should call the auth's lookup_cred() method.

Without this, a client attempting to mount using nfsv4 and auth_null
triggers a null dereference.

Cc: stable@kernel.org
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-09-15 20:49:33 -04:00
Chris Mason
6e74057c46 Btrfs: Fix async thread shutdown race
It was possible for an async worker thread to be selected to
receive a new work item, but exit before the work item was
actually placed into that thread's work list.

This commit fixes the race by incrementing the num_pending
counter earlier, and making sure to check the number of pending
work items before a thread exits.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-15 20:20:17 -04:00
Chris Mason
627e421a3f Btrfs: fix worker thread double spin_lock_irq
The exit-on-idle code for async worker threads was incorrectly
calling spin_lock_irq with interrupts already off.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-15 20:20:17 -04:00
Chris Mason
3e99d8eb34 Btrfs: fix async worker startup race
After a new worker thread starts, it is placed into the
list of idle threads.  But, this may race with a
check for idle done by the worker thread itself, resulting
in a double list_add operation.

This fix adds a check to make sure the idle thread addition
is done properly.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-15 20:20:16 -04:00
Jeff Layton
48541bd3dd cifs: have cifsFileInfo hold an extra inode reference
It's possible that this struct will outlive the filp to which it is
attached. If it does and it needs to do some work on the inode, then
it'll need a reference.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-09-15 19:45:25 +00:00
Jeff Layton
058daf4f67 cifs: take read lock on GlobalSMBSes_lock in is_valid_oplock_break
...rather than a write lock. It doesn't change the list so a read lock
should be sufficient.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-09-15 19:45:10 +00:00
Jeff Layton
495e993745 cifs: remove cifsInodeInfo.oplockPending flag
It's set on oplock break but nothing ever looks at it.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-09-15 19:45:06 +00:00
Jeff Layton
590a3fe0e1 cifs: fix oplock request handling in posix codepath
cifs_posix_open takes a "poplock" argument that's intended to be used in
the actual posix open call to set the "Flags" field. It ignores this
value however and declares an "oplock" parameter on the stack that it
passes uninitialized to the CIFSPOSIXOpen function. Not only does this
mean that the oplock request flags are bogus, but the result that's
expected to be in that variable is unchanged.

Fix this, and also clean up the type of the oplock parameter used. Since
it's expected to be __u32, we should use that everywhere and not
implicitly cast it from a signed type.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-09-15 19:45:03 +00:00
Chuck Ebbert
20d1752f3d [CIFS] Re-enable Lanman security
commit ac68392460 ("[CIFS] Allow raw
ntlmssp code to be enabled with sec=ntlmssp") added a new bit to the
allowed security flags mask but seems to have inadvertently removed
Lanman security from the allowed flags. Add it back.

CC: Stable <stable@kernel.org>
Signed-off-by: Chuck Ebbert <cebbert@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-09-15 19:38:05 +00:00
Jaswinder Singh Rajput
9ef96da6ec xfs: includecheck fix for fs/xfs/xfs_iops.c
fix the following 'make includecheck' warning:

  fs/xfs/linux-2.6/xfs_iops.c: xfs_acl.h is included more than once.

Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2009-09-15 12:30:30 -05:00
Alexey Dobriyan
361735fd8f xfs: switch to seq_file
create_proc_read_entry() is getting deprecated.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2009-09-15 12:29:24 -05:00
David Brownell
a4dbd6740d driver model: constify attribute groups
Let attribute group vectors be declared "const".  We'd
like to let most attribute metadata live in read-only
sections... this is a start.

Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-09-15 09:50:47 -07:00
Artem Bityutskiy
be9e62a730 UBIFS: improve lprops dump
Improve 'dbg_dump_lprop()' and print dark and dead space there,
decode flags, and journal heads.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2009-09-15 17:09:48 +03:00
Artem Bityutskiy
055da1b704 UBIFS: various minor commentary fixes
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2009-09-15 17:09:24 +03:00
Artem Bityutskiy
77a7ae580c UBIFS: improve journal head debugging prints
Convert the journal head integer into the head name when printing
debugging information.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2009-09-15 17:05:06 +03:00
Artem Bityutskiy
d6d140097b UBIFS: define journal head numbers in ubifs-media.h
The journal head names and numbers are part of the UBIFS format, so
they should be in the ubifs-media.h.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2009-09-15 14:45:35 +03:00
Theodore Ts'o
3661d28615 ext4: Fix include/trace/events/ext4.h to work with Systemtap
Using relative pathnames in #include statements interacts badly with
SystemTap, since the fs/ext4/*.h header files are not packaged up as
part of a distribution kernel's header files.  Since systemtap doesn't
use TP_fast_assign(), we can use a blind structure definition and then
make sure the needed header files are defined before the ext4 source
files #include the trace/events/ext4.h header file.

https://bugzilla.redhat.com/show_bug.cgi?id=512478

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-14 22:59:50 -04:00
Linus Torvalds
355bbd8cb8 Merge branch 'for-2.6.32' of git://git.kernel.dk/linux-2.6-block
* 'for-2.6.32' of git://git.kernel.dk/linux-2.6-block: (29 commits)
  block: use blkdev_issue_discard in blk_ioctl_discard
  Make DISCARD_BARRIER and DISCARD_NOBARRIER writes instead of reads
  block: don't assume device has a request list backing in nr_requests store
  block: Optimal I/O limit wrapper
  cfq: choose a new next_req when a request is dispatched
  Seperate read and write statistics of in_flight requests
  aoe: end barrier bios with EOPNOTSUPP
  block: trace bio queueing trial only when it occurs
  block: enable rq CPU completion affinity by default
  cfq: fix the log message after dispatched a request
  block: use printk_once
  cciss: memory leak in cciss_init_one()
  splice: update mtime and atime on files
  block: make blk_iopoll_prep_sched() follow normal 0/1 return convention
  cfq-iosched: get rid of must_alloc flag
  block: use interrupts disabled version of raise_softirq_irqoff()
  block: fix comment in blk-iopoll.c
  block: adjust default budget for blk-iopoll
  block: fix long lines in block/blk-iopoll.c
  block: add blk-iopoll, a NAPI like approach for block devices
  ...
2009-09-14 17:55:15 -07:00
Linus Torvalds
4142e0d1de Merge branch 'osync_cleanup' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6
* 'osync_cleanup' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6:
  fsync: wait for data writeout completion before calling ->fsync
  vfs: Remove generic_osync_inode() and sync_page_range{_nolock}()
  fat: Opencode sync_page_range_nolock()
  pohmelfs: Use new syncing helper
  xfs: Convert sync_page_range() to simple filemap_write_and_wait_range()
  ocfs2: Update syncing after splicing to match generic version
  ntfs: Use new syncing helpers and update comments
  ext4: Remove syncing logic from ext4_file_write
  ext3: Remove syncing logic from ext3_file_write
  ext2: Update comment about generic_osync_inode
  vfs: Introduce new helpers for syncing after writing to O_SYNC file or IS_SYNC inode
  vfs: Rename generic_file_aio_write_nolock
  ocfs2: Use __generic_file_aio_write instead of generic_file_aio_write_nolock
  pohmelfs: Use __generic_file_aio_write instead of generic_file_aio_write_nolock
  vfs: Remove syncing from generic_file_direct_write() and generic_file_buffered_write()
  vfs: Export __generic_file_aio_write() and add some comments
  vfs: Introduce filemap_fdatawait_range
2009-09-14 14:36:47 -07:00
Linus Torvalds
33f1de6931 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw
* 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw:
  GFS2: Whitespace fixes
  GFS2: Remove unused sysfs file
  GFS2: Be extra careful about deallocating inodes
  GFS2: Remove no_formal_ino generating code
  GFS2: Rename eattr.[ch] as xattr.[ch]
  GFS2: Clean up of extended attribute support
  GFS2: Add explanation of extended attr on-disk format
  GFS2: Add "-o errors=panic|withdraw" mount options
  GFS2: jumping to wrong label?
  GFS2: free disk inode which is deleted by remote node -V2
  GFS2: Add a document explaining GFS2's uevents
  GFS2: Add sysfs link to device
  GFS2: Replace assertion with proper error handling
  GFS2: Improve error handling in inode allocation
  GFS2: Add some more info to uevents
  GFS2: Add online uevent to GFS2
2009-09-14 14:35:56 -07:00
Linus Torvalds
041d6d0be8 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-udf-2.6
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-udf-2.6:
  udf: Fix possible corruption when close races with write
  udf: Perform preallocation only for regular files
  udf: Remove wrong assignment in udf_symlink
  udf: Remove dead code
2009-09-14 14:35:07 -07:00
Linus Torvalds
af8cb8aa38 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2: (21 commits)
  fs/Kconfig: move nilfs2 outside misc filesystems
  nilfs2: convert nilfs_bmap_lookup to an inline function
  nilfs2: allow btree code to directly call dat operations
  nilfs2: add update functions of virtual block address to dat
  nilfs2: remove individual gfp constants for each metadata file
  nilfs2: stop zero-fill of btree path just before free it
  nilfs2: remove unused btree argument from btree functions
  nilfs2: remove nilfs_dat_abort_start and nilfs_dat_abort_free
  nilfs2: shorten freeze period due to GC in write operation v3
  nilfs2: add more check routines in mount process
  nilfs2: An unassigned variable is assigned to a never used structure member
  nilfs2: use GFP_NOIO for bio_alloc instead of GFP_NOWAIT
  nilfs2: stop using periodic write_super callback
  nilfs2: clean up nilfs_write_super
  nilfs2: fix disorder of nilfs_write_super in nilfs_sync_fs
  nilfs2: remove redundant super block commit
  nilfs2: implement nilfs_show_options to display mount options in /proc/mounts
  nilfs2: always lookup disk block address before reading metadata block
  nilfs2: use semaphore to protect pointer to a writable FS-instance
  nilfs2: fix format string compile warning (ino_t)
  ...
2009-09-14 14:34:33 -07:00
Linus Torvalds
6cdb5930a6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
  cifs: consolidate reconnect logic in smb_init routines
  cifs: Replace wrtPending with a real reference count
  cifs: protect GlobalOplock_Q with its own spinlock
  cifs: use tcon pointer in cifs_show_options
  cifs: send IPv6 addr in upcall with colon delimiters
  [CIFS] Fix checkpatch warnings
  PATCH] cifs: fix broken mounts when a SSH tunnel is used (try #4)
  [CIFS] Memory leak in ntlmv2 hash calculation
  [CIFS] potential NULL dereference in parse_DFS_referrals()
2009-09-14 14:33:13 -07:00
Linus Torvalds
d7e9660ad9 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1623 commits)
  netxen: update copyright
  netxen: fix tx timeout recovery
  netxen: fix file firmware leak
  netxen: improve pci memory access
  netxen: change firmware write size
  tg3: Fix return ring size breakage
  netxen: build fix for INET=n
  cdc-phonet: autoconfigure Phonet address
  Phonet: back-end for autoconfigured addresses
  Phonet: fix netlink address dump error handling
  ipv6: Add IFA_F_DADFAILED flag
  net: Add DEVTYPE support for Ethernet based devices
  mv643xx_eth.c: remove unused txq_set_wrr()
  ucc_geth: Fix hangs after switching from full to half duplex
  ucc_geth: Rearrange some code to avoid forward declarations
  phy/marvell: Make non-aneg speed/duplex forcing work for 88E1111 PHYs
  drivers/net/phy: introduce missing kfree
  drivers/net/wan: introduce missing kfree
  net: force bridge module(s) to be GPL
  Subject: [PATCH] appletalk: Fix skb leak when ipddp interface is not loaded
  ...

Fixed up trivial conflicts:

 - arch/x86/include/asm/socket.h

   converted to <asm-generic/socket.h> in the x86 tree.  The generic
   header has the same new #define's, so that works out fine.

 - drivers/net/tun.c

   fix conflict between 89f56d1e9 ("tun: reuse struct sock fields") that
   switched over to using 'tun->socket.sk' instead of the redundantly
   available (and thus removed) 'tun->sk', and 2b980dbd ("lsm: Add hooks
   to the TUN driver") which added a new 'tun->sk' use.

   Noted in 'next' by Stephen Rothwell.
2009-09-14 10:37:28 -07:00
Jan Kara
cbc8cc3352 udf: Fix possible corruption when close races with write
When we close a file, we remove preallocated blocks from it. But this
truncation was not protected by i_mutex and thus it could have raced with a
write through a different fd and cause crashes or even filesystem corruption.

Signed-off-by: Jan Kara <jack@suse.cz>
2009-09-14 19:13:01 +02:00
Jan Kara
81056dd044 udf: Perform preallocation only for regular files
So far we preallocated blocks also for directories but that brings a
problem, when to get rid of preallocated blocks we don't need. So far
we removed them in udf_clear_inode() which has a disadvantage that
1) blocks are unavailable long after writing to a directory finished
   and thus one can get out of space unnecessarily early
2) releasing blocks from udf_clear_inode is problematic because VFS
   does not expect us to redirty inode there and it also slows down
   memory reclaim.

So preallocate blocks only for regular files where we can drop preallocation
in udf_release_file.

Signed-off-by: Jan Kara <jack@suse.cz>
2009-09-14 19:13:00 +02:00
Jan Kara
7c6e3d1aae udf: Remove wrong assignment in udf_symlink
Recomputation of the pointer was wrong (it should have been just increment).
Luckily, we never use the computed value. Remove it.

Signed-off-by: Jan Kara <jack@suse.cz>
2009-09-14 19:13:00 +02:00
Jan Kara
5891d9dd2a udf: Remove dead code
Remove code that gets never used.

Signed-off-by: Jan Kara <jack@suse.cz>
2009-09-14 19:13:00 +02:00
Christoph Hellwig
2daea67e96 fsync: wait for data writeout completion before calling ->fsync
Currenly vfs_fsync(_range) first calls filemap_fdatawrite to write out
the data, the calls into ->fsync to write out the metadata and then finally
calls filemap_fdatawait to wait for the data I/O to complete.  What sounds
like a clever micro-optimization actually is nast trap for many filesystems.

For many modern filesystems i_size or other inode information is only
updated on I/O completion and we need to wait for I/O to finish before
we can write out the metadata.  For old fashionen filesystems that
instanciate blocks during the actual write and also update the metadata
at that point it opens up a large window were we could expose uninitialized
blocks after a crash.  While a few filesystems that need it already wait
for the I/O to finish inside their ->fsync methods it is rather suboptimal
as it is done under the i_mutex and also always for the whole file instead
of just a part as we could do for O_SYNC handling.

Here is a small audit of all fsync instances in the tree:

 - spufs_mfc_fsync:
 - ps3flash_fsync:
 - vol_cdev_fsync:
 - printer_fsync:
 - fb_deferred_io_fsync:
 - bad_file_fsync:
 - simple_sync_file:

	don't care - filesystems/drivers do't use the page cache or are
	purely in-memory.

 - simple_fsync:
 - file_fsync:
 - affs_file_fsync:
 - fat_file_fsync:
 - jfs_fsync:
 - ubifs_fsync:
 - reiserfs_dir_fsync:
 - reiserfs_sync_file:

	never touch pagecache themselves.  We need to wait before if we do
	not want to expose stale data after an allocation.

 - afs_fsync:
 - fuse_fsync_common:

	do the waiting writeback itself in awkward ways, would benefit from
	proper semantics

 - block_fsync:

	Does a filemap_write_and_wait on the block device inode.  Because we
	now have f_mapping that is the same inode we call it on in vfs_fsync.
	So just removing it and letting the VFS do the work in one go would
	be an improvement.

 - btrfs_sync_file:
 - cifs_fsync:
 - xfs_file_fsync:

	need the wait first and currently do it themselves. would benefit from
	doing it outside i_mutex.

 - coda_fsync:
 - ecryptfs_fsync:
 - exofs_file_fsync:
 - shm_fsync:

	only passes the fsync through to the lower layer

 - ext3_sync_file:

	doesn't seem to care, comments are confusing.

 - ext4_sync_file:

	would need the wait to work correctly for delalloc mode with late
	i_size updates.  Otherwise the ext3 comment applies.

	currently implemens it's own writeback and wait in an odd way,
	could benefit from doing it properly.

 - gfs2_fsync:

	not needed for journaled data mode, but probably harmless there.
	Currently writes back data asynchronously itself.  Needs some
	major audit.

 - hostfs_fsync:

	just calls fsync/datasync on the host FD.  Without the wait before
	data might not even be inflight yet if we're unlucky.

 - hpfs_file_fsync:
 - ncp_fsync:

	no-ops.  Dangerous before and after.

 - jffs2_fsync:

	just calls jffs2_flush_wbuf_gc, not sure how this relates to data.

 - nfs_fsync_dir:

	just increments stats, claims all directory operations are synchronous

 - nfs_file_fsync:

	only writes out data???  Looks very odd.

 - nilfs_sync_file:

	looks like it expects all data done, but not sure from the code

 - ntfs_dir_fsync:
 - ntfs_file_fsync:

	appear to do their own data writeback.  Very convoluted code.

 - ocfs2_sync_file:

	does it's own data writeback, but no wait.  probably needs the wait.

 - smb_fsync:

	according to a comment expects all pages written already, probably needs
	the wait before.

This patch only changes vfs_fsync_range, removal of the wait in the methods
that have it is left to the filesystem maintainers.  Note that most
filesystems really do need an audit for their fsync methods given the
gems found in this very brief audit.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2009-09-14 17:08:17 +02:00