Commit graph

11588 commits

Author SHA1 Message Date
Andrew Morton
71c5576fbd revert "percpu counter: clean up percpu_counter_sum_and_set()"
Revert

    commit 1f7c14c62c
    Author: Mingming Cao <cmm@us.ibm.com>
    Date:   Thu Oct 9 12:50:59 2008 -0400

        percpu counter: clean up percpu_counter_sum_and_set()

Before this patch we had the following:

percpu_counter_sum(): return the percpu_counter's value

percpu_counter_sum_and_set(): return the percpu_counter's value, copying
that value into the central value and zeroing the per-cpu counters before
returning.

After this patch, percpu_counter_sum_and_set() has gone, and
percpu_counter_sum() gets the old percpu_counter_sum_and_set()
functionality.

Problem is, as Eric points out, the old percpu_counter_sum_and_set()
functionality was racy and wrong.  It zeroes out counters on "other" cpus,
without holding any locks which will prevent races agaist updates from
those other CPUS.

This patch reverts 1f7c14c62c.  This means
that percpu_counter_sum_and_set() still has the race, but
percpu_counter_sum() does not.

Note that this is not a simple revert - ext4 has since started using
percpu_counter_sum() for its dirty_blocks counter as well.

Note that this revert patch changes percpu_counter_sum() semantics.

Before the patch, a call to percpu_counter_sum() will bring the counter's
central counter mostly up-to-date, so a following percpu_counter_read()
will return a close value.

After this patch, a call to percpu_counter_sum() will leave the counter's
central accumulator unaltered, so a subsequent call to
percpu_counter_read() can now return a significantly inaccurate result.

If there is any code in the tree which was introduced after
e8ced39d5e was merged, and which depends
upon the new percpu_counter_sum() semantics, that code will break.

Reported-by: Eric Dumazet <dada1@cosmosbay.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mingming Cao <cmm@us.ibm.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-12-10 08:01:52 -08:00
Chris Mason
459931eca5 Btrfs: Delete csum items when freeing extents
This finishes off the new checksumming code by removing csum items
for extents that are no longer in use.

The trick is doing it without racing because a single csum item may
hold csums for more than one extent.  Extra checks are added to
btrfs_csum_file_blocks to make sure that we are using the correct
csum item after dropping locks.

A new btrfs_split_item is added to split a single csum item so it
can be split without dropping the leaf lock.  This is used to
remove csum bytes from the middle of an item.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-12-10 09:10:46 -05:00
Roland McGrath
85f334666a tracehook: exec double-reporting fix
The patch 6341c39 "tracehook: exec" introduced a small regression in
2.6.27 regarding binfmt_misc exec event reporting.  Since the reporting
is now done in the common search_binary_handler() function, an exec
of a misc binary will result in two (or possibly multiple) exec events
being reported, instead of just a single one, because the misc handler
contains a recursive call to search_binary_handler.

To add to the confusion, if PTRACE_O_TRACEEXEC is not active, the multiple
SIGTRAP signals will in fact cause only a single ptrace intercept, as the
signals are not queued.  However, if PTRACE_O_TRACEEXEC is on, the debugger
will actually see multiple ptrace intercepts (PTRACE_EVENT_EXEC).

The test program included below demonstrates the problem.

This change fixes the bug by calling tracehook_report_exec() only in the
outermost search_binary_handler() call (bprm->recursion_depth == 0).

The additional change to restore bprm->recursion_depth after each binfmt
load_binary call is actually superfluous for this bug, since we test the
value saved on entry to search_binary_handler().  But it keeps the use of
of the depth count to its most obvious expected meaning.  Depending on what
binfmt handlers do in certain cases, there could have been false-positive
tests for recursion limits before this change.

    /* Test program using PTRACE_O_TRACEEXEC.
       This forks and exec's the first argument with the rest of the arguments,
       while ptrace'ing.  It expects to see one PTRACE_EVENT_EXEC stop and
       then a successful exit, with no other signals or events in between.

       Test for kernel doing two PTRACE_EVENT_EXEC stops for a binfmt_misc exec:

       $ gcc -g traceexec.c -o traceexec
       $ sudo sh -c 'echo :test:M::foobar::/bin/cat: > /proc/sys/fs/binfmt_misc/register'
       $ echo 'foobar test' > ./foobar
       $ chmod +x ./foobar
       $ ./traceexec ./foobar; echo $?
       ==> good <==
       foobar test
       0
       $
       ==> bad <==
       foobar test
       unexpected status 0x4057f != 0
       3
       $

    */

    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <sys/ptrace.h>
    #include <unistd.h>
    #include <signal.h>
    #include <stdlib.h>

    static void
    wait_for (pid_t child, int expect)
    {
      int status;
      pid_t p = wait (&status);
      if (p != child)
	{
	  perror ("wait");
	  exit (2);
	}
      if (status != expect)
	{
	  fprintf (stderr, "unexpected status %#x != %#x\n", status, expect);
	  exit (3);
	}
    }

    int
    main (int argc, char **argv)
    {
      pid_t child = fork ();

      if (child < 0)
	{
	  perror ("fork");
	  return 127;
	}
      else if (child == 0)
	{
	  ptrace (PTRACE_TRACEME);
	  raise (SIGUSR1);
	  execv (argv[1], &argv[1]);
	  perror ("execve");
	  _exit (127);
	}

      wait_for (child, W_STOPCODE (SIGUSR1));

      if (ptrace (PTRACE_SETOPTIONS, child,
		  0L, (void *) (long) PTRACE_O_TRACEEXEC) != 0)
	{
	  perror ("PTRACE_SETOPTIONS");
	  return 4;
	}

      if (ptrace (PTRACE_CONT, child, 0L, 0L) != 0)
	{
	  perror ("PTRACE_CONT");
	  return 5;
	}

      wait_for (child, W_STOPCODE (SIGTRAP | (PTRACE_EVENT_EXEC << 8)));

      if (ptrace (PTRACE_CONT, child, 0L, 0L) != 0)
	{
	  perror ("PTRACE_CONT");
	  return 6;
	}

      wait_for (child, W_EXITCODE (0, 0));

      return 0;
    }

Reported-by: Arnd Bergmann <arnd@arndb.de>
CC: Ulrich Weigand <ulrich.weigand@de.ibm.com>
Signed-off-by: Roland McGrath <roland@redhat.com>
2008-12-09 19:36:38 -08:00
J. Bruce Fields
a4f4d6df53 EXPORTFS: handle NULL returns from fh_to_dentry()/fh_to_parent()
While 440037287c "[PATCH] switch all filesystems over to
d_obtain_alias" removed some cases where fh_to_dentry() and
fh_to_parent() could return NULL, there are still a few NULL returns
left in individual filesystems.  Thus it was a mistake for that commit
to remove the handling of NULL returns in the callers.

Revert those parts of 440037287c which removed the NULL handling.

(We could, alternatively, modify all implementations to return -ESTALE
instead of NULL, but that proves to require fixing a number of
filesystems, and in some cases it's arguably more natural to return
NULL.)

Thanks to David for original patch and Linus, Christoph, and Hugh for
review.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Cc: David Howells <dhowells@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-12-08 19:49:32 -08:00
Chris Mason
580afd76e4 Btrfs: Fix compressed checksum fsync log copies
The fsync logging code makes sure to onl copy the relevant checksum for each
extent based on the file extent pointers it finds.

But for compressed extents, it needs to copy the checksum for the
entire extent.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-12-08 19:15:39 -05:00
Chris Mason
c3027eb552 Btrfs: Add inode sequence number for NFS and reserved space in a few structs
This adds a sequence number to the btrfs inode that is increased on
every update.  NFS will be able to use that to detect when an inode has
changed, without relying on inaccurate time fields.

While we're here, this also:

Puts reserved space into the super block and inode

Adds a log root transid to the super so we can pick the newest super
based on the fsync log as well as the main transaction ID.  For now
the log root transid is always zero, but that'll get fixed.

Adds a starting offset to the dev_item.  This will let us do better
alignment calculations if we know the start of a partition on the disk.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-12-08 16:40:21 -05:00
Chris Mason
934d375bac Btrfs: Use map_private_extent_buffer during generic_bin_search
It is possible that generic_bin_search will be called on a tree block
that has not been locked.  This happens because cache_block_block skips
locking on the tree blocks.

Since the tree block isn't locked, we aren't allowed to change
the extent_buffer->map_token field.  Using map_private_extent_buffer
avoids any changes to the internal extent buffer fields.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-12-08 16:43:10 -05:00
Yan Zheng
a512bbf855 Btrfs: superblock duplication
This patch implements superblock duplication. Superblocks
are stored at offset 16K, 64M and 256G on every devices.
Spaces used by superblocks are preserved by the allocator,
which uses a reverse mapping function to find the logical
addresses that correspond to superblocks. Thank you,

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2008-12-08 16:46:26 -05:00
Chris Mason
d20f7043fa Btrfs: move data checksumming into a dedicated tree
Btrfs stores checksums for each data block.  Until now, they have
been stored in the subvolume trees, indexed by the inode that is
referencing the data block.  This means that when we read the inode,
we've probably read in at least some checksums as well.

But, this has a few problems:

* The checksums are indexed by logical offset in the file.  When
compression is on, this means we have to do the expensive checksumming
on the uncompressed data.  It would be faster if we could checksum
the compressed data instead.

* If we implement encryption, we'll be checksumming the plain text and
storing that on disk.  This is significantly less secure.

* For either compression or encryption, we have to get the plain text
back before we can verify the checksum as correct.  This makes the raid
layer balancing and extent moving much more expensive.

* It makes the front end caching code more complex, as we have touch
the subvolume and inodes as we cache extents.

* There is potentitally one copy of the checksum in each subvolume
referencing an extent.

The solution used here is to store the extent checksums in a dedicated
tree.  This allows us to index the checksums by phyiscal extent
start and length.  It means:

* The checksum is against the data stored on disk, after any compression
or encryption is done.

* The checksum is stored in a central location, and can be verified without
following back references, or reading inodes.

This makes compression significantly faster by reducing the amount of
data that needs to be checksummed.  It will also allow much faster
raid management code in general.

The checksums are indexed by a key with a fixed objectid (a magic value
in ctree.h) and offset set to the starting byte of the extent.  This
allows us to copy the checksum items into the fsync log tree directly (or
any other tree), without having to invent a second format for them.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-12-08 16:58:54 -05:00
Jonathan Corbet
218d11a8b0 Fix a race condition in FASYNC handling
Changeset a238b790d5 (Call fasync()
functions without the BKL) introduced a race which could leave
file->f_flags in a state inconsistent with what the underlying
driver/filesystem believes.  Revert that change, and also fix the same
races in ioctl_fioasync() and ioctl_fionbio().

This is a minimal, short-term fix; the real fix will not involve the
BKL.

Reported-by: Oleg Nesterov <oleg@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: stable@kernel.org
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-12-05 15:35:10 -08:00
Linus Torvalds
bbeba4c35c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/bdev
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/bdev:
  [PATCH] fix bogus argument of blkdev_put() in pktcdvd
  [PATCH 2/2] documnt FMODE_ constants
  [PATCH 1/2] kill FMODE_NDELAY_NOW
  [PATCH] clean up blkdev_get a little bit
  [PATCH] Fix block dev compat ioctl handling
  [PATCH] kill obsolete temporary comment in swsusp_close()
2008-12-04 21:45:44 -08:00
Dave Chinner
576a488a27 [XFS] Fix hang after disallowed rename across directory quota domains
When project quota is active and is being used for directory tree
quota control, we disallow rename outside the current directory
tree. This requires a check to be made after all the inodes
involved in the rename are locked. We fail to unlock the inodes
correctly if we disallow the rename when the target is outside the
current directory tree. This results in a hang on the next access
to the inodes involved in failed rename.

Reported-by: Arkadiusz Miskiewicz <arekm@maven.pl>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Tested-by: Arkadiusz Miskiewicz <arekm@maven.pl>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
2008-12-05 15:39:13 +11:00
Christoph Hellwig
fd4ce1acd0 [PATCH 1/2] kill FMODE_NDELAY_NOW
Update FMODE_NDELAY before each ioctl call so that we can kill the
magic FMODE_NDELAY_NOW.  It would be even better to do this directly
in setfl(), but for that we'd need to have FMODE_NDELAY for all files,
not just block special files.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-12-04 04:22:57 -05:00
Christoph Hellwig
ebbefc011e [PATCH] clean up blkdev_get a little bit
The way the bd_claim for the FMODE_EXCL case is implemented is rather
confusing.  Clean it up to the most logical style.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-12-04 04:22:56 -05:00
Linus Torvalds
2433c41789 Merge branch 'for-2.6.28' of git://linux-nfs.org/~bfields/linux
* 'for-2.6.28' of git://linux-nfs.org/~bfields/linux:
  NLM: client-side nlm_lookup_host() should avoid matching on srcaddr
  nfsd: use of unitialized list head on error exit in nfs4recover.c
  Add a reference to sunrpc in svc_addsock
  nfsd: clean up grace period on early exit
2008-12-03 16:40:37 -08:00
Linus Torvalds
51eaaa6776 Merge branch 'linux-next' of git://git.infradead.org/ubifs-2.6
* 'linux-next' of git://git.infradead.org/ubifs-2.6:
  UBIFS: pre-allocate bulk-read buffer
  UBIFS: do not allocate too much
  UBIFS: do not print scary memory allocation warnings
  UBIFS: allow for gaps when dirtying the LPT
  UBIFS: fix compilation warnings
  MAINTAINERS: change UBI/UBIFS git tree URLs
  UBIFS: endian handling fixes and annotations
  UBIFS: remove printk
2008-12-02 15:56:55 -08:00
Chris Mason
c99e905c94 Btrfs: Fix sparse endian warnings in struct-funcs.c
The btrfs macros to access individual struct members on disk were
sending the same variable to functions that expected different types
of endianness.  This fix explicitly creates a variable of the correct
type instead of abusing a single variable for mixed purposes.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-12-02 11:18:37 -05:00
Chris Mason
2a7108ad89 Btrfs: rev the disk format for the inode compat and csum selection changes
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-12-02 09:58:02 -05:00
Chris Mason
4022abf449 Btrfs: delete unused function: btrfs_invalidate_dcache_root
Snapshot and subvolume creation no longer need this helper.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-12-02 09:57:03 -05:00
Josef Bacik
607d432da0 Btrfs: add support for multiple csum algorithms
This patch gives us the space we will need in order to have different csum
algorithims at some point in the future.  We save the csum algorithim type
in the superblock, and use those instead of define's.

Signed-off-by: Josef Bacik <jbacik@redhat.com>
2008-12-02 07:17:45 -05:00
Josef Bacik
c6e2bac1a5 Btrfs: fix panic on error during mount
This needs to be applied on top of my previous patches, but is needed for more
than just my new stuff.  We're going to the wrong label when we have an error,
we try to stop the workers, but they are started below all of this code.  This
fixes it so we go to the right error label and not panic when we fail one of
these cases.

Signed-off-by: Josef Bacik <jbacik@redhat.com>
2008-12-02 06:36:10 -05:00
Josef Bacik
f2b636e80d Btrfs: add support for compat flags to btrfs
This adds the necessary disk format for handling compatibility flags
in the future to handle disk format changes.  We have a compat_flags,
compat_ro_flags and incompat_flags set for the super block.  Compat
flags will be to hold the features that are compatible with older
versions of btrfs, compat_ro flags have features that are compatible
with older versions of btrfs if the fs is mounted read only, and
incompat_flags has features that are incompatible with older versions
of btrfs.  This also axes the compat_flags field for the inode and
just makes the flags field a 64bit field, and changes the root item
flags field to 64bit.

Signed-off-by: Josef Bacik <jbacik@redhat.com>
2008-12-02 06:36:08 -05:00
Christoph Hellwig
7a865e8ac3 Btrfs: btrfs: pass void __user * to btrfs_ioctl_clone_range
Cleans the code up a little and also avoids a sparse warning due to the
incorrect cast in the current version of the code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2008-12-02 09:52:24 -05:00
Christoph Hellwig
4bcabaa30a Btrfs: clean up btrfs_ioctl a little bit
Provide a void __user *argp pointer so that we can avoid duplicating
the cast for various sub-command calls.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2008-12-02 06:36:08 -05:00
Christoph Hellwig
97288f2c71 Btrfs: corret fmode_t annotations
Make sure to propagate fmode_t properly and use the right constants for
it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2008-12-02 06:36:09 -05:00
Christoph Hellwig
6e430f94e5 Btrfs: fix shadowed variable declarations
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-12-02 06:36:09 -05:00
Christoph Hellwig
b2950863c6 Btrfs: make things static and include the right headers
Shut up various sparse warnings about symbols that should be either
static or have their declarations in scope.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2008-12-02 09:54:17 -05:00
Sage Weil
1ffa4f426c Btrfs: remove unneeded btrfs_start_delalloc_inodes call
It is called by btrfs_sync_fs.

Signed-off-by: Sage Weil <sage@newdream.net>
2008-12-02 09:53:09 -05:00
Sage Weil
6e3ad88729 Btrfs: remove unneeded total_trans
Remove unneeded debugging sanity check.  It gets corrupted anyway when
multiple btrfs file systems are mounted, throwing bad warnings along the
way.

Signed-off-by: Sage Weil <sage@newdream.net>
2008-12-02 06:36:10 -05:00
Christoph Hellwig
641f5219f2 Btrfs: sparse lock verification annotations for wait_on_state
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-12-02 06:36:10 -05:00
Randy Dunlap
0380155363 ntfs: don't fool kernel-doc
kernel-doc handles macros now (it has for quite some time), so change the
ntfs_debug() macro's kernel-doc to be just before the macro instead of
before a phony function prototype.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Anton Altaparmakov <aia21@cantab.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-12-01 19:55:25 -08:00
Davide Libenzi
7ef9964e6d epoll: introduce resource usage limits
It has been thought that the per-user file descriptors limit would also
limit the resources that a normal user can request via the epoll
interface.  Vegard Nossum reported a very simple program (a modified
version attached) that can make a normal user to request a pretty large
amount of kernel memory, well within the its maximum number of fds.  To
solve such problem, default limits are now imposed, and /proc based
configuration has been introduced.  A new directory has been created,
named /proc/sys/fs/epoll/ and inside there, there are two configuration
points:

  max_user_instances = Maximum number of devices - per user

  max_user_watches   = Maximum number of "watched" fds - per user

The current default for "max_user_watches" limits the memory used by epoll
to store "watches", to 1/32 of the amount of the low RAM.  As example, a
256MB 32bit machine, will have "max_user_watches" set to roughly 90000.
That should be enough to not break existing heavy epoll users.  The
default value for "max_user_instances" is set to 128, that should be
enough too.

This also changes the userspace, because a new error code can now come out
from EPOLL_CTL_ADD (-ENOSPC).  The EMFILE from epoll_create() was already
listed, so that should be ok.

[akpm@linux-foundation.org: use get_current_user()]
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: <stable@kernel.org>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Reported-by: Vegard Nossum <vegardno@ifi.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-12-01 19:55:24 -08:00
Liu Hui
ce397c0616 Btrfs: Fix cow semantic in run_delalloc_nocow()
The file preallocation code reversed the logic to force nodatacow.
This fixes it.
2008-12-01 20:31:40 -05:00
Mark Fasheh
d6b58f89f7 ocfs2: fix regression in ocfs2_read_blocks_sync()
We're panicing in ocfs2_read_blocks_sync() if a jbd-managed buffer is seen.
At first glance, this seems ok but in reality it can happen. My test case
was to just run 'exorcist'. A struct inode is being pushed out of memory but
is then re-read at a later time, before the buffer has been checkpointed by
jbd. This causes a BUG to be hit in ocfs2_read_blocks_sync().

Reviewed-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-12-01 14:46:58 -08:00
Coly Li
07d9a3954a ocfs2: fix return value set in init_dlmfs_fs()
In init_dlmfs_fs(), if calling kmem_cache_create() failed, the code will use return value from
calling bdi_init(). The correct behavior should be set status as -ENOMEM before going to "bail:".

Signed-off-by: Coly Li <coyli@suse.de>
Acked-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-12-01 14:46:55 -08:00
David Teigland
07f9eebcdf ocfs2: fix wake_up in unlock_ast
In ocfs2_unlock_ast(), call wake_up() on lockres before releasing
the spin lock on it.  As soon as the spin lock is released, the
lockres can be freed.

Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-12-01 14:46:45 -08:00
David Teigland
66f502a416 ocfs2: initialize stack_user lvbptr
The locking_state dump, ocfs2_dlm_seq_show, reads the lvb on locks where it
has not yet been initialized by a lock call.

Signed-off-by: David Teigland <teigland@redhat.com>
Acked-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-12-01 14:46:39 -08:00
Coly Li
3b5da0189c ocfs2: comments typo fix
This patch fixes two typos in comments of ocfs2.

Signed-off-by: Coly Li <coyli@suse.de>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-12-01 14:46:31 -08:00
Linus Torvalds
8e36a5d6ad Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
  [CIFS] fix regression in cifs_write_begin/cifs_write_end
2008-11-30 14:04:02 -08:00
Jan Kara
52b19ac993 udf: Fix BUG_ON() in destroy_inode()
udf_clear_inode() can leave behind buffers on mapping's i_private list (when
we truncated preallocation). Call invalidate_inode_buffers() so that the list
is properly cleaned-up before we return from udf_clear_inode(). This is ugly
and suggest that we should cleanup preallocation earlier than in clear_inode()
but currently there's no such call available since drop_inode() is called under
inode lock and thus is unusable for disk operations.

Signed-off-by: Jan Kara <jack@suse.cz>
2008-11-27 17:38:28 +01:00
Jeff Layton
a98ee8c1c7 [CIFS] fix regression in cifs_write_begin/cifs_write_end
The conversion to write_begin/write_end interfaces had a bug where we
were passing a bad parameter to cifs_readpage_worker. Rather than
passing the page offset of the start of the write, we needed to pass the
offset of the beginning of the page. This was reliably showing up as
data corruption in the fsx-linux test from LTP.

It also became evident that this code was occasionally doing unnecessary
read calls. Optimize those away by using the PG_checked flag to indicate
that the unwritten part of the page has been initialized.

CC: Nick Piggin <npiggin@suse.de>
Acked-by: Dave Kleikamp <shaggy@us.ibm.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2008-11-26 19:32:33 +00:00
Chuck Lever
a8d82d9b95 NLM: client-side nlm_lookup_host() should avoid matching on srcaddr
Since commit c98451bd, the loop in nlm_lookup_host() unconditionally
compares the host's h_srcaddr field to the incoming source address.
For client-side nlm_host entries, both are always AF_UNSPEC, so this
check is unnecessary.

Since commit 781b61a6, which added support for AF_INET6 addresses to
nlm_cmp_addr(), nlm_cmp_addr() now returns FALSE for AF_UNSPEC
addresses, which causes nlm_lookup_host() to create a fresh nlm_host
entry every time it is called on the client.

These extra entries will eventually expire once the server is
unmounted, so the impact of this regression, introduced with lockd
IPv6 support in 2.6.28, should be minor.

We could fix this by adding an arm in nlm_cmp_addr() for AF_UNSPEC
addresses, but really, nlm_lookup_host() shouldn't be matching on the
srcaddr field for client-side nlm_host lookups.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2008-11-24 13:29:07 -06:00
J. Bruce Fields
e4625eb826 nfsd: use of unitialized list head on error exit in nfs4recover.c
Thanks to Matthew Dodd for this bug report:

A file label issue while running SELinux in MLS mode provoked the
following bug, which is a result of use before init on a 'struct list_head'.

In nfsd4_list_rec_dir() if the call to dentry_open() fails the 'goto
out' skips INIT_LIST_HEAD() which results in the normally improbable
case where list_entry() returns NULL.

Trace follows.

NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
SELinux:  Context unconfined_t:object_r:var_lib_nfs_t:s0 is not valid
(left unmapped).
type=1400 audit(1227298063.609:282): avc:  denied  { read } for
pid=1890 comm="rpc.nfsd" name="v4recovery" dev=dm-0 ino=148726
scontext=system_u:system_r:nfsd_t:s0-s15:c0.c1023
tcontext=system_u:object_r:unlabeled_t:s15:c0.c1023 tclass=dir
BUG: unable to handle kernel NULL pointer dereference at 00000004
IP: [<c050894e>] list_del+0x6/0x60
*pde = 0d9ce067 *pte = 00000000
Oops: 0000 [#1] SMP
Modules linked in: nfsd lockd nfs_acl auth_rpcgss exportfs autofs4
sunrpc ipv6 dm_multipath scsi_dh ppdev parport_pc sg parport floppy
ata_piix pata_acpi ata_generic libata pcnet32 i2c_piix4 mii pcspkr
i2c_core dm_snapshot dm_zero dm_mirror dm_log dm_mod BusLogic sd_mod
scsi_mod crc_t10dif ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd [last
unloaded: microcode]

Pid: 1890, comm: rpc.nfsd Not tainted (2.6.27.5-37.fc9.i686 #1)
EIP: 0060:[<c050894e>] EFLAGS: 00010217 CPU: 0
EIP is at list_del+0x6/0x60
EAX: 00000000 EBX: 00000000 ECX: 00000000 EDX: cd99e480
ESI: cf9caed8 EDI: 00000000 EBP: cf9caebc ESP: cf9caeb8
  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
Process rpc.nfsd (pid: 1890, ti=cf9ca000 task=cf4de580 task.ti=cf9ca000)
Stack: 00000000 cf9caef0 d0a9f139 c0496d04 d0a9f217 fffffff3 00000000
00000000
        00000000 00000000 cf32b220 00000000 00000008 00000801 cf9caefc
d0a9f193
        00000000 cf9caf08 d0a9b6ea 00000000 cf9caf1c d0a874f2 cf9c3004
00000008
Call Trace:
  [<d0a9f139>] ? nfsd4_list_rec_dir+0xf3/0x13a [nfsd]
  [<c0496d04>] ? do_path_lookup+0x12d/0x175
  [<d0a9f217>] ? load_recdir+0x0/0x26 [nfsd]
  [<d0a9f193>] ? nfsd4_recdir_load+0x13/0x34 [nfsd]
  [<d0a9b6ea>] ? nfs4_state_start+0x2a/0xc5 [nfsd]
  [<d0a874f2>] ? nfsd_svc+0x51/0xff [nfsd]
  [<d0a87f2d>] ? write_svc+0x0/0x1e [nfsd]
  [<d0a87f48>] ? write_svc+0x1b/0x1e [nfsd]
  [<d0a87854>] ? nfsctl_transaction_write+0x3a/0x61 [nfsd]
  [<c04b6a4e>] ? sys_nfsservctl+0x116/0x154
  [<c04975c1>] ? putname+0x24/0x2f
  [<c04975c1>] ? putname+0x24/0x2f
  [<c048d49f>] ? do_sys_open+0xad/0xb7
  [<c048d337>] ? filp_close+0x50/0x5a
  [<c048d4eb>] ? sys_open+0x1e/0x26
  [<c0403cca>] ? syscall_call+0x7/0xb
  [<c064007b>] ? init_cyrix+0x185/0x490
  =======================
Code: 75 e1 8b 53 08 8d 4b 04 8d 46 04 e8 75 00 00 00 8b 53 10 8d 4b 0c
8d 46 0c e8 67 00 00 00 5b 5e 5f 5d c3 90 90 55 89 e5 53 89 c3 <8b> 40
04 8b 00 39 d8 74 16 50 53 68 3e d6 6f c0 6a 30 68 78 d6
EIP: [<c050894e>] list_del+0x6/0x60 SS:ESP 0068:cf9caeb8
---[ end trace a89c4ad091c4ad53 ]---

Cc: Matthew N. Dodd <Matthew.Dodd@spart.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2008-11-24 10:36:09 -06:00
J. Bruce Fields
2c5e76158f nfsd: clean up grace period on early exit
If nfsd was shut down before the grace period ended, we could end up
with a freed object still on grace_list.  Thanks to Jeff Moyer for
reporting the resulting list corruption warnings.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Tested-by: Jeff Moyer <jmoyer@redhat.com>
2008-11-24 10:12:48 -06:00
Artem Bityutskiy
3477d20465 UBIFS: pre-allocate bulk-read buffer
To avoid memory allocation failure during bulk-read, pre-allocate
a bulk-read buffer, so that if there is only one bulk-reader at
a time, it would just use the pre-allocated buffer and would not
do any memory allocation. However, if there are more than 1 bulk-
reader, then only one reader would use the pre-allocated buffer,
while the other reader would allocate the buffer for itself.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2008-11-21 18:59:33 +02:00
Artem Bityutskiy
6c0c42cdfd UBIFS: do not allocate too much
Bulk-read allocates 128KiB or more using kmalloc. The allocation
starts failing often when the memory gets fragmented. UBIFS still
works fine in this case, because it falls-back to standard
(non-optimized) read method, though. This patch teaches bulk-read
to allocate exactly the amount of memory it needs, instead of
allocating 128KiB every time.

This patch is also a preparation to the further fix where we'll
have a pre-allocated bulk-read buffer as well. For example, now
the @bu object is prepared in 'ubifs_bulk_read()', so we could
path either pre-allocated or allocated information to
'ubifs_do_bulk_read()' later. Or teaching 'ubifs_do_bulk_read()'
not to allocate 'bu->buf' if it is already there.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2008-11-21 18:59:25 +02:00
Artem Bityutskiy
39ce81ce71 UBIFS: do not print scary memory allocation warnings
Bulk-read allocates a lot of memory with 'kmalloc()', and when it
is/gets fragmented 'kmalloc()' fails with a scarry warning. But
because bulk-read is just an optimization, UBIFS keeps working fine.
Supress the warning by passing __GFP_NOWARN option to 'kmalloc()'.

This patch also introduces a macro for the magic 128KiB constant.
This is just neater.

Note, this is not really fixes the problem we had, but just hides
the warnings. The further patches fix the problem.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2008-11-21 18:59:16 +02:00
Linus Torvalds
0cb39aa0ac Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
  [CIFS] Do not attempt to close invalidated file handles
  [CIFS] fix check for dead tcon in smb_init
2008-11-20 13:14:16 -08:00
Steve French
ddb4cbfc53 [CIFS] Do not attempt to close invalidated file handles
If a connection with open file handles has gone down
and come back up and reconnected without reopening
the file handle yet, do not attempt to send an SMB close
request for this handle in cifs_close.  We were
checking for the connection being invalid in cifs_close
but since the connection may have been reconnected
we also need to check whether the file handle
was marked invalid (otherwise we could close the
wrong file handle by accident).

Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2008-11-20 20:14:13 +00:00
Josef Bacik
ea6a478ed9 Btrfs: Fix for lockdep warnings with alloc_mutex and pinned_mutex
This the lockdep complaint by having a different mutex to gaurd caching the
block group, so you don't end up with this backwards dependancy.  Thank you,

Signed-off-by: Josef Bacik <jbacik@redhat.com>
2008-11-20 12:16:16 -05:00