Commit graph

10332 commits

Author SHA1 Message Date
Pekka Enberg
232087cb73 cifs: don't use GFP_KERNEL with GFP_NOFS
GFP_KERNEL and GFP_NOFS are mutually exclusive. If you combine them, you end up
with plain GFP_KERNEL which can deadlock in cases where you really want
GFP_NOFS.

Cc: Steve French <sfrench@samba.org>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2008-09-22 22:23:56 +00:00
Theodore Ts'o
f702ba0fd7 ext4: Don't use 'struct dentry' for internal lookups
This is a port of a patch from Linus which fixes a 200+ byte stack
usage problem in ext4_get_parent().

It's more efficient to pass down only the actual parts of the dentry
that matter: the parent inode and the name, instead of allocating a
struct dentry on the stack.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-09-22 15:21:01 -04:00
Theodore Ts'o
914258bf2c ext4/jbd2: Avoid WARN() messages when failing to write to the superblock
This fixes some very common warnings reported by kerneloops.org

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-10-06 21:35:40 -04:00
Steven Whitehouse
719ee34467 GFS2: high time to take some time over atime
Until now, we've used the same scheme as GFS1 for atime. This has failed
since atime is a per vfsmnt flag, not a per fs flag and as such the
"noatime" flag was not getting passed down to the filesystems. This
patch removes all the "special casing" around atime updates and we
simply use the VFS's atime code.

The net result is that GFS2 will now support all the same atime related
mount options of any other filesystem on a per-vfsmnt basis. We do lose
the "lazy atime" updates, but we gain "relatime". We could add lazy
atime to the VFS at a later date, if there is a requirement for that
variant still - I suspect relatime will be enough.

Also we lose about 100 lines of code after this patch has been applied,
and I have a suspicion that it will speed things up a bit, even when
atime is "on". So it seems like a nice clean up as well.

From a user perspective, everything stays the same except the loss of
the per-fs atime quantum tweekable (ought to be per-vfsmnt at the very
least, and to be honest I don't think anybody ever used it) and that a
number of options which were ignored before now work correctly.

Please let me know if you've got any comments. I'm pushing this out
early so that you can all see what my plans are.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2008-09-18 13:53:59 +01:00
Steven Whitehouse
37ec89e83c GFS2: The war on bloat
The following patch shrinks the gfs2_args structure which is embedded in
every GFS2 superblock. It cuts down the size of the options to a single
unsigned int (the 13 bits of bitfields will be rounded up to that size
by the compiler) from the current 11 unsigned ints. So on x86 thats 44
bytes shrinking to 4 bytes, in each and every GFS2 superblock.

Signed-off-by: Steven Whitehouse <swhitho@redhat.com>
2008-09-18 13:49:32 +01:00
Alexander Beregalov
7424bac82f UBIFS: fix printk format warnings
fs/ubifs/dir.c:428: warning: format '%llu' expects type 'long long
unsigned int', but argument 5 has type 'long unsigned int'

fs/ubifs/debug.c:541: warning: format '%llu' expects type 'long long
unsigned int', but argument 2 has type 'long unsigned int'

Signed-off-by: Alexander Beregalov <a.beregalov@gmail.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2008-09-18 09:57:57 +03:00
Adrian Hunter
6e14968c86 UBIFS: remove incorrect assert
The assert was not valid because one of the variables
'taken_empty_lebs' has transient values out of sync
with the other variables.

Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2008-09-17 14:47:09 +03:00
Adrian Hunter
6dcfac4f13 UBIFS: TNC / GC race fixes
- update GC sequence number if any nodes may have been moved
even if GC did not finish the LEB
- don't ignore error return when reading

Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2008-09-17 14:23:26 +03:00
Sebastian Siewior
0855f310df UBIFS: create the name of the background thread in every case
If the ubifs partition is mounted RO and then remounted RW we end
up with no thread name in ubifs_remount_rw() and the thread appears
nameless.

Signed-off-by: Sebastian Siewior <bigeasy@linutronix.de>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2008-09-17 10:07:56 +03:00
Lachlan McIlroy
2fd6f6ec64 [XFS] Don't do I/O beyond eof when unreserving space
When unreserving space with boundaries that are not block aligned we round
up the start and round down the end boundaries and then use this function,
xfs_zero_remaining_bytes(), to zero the parts of the blocks that got
dropped during the rounding. The problem is we don't consider if these
blocks are beyond eof. Worse still is if we encounter delayed allocations
beyond eof we will try to use the magic delayed allocation block number as
a real block number. If the file size is ever extended to expose these
blocks then we'll go through xfs_zero_eof() to zero them anyway.

SGI-PV: 983683

SGI-Modid: xfs-linux-melb:xfs-kern:32055a

Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
2008-09-17 16:52:50 +10:00
Lachlan McIlroy
e1f5dbd707 [XFS] Fix use-after-free with buffers
We have a use-after-free issue where log completions access buffers via
the buffer log item and the buffer has already been freed. Fix this by
taking a reference on the buffer when attaching the buffer log item and
release the hold when the buffer log item is detached and we no longer
need the buffer. Also create a new function xfs_buf_item_free() to combine
some common code.

SGI-PV: 985757

SGI-Modid: xfs-linux-melb:xfs-kern:32025a

Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
2008-09-17 16:52:13 +10:00
David Chinner
f9114eba1e [XFS] Prevent lockdep false positives when locking two inodes.
If we call xfs_lock_two_inodes() to grab both the iolock and the ilock,
then drop the ilocks on both inodes, then grab them again (as
xfs_swap_extents() does) then lockdep will report a locking order problem.
This is a false positive.

To avoid this, disallow xfs_lock_two_inodes() fom locking both inode locks
at once - force calers to make two separate calls. This means that nested
dropping and regaining of the ilocks will retain the same lockdep subclass
and so lockdep will not see anything wrong with this code.

SGI-PV: 986238

SGI-Modid: xfs-linux-melb:xfs-kern:31999a

Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Peter Leckie <pleckie@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
2008-09-17 16:51:21 +10:00
David Chinner
b5b8c9acd5 [XFS] Fix barrier status change detection.
The current code in xlog_iodone() uses the wrong macro to check if the
barrier has been cleared due to an EOPNOTSUPP error form the lower layer.

SGI-PV: 986143

SGI-Modid: xfs-linux-melb:xfs-kern:31984a

Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Nathaniel W. Turner <nate@houseofnate.net>
Signed-off-by: Peter Leckie <pleckie@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
2008-09-17 16:50:50 +10:00
Lachlan McIlroy
364f358a73 [XFS] Prevent direct I/O from mapping extents beyond eof
With the help from some tracing I found that we try to map extents beyond
eof when doing a direct I/O read. It appears that the way to inform the
generic direct I/O path (ie do_direct_IO()) that we have breached eof is
to return an unmapped buffer from xfs_get_blocks_direct(). This will cause
do_direct_IO() to jump to the hole handling code where is will check for
eof and then abort.

This problem was found because a direct I/O read was trying to map beyond
eof and was encountering delayed allocations. The delayed allocations
beyond eof are speculative allocations and they didn't get converted when
the direct I/O flushed the file because there was only enough space in the
current AG to convert and write out the dirty pages within eof. Note that
xfs_iomap_write_allocate() wont necessarily convert all the delayed
allocation passed to it - it will return after allocating the first extent
- so if the delayed allocation extends beyond eof then it will stay that
way.

SGI-PV: 983683

SGI-Modid: xfs-linux-melb:xfs-kern:31929a

Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
2008-09-17 16:50:14 +10:00
Christoph Hellwig
6efdf28177 [XFS] Fix regression introduced by remount fixup
Logically we would return an error in xfs_fs_remount code to prevent users
from believing they might have changed mount options using remount which
can't be changed.

But unfortunately mount(8) adds all options from mtab and fstab to the
mount arguments in some cases so we can't blindly reject options, but have
to check for each specified option if it actually differs from the
currently set option and only reject it if that's the case.

Until that is implemented we return success for every remount request, and
silently ignore all options that we can't actually change.

SGI-PV: 985710

SGI-Modid: xfs-linux-melb:xfs-kern:31908a

Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
2008-09-17 16:49:33 +10:00
Lachlan McIlroy
31bd61f2bb [XFS] Move memory allocations for log tracing out of the critical path
Memory allocations for log->l_grant_trace and iclog->ic_trace are done on
demand when the first event is logged. In xlog_state_get_iclog_space() we
call xlog_trace_iclog() under a spinlock and allocating memory here can
cause us to sleep with a spinlock held and deadlock the system.

For the log grant tracing we use KM_NOSLEEP but that means we can lose
trace entries. Since there is no locking to serialize the log grant
tracing we could race and have multiple allocations and leak memory.

So move the allocations to where we initialize the log/iclog structures.
Use KM_NOFS to avoid recursing into the filesystem and drop log->l_trace
since it's not even used.

SGI-PV: 983738

SGI-Modid: xfs-linux-melb:xfs-kern:31896a

Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
2008-09-17 16:45:37 +10:00
Steve French
388e57b275 [CIFS] use common code for turning off ATTR_READONLY in cifs_unlink
We already have a cifs_set_file_info function that can flip DOS
attribute bits. Have cifs_unlink call it to handle turning ATTR_HIDDEN
on and ATTR_READONLY off when an unlink attempt returns -EACCES.

This also removes a level of indentation from cifs_unlink.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2008-09-16 23:50:58 +00:00
Jeff Layton
5f0319a790 cifs: clean up variables in cifs_unlink
Change parameters to cifs_unlink to match the ones used in the generic
VFS. Add some local variables to cut down on the amount of struct
dereferencing that needs to be done, and eliminate some unneeded NULL
pointer checks on the parent directory inode. Finally, rename pTcon
to "tcon" to more closely match standard kernel coding style.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2008-09-16 20:14:34 +00:00
Abhijith Das
acd2c8aa02 GFS2: GFS2 will panic if you misspell any mount options
The gfs2 superblock pointer is NULL after a failed mount. When control
eventually goes to gfs2_kill_sb, we dereference this NULL pointer. This
patch ensures that the gfs2 superblock pointer is not NULL before being
dereferenced in gfs2_kill_sb.

Signed-off-by:   Abhijith Das <adas@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2008-09-15 16:08:32 +01:00
Bob Peterson
acb57a3652 GFS2: Direct IO write at end of file error
This patch fixes a problem whereby a direct_io write doesn't fall
back to buffered write properly at end of file.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2008-09-15 10:31:54 +01:00
Andrew Morton
8d99f83b94 rescan_partitions(): make device capacity errors non-fatal
Herton Krzesinski reports that the error-checking changes in
04ebd4aee5 ("block/ioctl.c and
fs/partition/check.c: check value returned by add_partition") cause his
buggy USB camera to no longer mount.  "The camera is an Olympus X-840.
The original issue comes from the camera itself: its format program
creates a partition with an off by one error".

Buggy devices happen.  It is better for the kernel to warn and to proceed
with the mount.

Reported-by: Herton Ronaldo Krzesinski <herton@mandriva.com.br>
Cc: Abdel Benamrouche <draconux@gmail.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: David Brownell <david-b@pacbell.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-09-13 14:41:52 -07:00
Hugh Dickins
d7a3e4959c mm: ifdef Quicklists in /proc/meminfo
A "Quicklists:          0 kB" line has just started appearing in
/proc/meminfo, but most architectures (including x86) don't have
them configured, so #ifdef it, like the highmem lines.

And those architectures which do have quicklists configured are
using them for page tables: so let's place it next to PageTables.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-09-13 14:41:51 -07:00
Eric Sesterhenn
1558182f65 bfs: fix Lockdep warning
This fixes:

  =============================================
  [ INFO: possible recursive locking detected ]
  2.6.27-rc5-00283-g70bb089 #68
  ---------------------------------------------
  touch/6855 is trying to acquire lock:
   (&info->bfs_lock){--..}, at: [<c02262f5>] bfs_delete_inode+0x9e/0x18c

  but task is already holding lock:
   (&info->bfs_lock){--..}, at: [<c0226c00>] bfs_create+0x45/0x187

  other info that might help us debug this:
  2 locks held by touch/6855:
   #0:  (&type->i_mutex_dir_key#5){--..}, at: [<c018ad13>] do_filp_open+0x10b/0x62f
   #1:  (&info->bfs_lock){--..}, at: [<c0226c00>] bfs_create+0x45/0x187

  stack backtrace:
  Pid: 6855, comm: touch Not tainted 2.6.27-rc5-00283-g70bb089 #68
   [<c013e769>] validate_chain+0x458/0x9f4
   [<c013bece>] ? trace_hardirqs_off+0xb/0xd
   [<c013f36b>] __lock_acquire+0x666/0x6e0
   [<c013f440>] lock_acquire+0x5b/0x77
   [<c02262f5>] ? bfs_delete_inode+0x9e/0x18c
   [<c06aab74>] mutex_lock_nested+0xbc/0x234
   [<c02262f5>] ? bfs_delete_inode+0x9e/0x18c
   [<c02262f5>] ? bfs_delete_inode+0x9e/0x18c
   [<c02262f5>] bfs_delete_inode+0x9e/0x18c
   [<c0226257>] ? bfs_delete_inode+0x0/0x18c
   [<c01925e1>] generic_delete_inode+0x94/0xfe
   [<c019265d>] generic_drop_inode+0x12/0x12f
   [<c0191b7e>] iput+0x4b/0x4e
   [<c0226d1e>] bfs_create+0x163/0x187
   [<c0188b42>] vfs_create+0xa6/0x114
   [<c018adb5>] do_filp_open+0x1ad/0x62f
   [<c0107cdc>] ? native_sched_clock+0x82/0x96
   [<c06ac309>] ? _spin_unlock+0x27/0x3c
   [<c019379e>] ? alloc_fd+0xbf/0xc9
   [<c06ae2f4>] ? sub_preempt_count+0x9d/0xab
   [<c019379e>] ? alloc_fd+0xbf/0xc9
   [<c0180391>] do_sys_open+0x42/0xb8
   [<c041d564>] ? trace_hardirqs_on_thunk+0xc/0x10
   [<c0180449>] sys_open+0x1e/0x26
   [<c01038bd>] sysenter_do_call+0x12/0x31
   =======================

The problem is that we don't unlock the bfs->lock mutex before calling
iput (we do in the other cases).

Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de>
Cc: Tigran Aivazian <tigran@aivazian.fsnet.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-09-13 14:41:51 -07:00
Alexey Dobriyan
665020c35e proc: more debugging for "already registered" case
Print parent directory name as well.

The aim is to catch non-creation of parent directory when proc_mkdir will
return NULL and all subsequent registrations go directly in /proc instead
of intended directory.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
[ Fixed insane printk string while at it.  - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-09-13 14:41:50 -07:00
Eric Sandeen
730c213c79 ext4: use percpu data structures for lg_prealloc_list
lg_prealloc_list seems to cry out for a per-cpu data structure; on a large
smp system I think this should be better.  I've lightly tested this change
on a 4-cpu system.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Acked-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-09-13 15:23:29 -04:00
Theodore Ts'o
8eea80d52b ext4: Renumber EXT4_IOC_MIGRATE
Pick an ioctl number for EXT4_IOC_MIGRATE that won't conflict with
other ext4 ioctl's.  Since there haven't been any major userspace
users of this ioctl, we can afford to change this now, to avoid
potential problems later.

Also, reorder the ioctl numbers in ext4.h to avoid this sort of
mistake in the future.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-09-13 19:54:35 -04:00
Aneesh Kumar K.V
4db46fc266 ext4: hook the ext3 migration interface to the EXT4_IOC_SETFLAGS ioctl
This patch hooks the ext3 to ext4 migrate interface to
EXT4_IOC_SETFLAGS ioctl. The userspace interface is via chattr +e.  We
only allow setting extent flags.  Clearing extent flag (migrating from
ext4 to ext3) is not supported.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-10-08 23:34:06 -04:00
Aneesh Kumar K.V
2a43a87800 ext4: elevate write count for migrate ioctl
The migrate ioctl writes to the filsystem, so we need to elevate the
write count.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-09-13 12:52:26 -04:00
Linus Torvalds
e2858ce3ed Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-udf-2.6
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-udf-2.6:
  udf: add llseek method
  udf: Fix error paths in udf_new_inode()
  udf: Fix lock inversion between iprune_mutex and alloc_mutex (v2)
2008-09-11 08:40:11 -07:00
Tao Ma
0e116227a0 ocfs2: Fix a bug in direct IO read.
ocfs2 will become read-only if we try to read the bytes which pass
the end of i_size. This can be easily reproduced by following steps:
1. mkfs a ocfs2 volume with bs=4k cs=4k and nosparse.
2. create a small file(say less than 100 bytes) and we will create the file
   which is allocated 1 cluster.
3. read 8196 bytes from the kernel using O_DIRECT which exceeds the limit.
4. The ocfs2 volume becomes read-only and dmesg shows:
OCFS2: ERROR (device sda13): ocfs2_direct_IO_get_blocks:
Inode 66010 has a hole at block 1
File system is now read-only due to the potential of on-disk corruption.
Please run fsck.ocfs2 once the file system is unmounted.

So suppress the ERROR message.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-09-10 01:44:08 -07:00
Linus Torvalds
b975dee381 Merge branch 'linux-next' of git://git.infradead.org/~dedekind/ubifs-2.6
* 'linux-next' of git://git.infradead.org/~dedekind/ubifs-2.6:
  UBIFS: make minimum fanout 3
  UBIFS: fix division by zero
  UBIFS: amend f_fsid
  UBIFS: fill f_fsid
  UBIFS: improve statfs reporting even more
  UBIFS: introduce LEB overhead
  UBIFS: add forgotten gc_idx_lebs component
  UBIFS: fix assertion
  UBIFS: improve statfs reporting
  UBIFS: remove incorrect index space check
  UBIFS: push empty flash hack down
  UBIFS: do not update min_idx_lebs in stafs
  UBIFS: allow for racing between GC and TNC
  UBIFS: always read hashed-key nodes under TNC mutex
  UBIFS: fix zero-length truncations
2008-09-09 11:52:12 -07:00
Chuck Lever
af904deaf6 NFS: Restore missing hunk in NFS mount option parser
Automounter maps can contain mount options valid for other NFS
implementations but not for Linux.  The Linux automounter uses the
mount command's "-s" command line option ("s" for "sloppy") so that
mount requests containing such options are not rejected.

Commit f45663ce5f attempted to address a
known regression with text-based NFS mount option parsing.  Unrecognized
mount options would cause mount requests to fail, even if the "-s"
option was used on the mount command line.

Unfortunately, this commit was not complete as submitted.  It adds a
new mount option, "sloppy".  But it is missing a hunk, so it now allows
NFS mounts with unrecognized mount options, even if the "sloppy" option
is not present.  This could be a problem if a required critical mount
option such as "sync" is misspelled, for example, and is considered a
regression from 2.6.26.

This patch restores the missing hunk.  Now, the default behavior of
text-based NFS mount options is as before: any unrecognized mount option
will cause the mount to fail.

Please include this in 2.6.27-rc.

Thanks to Neil Brown for reporting this.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-09-08 15:35:19 -07:00
Christoph Hellwig
5c89468c12 udf: add llseek method
UDF currently doesn't set a llseek method for regular files, which
means it will fall back to default_llseek.  This means no one can seek
beyond 2 Gigabytes on udf, and that there's not protection vs
the i_size updates from writers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2008-09-08 20:31:04 +02:00
Li Zefan
7ee1ec4ca3 ext4: add missing unlock in ext4_check_descriptors() on error path
If there group descriptors are corrupted we need unlock the block
group lock before returning from the function; else we will oops when
freeing a spinlock which is still being held.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-09-08 10:47:19 -04:00
Theodore Ts'o
05496769e5 jbd2: clean up how the journal device name is printed
Calculate the journal device name once and stash it away in the
journal_s structure.  This avoids needing to call bdevname()
everywhere and reduces stack usage by not needing to allocate an
on-stack buffer.  In addition, we eliminate the '/' that can appear in
device names (e.g. "cciss/c0d0p9" --- see kernel bugzilla #11321) that
can cause problems when creating proc directory names, and include the
inode number to support ocfs2 which creates multiple journals with
different inode numbers.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-09-16 14:36:17 -04:00
Alexey Dobriyan
899fc1a4cf ext4: fix #11321: create /proc/ext4/*/stats more carefully
ext4 creates per-suberblock directory in /proc/ext4/ . Name used as
basis is taken from bdevname, which, surprise, can contain slash.

However, proc while allowing to use proc_create("a/b", parent) form of
PDE creation, assumes that parent/a was already created.

bdevname in question is 'cciss/c0d0p9', directory is not created and all
this stuff goes directly into /proc (which is real bug).

Warning comes when _second_ partition is mounted.

http://bugzilla.kernel.org/show_bug.cgi?id=11321

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-09-14 10:21:33 -04:00
Frederic Bohe
c62a11fd95 Update flex_bg free blocks and free inodes counters when resizing.
This fixes a bug which prevented the newly created inodes after a
resize from being used on filesystems with flex_bg.

Signed-off-by: Frederic Bohe <frederic.bohe@bull.net>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-09-08 10:20:24 -04:00
Eric Sandeen
9d9f177572 ext4: Avoid printk floods in the face of directory corruption
Note: some people thinks this represents a security bug, since it
might make the system go away while it is printing a large number of
console messages, especially if a serial console is involved.  Hence,
it has been assigned CVE-2008-3528, but it requires that the attacker
either has physical access to your machine to insert a USB disk with a
corrupted filesystem image (at which point why not just hit the power
button), or is otherwise able to convince the system administrator to
mount an arbitrary filesystem image (at which point why not just
include a setuid shell or world-writable hard disk device file or some
such).  Me, I think they're just being silly. --tytso

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: linux-ext4@vger.kernel.org
Cc: Eugene Teo <eugeneteo@kernel.sg>
2008-10-09 11:15:52 -04:00
Aneesh Kumar K.V
cf17fea657 ext4: Properly update i_disksize.
With delayed allocation we use i_data_sem to update i_disksize.  We need
to update i_disksize only if the new size specified is greater than the
current value and we need to make sure we don't race with other
i_disksize update.  With delayed allocation we will switch to the
write_begin function for non-delayed allocation if we are low on free
blocks.  This means the write_begin function for non-delayed allocation
also needs to use the same locking.

We also need to check and update i_disksize even if the new size is less
that inode.i_size because of delayed allocation.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-09-13 13:06:18 -04:00
Aneesh Kumar K.V
ae4d537211 ext4: truncate block allocated on a failed ext4_write_begin
For blocksize < pagesize we need to remove blocks that got allocated in
block_write_begin() if we fail with ENOSPC for later blocks.
block_write_begin() internally does this if it allocated pages locally.
This makes sure we don't have blocks outside inode.i_size during ENOSPC.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-09-13 13:10:25 -04:00
Aneesh Kumar K.V
df22291ff0 ext4: Retry block allocation if we have free blocks left
When we truncate files, the meta-data blocks released are not reused
untill we commit the truncate transaction.  That means delayed get_block
request will return ENOSPC even if we have free blocks left.  Force a
journal commit and retry block allocation if we get ENOSPC with free
blocks left.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-09-08 23:05:34 -04:00
Aneesh Kumar K.V
166348dd37 ext4: Don't add the inode to journal handle until after the block is allocated
Make sure we don't add the inode to the journal handle until after the
block allocation, so that a journal commit will not include the inode in
case of block allocation failure.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-09-08 23:08:40 -04:00
Aneesh Kumar K.V
68629f29c6 ext4: Fix ext4 nomballoc allocator for ENOSPC
We run into ENOSPC error on nonmballoc ext4, even when there is free blocks
on the filesystem.

The patch includes two changes:

a) Set reservation to NULL if we trying to allocate near group_target_block
from the goal group if the free block in the group is less than windows.
This should give us a better chance to allocate near group_target_block.
This also ensures that if we are not allocating near group_target_block
then we don't trun off reservation. This should enable us to allocate
with reservation from other groups that have large free blocks count.

b) we don't need to check the window size if the block reservation is off.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-09-08 23:09:17 -04:00
Aneesh Kumar K.V
5c79161689 ext4: Signed arithmetic fix
This patch converts some usage of ext4_fsblk_t to s64.  This is needed
so that some of the sign conversion works as expected in if loops.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-10-08 23:12:24 -04:00
Aneesh Kumar K.V
79f0be8d2e ext4: Switch to non delalloc mode when we are low on free blocks count.
The delayed allocation code allocates blocks during writepages(), which
can not handle block allocation failures.  To deal with this, we switch
away from delayed allocation mode when we are running low on free
blocks.  This also allows us to avoid needing to reserve a large number
of meta-data blocks in case all of the requested blocks are
discontiguous.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-10-08 23:13:30 -04:00
Aneesh Kumar K.V
6bc6e63fcd ext4: Add percpu dirty block accounting.
This patch adds dirty block accounting using percpu_counters.  Delayed
allocation block reservation is now done by updating dirty block
counter.  In a later patch we switch to non delalloc mode if the
filesystem free blocks is greater than 150% of total filesystem dirty
blocks

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao<cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-10-10 09:39:00 -04:00
Aneesh Kumar K.V
030ba6bc67 ext4: Retry block reservation
During block reservation if we don't have enough blocks left, retry
block reservation with smaller block counts.  This makes sure we try
fallocate and DIO with smaller request size and don't fail early.  The
delayed allocation reservation cannot try with smaller block count. So
retry block reservation to handle temporary disk full conditions.  Also
print free blocks details if we fail block allocation during writepages.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-09-08 23:14:50 -04:00
Aneesh Kumar K.V
a30d542a00 ext4: Make sure all the block allocation paths reserve blocks
With delayed allocation we need to make sure block are reserved before
we attempt to allocate them. Otherwise we get block allocation failure
(ENOSPC) during writepages which cannot be handled. This would mean
silent data loss (We do a printk stating data will be lost). This patch
updates the DIO and fallocate code path to do block reservation before
block allocation. This is needed to make sure parallel DIO and fallocate
request doesn't take block out of delayed reserve space.

When free blocks count go below a threshold we switch to a slow patch
which looks at other CPU's accumulated percpu counter values.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-10-09 10:56:23 -04:00
David Woodhouse
e17c6d5616 Introduce HAVE_AOUT symbol to remove hard-coded arch list for BINFMT_AOUT
HAVE_AOUT doesn't quite do the same thing as the recently removed
ARCH_SUPPORTS_AOUT config option. That was set even on platforms where
binfmt_aout isn't supported, although it's not entirely clear why.

So it's best just to introduce a new symbol, handled consistently with
other similar HAVE_xxx symbols; with a simple 'select' in the arch Kconfig.

Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
2008-09-06 19:30:22 +01:00
David Woodhouse
6b213e1bc2 Remove redundant CONFIG_ARCH_SUPPORTS_AOUT
We don't need this any more; arguably we never really did.

Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
2008-09-06 19:30:20 +01:00