Commit graph

15218 commits

Author SHA1 Message Date
Christoph Hellwig
746cd1e7e4 block: use blkdev_issue_discard in blk_ioctl_discard
blk_ioctl_discard duplicates large amounts of code from blkdev_issue_discard,
the only difference between the two is that blkdev_issue_discard needs to
send a barrier discard request and blk_ioctl_discard a non-barrier one,
and blk_ioctl_discard needs to wait on the request.  To facilitates this
add a flags argument to blkdev_issue_discard to control both aspects of the
behaviour.  This will be very useful later on for using the waiting
funcitonality for other callers.

Based on an earlier patch from Matthew Wilcox <matthew@wil.cx>.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-09-14 08:24:53 +02:00
Nikanth Karthikesan
a9327cac44 Seperate read and write statistics of in_flight requests
Currently, there is a single in_flight counter measuring the number of
requests in the request_queue. But some monitoring tools would like to
know how many read requests and write requests are in progress. Split the
current in_flight counter into two seperate counters for read and write.

This information is exported as a sysfs attribute, as changing the
currently available stat files would break the existing tools.

Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-09-14 08:24:52 +02:00
Linus Torvalds
86d710146f Merge git://git.linux-nfs.org/projects/trondmy/nfs-2.6
* git://git.linux-nfs.org/projects/trondmy/nfs-2.6: (87 commits)
  NFSv4: Disallow 'mount -t nfs4 -overs=2' and 'mount -t nfs4 -overs=3'
  NFS: Allow the "nfs" file system type to support NFSv4
  NFS: Move details of nfs4_get_sb() to a helper
  NFS: Refactor NFSv4 text-based mount option validation
  NFS: Mount option parser should detect missing "port="
  NFS: out of date comment regarding O_EXCL above nfs3_proc_create()
  NFS: Handle a zero-length auth flavor list
  SUNRPC: Ensure that sunrpc gets initialised before nfs, lockd, etc...
  nfs: fix compile error in rpc_pipefs.h
  nfs: Remove reference to generic_osync_inode from a comment
  SUNRPC: cache must take a reference to the cache detail's module on open()
  NFS: Use the DNS resolver in the mount code.
  NFS: Add a dns resolver for use with NFSv4 referrals and migration
  SUNRPC: Fix a typo in cache_pipefs_files
  nfs: nfs4xdr: optimize low level decoding
  nfs: nfs4xdr: get rid of READ_BUF
  nfs: nfs4xdr: simplify decode_exchange_id by reusing decode_opaque_inline
  nfs: nfs4xdr: get rid of COPYMEM
  nfs: nfs4xdr: introduce decode_sessionid helper
  nfs: nfs4xdr: introduce decode_verifier helper
  ...
2009-09-11 16:39:11 -07:00
Theodore Ts'o
7ad9bb651f ext4: Fix initalization of s_flex_groups
The s_flex_groups array should have been initialized using atomic_add
to sum up the free counts from the block groups that make up a
flex_bg.  By using atomic_set, the value of the s_flex_groups array
was set to the values of the last block group in the flex_bg.  

The impact of this bug is that the block and inode allocation
algorithms might not pick the best flex_bg for new allocation.

Thanks to Damien Guibouret for pointing out this problem!

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-11 16:51:28 -04:00
Linus Torvalds
774a694f8c Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (64 commits)
  sched: Fix sched::sched_stat_wait tracepoint field
  sched: Disable NEW_FAIR_SLEEPERS for now
  sched: Keep kthreads at default priority
  sched: Re-tune the scheduler latency defaults to decrease worst-case latencies
  sched: Turn off child_runs_first
  sched: Ensure that a child can't gain time over it's parent after fork()
  sched: enable SD_WAKE_IDLE
  sched: Deal with low-load in wake_affine()
  sched: Remove short cut from select_task_rq_fair()
  sched: Turn on SD_BALANCE_NEWIDLE
  sched: Clean up topology.h
  sched: Fix dynamic power-balancing crash
  sched: Remove reciprocal for cpu_power
  sched: Try to deal with low capacity, fix update_sd_power_savings_stats()
  sched: Try to deal with low capacity
  sched: Scale down cpu_power due to RT tasks
  sched: Implement dynamic cpu_power
  sched: Add smt_gain
  sched: Update the cpu_power sum during load-balance
  sched: Add SD_PREFER_SIBLING
  ...
2009-09-11 13:23:18 -07:00
Trond Myklebust
ab3bbaa8b2 Merge branch 'nfs-for-2.6.32' 2009-09-11 14:59:37 -04:00
Linus Torvalds
a9c86d4259 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound-2.6: (377 commits)
  ASoC: au1x: PSC-AC97 bugfixes
  ALSA: dummy - Increase MAX_PCM_SUBSTREAMS to 128
  ALSA: dummy - Add debug proc file
  ALSA: Add const prefix to proc helper functions
  ALSA: Re-export snd_pcm_format_name() function
  ALSA: hda - Use auto model for HP laptops with ALC268 codec
  ALSA: cs46xx - Fix minimum period size
  ASoC: Fix WM835x Out4 capture enumeration
  ALSA: Remove unneeded ifdef from sound/core.h
  ALSA: Remove struct snd_monitor_file from public sound/core.h
  ASoC: Remove unuused hw_read_t
  sound: oxygen: work around MCE when changing volume
  ALSA: dummy - Fake buffer allocations
  ALSA: hda/realtek: Added support for CLEVO M540R subsystem, 6 channel + digital
  ASoC: fix pxa2xx-ac97.c breakage
  ALSA: dummy - Fix the timer calculation in systimer mode
  ALSA: dummy - Add more description
  ALSA: dummy - Better jiffies handling
  ALSA: dummy - Support high-res timer mode
  ALSA: Release v1.0.21
  ...
2009-09-11 09:19:35 -07:00
Linus Torvalds
a12e4d304c Merge branch 'writeback' of git://git.kernel.dk/linux-2.6-block
* 'writeback' of git://git.kernel.dk/linux-2.6-block:
  writeback: check for registered bdi in flusher add and inode dirty
  writeback: add name to backing_dev_info
  writeback: add some debug inode list counters to bdi stats
  writeback: get rid of pdflush completely
  writeback: switch to per-bdi threads for flushing data
  writeback: move dirty inodes from super_block to backing_dev_info
  writeback: get rid of generic_sync_sb_inodes() export
2009-09-11 09:17:05 -07:00
Linus Torvalds
f6f7919086 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: (57 commits)
  binfmt_elf: fix PT_INTERP bss handling
  TPM: Fixup boot probe timeout for tpm_tis driver
  sysfs: Add labeling support for sysfs
  LSM/SELinux: inode_{get,set,notify}secctx hooks to access LSM security context information.
  VFS: Factor out part of vfs_setxattr so it can be called from the SELinux hook for inode_setsecctx.
  KEYS: Add missing linux/tracehook.h #inclusions
  KEYS: Fix default security_session_to_parent()
  Security/SELinux: includecheck fix kernel/sysctl.c
  KEYS: security_cred_alloc_blank() should return int under all circumstances
  IMA: open new file for read
  KEYS: Add a keyctl to install a process's session keyring on its parent [try #6]
  KEYS: Extend TIF_NOTIFY_RESUME to (almost) all architectures [try #6]
  KEYS: Do some whitespace cleanups [try #6]
  KEYS: Make /proc/keys use keyid not numread as file position [try #6]
  KEYS: Add garbage collection for dead, revoked and expired keys. [try #6]
  KEYS: Flag dead keys to induce EKEYREVOKED [try #6]
  KEYS: Allow keyctl_revoke() on keys that have SETATTR but not WRITE perm [try #6]
  KEYS: Deal with dead-type keys appropriately [try #6]
  CRED: Add some configurable debugging [try #6]
  selinux: Support for the new TUN LSM hooks
  ...
2009-09-11 08:55:49 -07:00
Miklos Szeredi
723590ed52 splice: update mtime and atime on files
Splice should update the modification and access times on regular
files just like read and write. Not updating mtime will confuse
backup tools, etc...

This patch only adds the time updates for regular files.  For pipes
and other special files that splice touches the need for updating the
times is less clear.  Let's discuss and fix that separately.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-09-11 14:34:33 +02:00
Jens Axboe
1f98a13f62 bio: first step in sanitizing the bio->bi_rw flag testing
Get rid of any functions that test for these bits and make callers
use bio_rw_flagged() directly. Then it is at least directly apparent
what variable and flag they check.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-09-11 14:33:31 +02:00
Jens Axboe
500b067c5e writeback: check for registered bdi in flusher add and inode dirty
Also a debugging aid. We want to catch dirty inodes being added to
backing devices that don't do writeback.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-09-11 09:20:26 +02:00
Jens Axboe
d993831fa7 writeback: add name to backing_dev_info
This enables us to track who does what and print info. Its main use
is catching dirty inodes on the default_backing_dev_info, so we can
fix that up.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-09-11 09:20:26 +02:00
Jens Axboe
d0bceac747 writeback: get rid of pdflush completely
It is now unused, so kill it off.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-09-11 09:20:25 +02:00
Jens Axboe
03ba3782e8 writeback: switch to per-bdi threads for flushing data
This gets rid of pdflush for bdi writeout and kupdated style cleaning.
pdflush writeout suffers from lack of locality and also requires more
threads to handle the same workload, since it has to work in a
non-blocking fashion against each queue. This also introduces lumpy
behaviour and potential request starvation, since pdflush can be starved
for queue access if others are accessing it. A sample ffsb workload that
does random writes to files is about 8% faster here on a simple SATA drive
during the benchmark phase. File layout also seems a LOT more smooth in
vmstat:

 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 0  1      0 608848   2652 375372    0    0     0 71024  604    24  1 10 48 42
 0  1      0 549644   2712 433736    0    0     0 60692  505    27  1  8 48 44
 1  0      0 476928   2784 505192    0    0     4 29540  553    24  0  9 53 37
 0  1      0 457972   2808 524008    0    0     0 54876  331    16  0  4 38 58
 0  1      0 366128   2928 614284    0    0     4 92168  710    58  0 13 53 34
 0  1      0 295092   3000 684140    0    0     0 62924  572    23  0  9 53 37
 0  1      0 236592   3064 741704    0    0     4 58256  523    17  0  8 48 44
 0  1      0 165608   3132 811464    0    0     0 57460  560    21  0  8 54 38
 0  1      0 102952   3200 873164    0    0     4 74748  540    29  1 10 48 41
 0  1      0  48604   3252 926472    0    0     0 53248  469    29  0  7 47 45

where vanilla tends to fluctuate a lot in the creation phase:

 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 1  1      0 678716   5792 303380    0    0     0 74064  565    50  1 11 52 36
 1  0      0 662488   5864 319396    0    0     4   352  302   329  0  2 47 51
 0  1      0 599312   5924 381468    0    0     0 78164  516    55  0  9 51 40
 0  1      0 519952   6008 459516    0    0     4 78156  622    56  1 11 52 37
 1  1      0 436640   6092 541632    0    0     0 82244  622    54  0 11 48 41
 0  1      0 436640   6092 541660    0    0     0     8  152    39  0  0 51 49
 0  1      0 332224   6200 644252    0    0     4 102800  728    46  1 13 49 36
 1  0      0 274492   6260 701056    0    0     4 12328  459    49  0  7 50 43
 0  1      0 211220   6324 763356    0    0     0 106940  515    37  1 10 51 39
 1  0      0 160412   6376 813468    0    0     0  8224  415    43  0  6 49 45
 1  1      0  85980   6452 886556    0    0     4 113516  575    39  1 11 54 34
 0  2      0  85968   6452 886620    0    0     0  1640  158   211  0  0 46 54

A 10 disk test with btrfs performs 26% faster with per-bdi flushing. A
SSD based writeback test on XFS performs over 20% better as well, with
the throughput being very stable around 1GB/sec, where pdflush only
manages 750MB/sec and fluctuates wildly while doing so. Random buffered
writes to many files behave a lot better as well, as does random mmap'ed
writes.

A separate thread is added to sync the super blocks. In the long term,
adding sync_supers_bdi() functionality could get rid of this thread again.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-09-11 09:20:25 +02:00
Jens Axboe
66f3b8e2e1 writeback: move dirty inodes from super_block to backing_dev_info
This is a first step at introducing per-bdi flusher threads. We should
have no change in behaviour, although sb_has_dirty_inodes() is now
ridiculously expensive, as there's no easy way to answer that question.
Not a huge problem, since it'll be deleted in subsequent patches.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-09-11 09:20:25 +02:00
Jens Axboe
d8a8559cd7 writeback: get rid of generic_sync_sb_inodes() export
This adds two new exported functions:

- writeback_inodes_sb(), which only attempts to writeback dirty inodes on
  this super_block, for WB_SYNC_NONE writeout.
- sync_inodes_sb(), which writes out all dirty inodes on this super_block
  and also waits for the IO to complete.

Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-09-11 09:20:25 +02:00
Andreas Schlick
1f7bebb9e9 ext4: Always set dx_node's fake_dirent explicitly.
When ext4_dx_add_entry() has to split an index node, it has to ensure that
name_len of dx_node's fake_dirent is also zero, because otherwise e2fsck
won't recognise it as an intermediate htree node and consider the htree to
be corrupted.

Signed-off-by: Andreas Schlick <schlick@lavabit.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-10 23:16:07 -04:00
Theodore Ts'o
0e3d2a6313 ext4: Fix async commit mode to be safe by using a barrier
Previously the journal_async_commit mount option was equivalent to
using barrier=0 (and just as unsafe).  This patch fixes it so that we
eliminate the barrier before the commit block (by not using ordered
mode), and explicitly issuing an empty barrier bio after writing the
commit block.  Because of the journal checksum, it is safe to do this;
if the journal blocks are not all written before a power failure, the
checksum in the commit block will prevent the last transaction from
being replayed.

Using the fs_mark benchmark, using journal_async_commit shows a 50%
improvement:

FSUse%        Count         Size    Files/sec     App Overhead
     8         1000        10240         30.5            28242

vs.

FSUse%        Count         Size    Files/sec     App Overhead
     8         1000        10240         45.8            28620


Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-11 09:30:12 -04:00
James Morris
a3c8b97396 Merge branch 'next' into for-linus 2009-09-11 08:04:49 +10:00
Theodore Ts'o
71290b368a ext4: Don't update superblock write time when filesystem is read-only
This avoids updating the superblock write time when we are mounting
the root file system read/only but we need to replay the journal; at
that point, for people who are east of GMT and who make their clock
tick in localtime for Windows bug-for-bug compatibility, and this will
cause e2fsck to complain and force a full file system check.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-10 17:31:04 -04:00
Alex Elder
a4872d5b6a Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6 2009-09-10 14:33:56 -05:00
Takashi Iwai
3827119e20 Merge branch 'topic/soundcore-preclaim' into for-linus
* topic/soundcore-preclaim:
  sound: make OSS device number claiming optional and schedule its removal
  sound: request char-major-* module aliases for missing OSS devices
  chrdev: implement __[un]register_chrdev()
2009-09-10 15:33:04 +02:00
Roland McGrath
9f0ab4a3f0 binfmt_elf: fix PT_INTERP bss handling
In fs/binfmt_elf.c, load_elf_interp() calls padzero() for .bss even if
the PT_LOAD has no PROT_WRITE and no .bss.  This generates EFAULT.

Here is a small test case.  (Yes, there are other, useful PT_INTERP
which have only .text and no .data/.bss.)

	----- ptinterp.S
	_start: .globl _start
		 nop
		 int3
	-----
	$ gcc -m32 -nostartfiles -nostdlib -o ptinterp ptinterp.S
	$ gcc -m32 -Wl,--dynamic-linker=ptinterp -o hello hello.c
	$ ./hello
	Segmentation fault  # during execve() itself

	After applying the patch:
	$ ./hello
	Trace trap  # user-mode execution after execve() finishes

If the ELF headers are actually self-inconsistent, then dying is fine.
But having no PROT_WRITE segment is perfectly normal and correct if
there is no segment with p_memsz > p_filesz (i.e. bss).  John Reiser
suggested checking for PROT_WRITE in the bss logic.  I think it makes
most sense to simply apply the bss logic only when there is bss.

This patch looks less trivial than it is due to some reindentation.
It just moves the "if (last_bss > elf_bss) {" test up to include the
partial-page bss logic as well as the more-pages bss logic.

Reported-by: John Reiser <jreiser@bitwagon.com>
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>
2009-09-10 20:11:12 +10:00
Artem Bityutskiy
873a64c762 UBIFS: amend commentaries
This patch amends and nicifies commentaries in file.c, as well as
fixes some spelling problems.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2009-09-10 12:06:47 +03:00
Artem Bityutskiy
0dcd18e407 UBIFS: check ubifs_scan error codes better
The 'ubifs_scan()' function returns -EUCLEAN if something is corrupted
and recovery is needed, otherwise it returns other error codes. However,
in few places UBIFS does not check the error codes and runs recovery.
This patch changes this behavior and makes UBIFS start recovery only
on -EUCLEAN errors.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Reviewed-by: Adrian Hunter <Adrian.Hunter@nokia.com>
2009-09-10 12:06:47 +03:00
Artem Bityutskiy
348709bad3 UBIFS: do not print scary error messages needlessly
At the moment UBIFS print large and scary error messages and
flash dumps in case of nearly any corruption, even if it is
a recoverable corruption. For example, if the master node is
corrupted, ubifs_scan() prints error dumps, then UBIFS recovers
just fine and goes on.

This patch makes UBIFS print scary error messages only in
real cases, which are not recoverable. It adds 'quiet' argument
to the 'ubifs_scan()' function, so the caller may ask 'ubi_scan()'
not to print error messages if the caller is able to do recovery.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Reviewed-by: Adrian Hunter <Adrian.Hunter@nokia.com>
2009-09-10 12:06:47 +03:00
Artem Bityutskiy
e3c3efc243 UBIFS: add inode size debugging check
Add one more check to UBIFS - a check that makes sure that there
are no data nodes beyond inode size. And few commantaries fixes
along the line.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Reviewed-by: Adrian Hunter <Adrian.Hunter@nokia.com>
2009-09-10 09:58:11 +03:00
Aneesh Kumar K.V
08c3a81338 ext4: Clarify the locking details in mballoc
We don't need to take the alloc_sem lock when we are adding new
groups, since mballoc won't see the new group added until we bump
sbi->s_groups_count.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
2009-09-09 23:50:17 -04:00
Aneesh Kumar K.V
f41c075053 ext4: check for need init flag in ext4_mb_load_buddy
We should check for need init flag with the group's alloc_sem held, to
make sure while we are loading the buddy cache and holding a reference
to it, a file system resize can't add new blocks to same group.

The patch also drops the need init flag check in
ext4_mb_regular_allocator() because doing the check without holding
alloc_sem is racy.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
2009-09-09 23:34:50 -04:00
Aneesh Kumar K.V
b6a758ec3a ext4: move ext4_mb_init_group() function earlier in the mballoc.c
This moves the function around so that it can be called from
ext4_mb_load_buddy().

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-09 23:47:46 -04:00
Linus Torvalds
526b678093 Merge branch 'lookup-permissions-cleanup'
* lookup-permissions-cleanup:
  jffs2/jfs/xfs: switch over to 'check_acl' rather than 'permission()'
  ext[234]: move over to 'check_acl' permission model
  shmfs: use 'check_acl' instead of 'permission'
  Make 'check_acl()' a first-class filesystem op
  Simplify exec_permission_lite(), part 3
  Simplify exec_permission_lite() further
  Simplify exec_permission_lite() logic
  Do not call 'ima_path_check()' for each path component
2009-09-09 20:04:54 -07:00
Roland McGrath
752015d1b0 binfmt_elf: fix PT_INTERP bss handling
In fs/binfmt_elf.c, load_elf_interp() calls padzero() for .bss even if
the PT_LOAD has no PROT_WRITE and no .bss.  This generates EFAULT.

Here is a small test case.  (Yes, there are other, useful PT_INTERP
which have only .text and no .data/.bss.)

	----- ptinterp.S
	_start: .globl _start
		 nop
		 int3
	-----
	$ gcc -m32 -nostartfiles -nostdlib -o ptinterp ptinterp.S
	$ gcc -m32 -Wl,--dynamic-linker=ptinterp -o hello hello.c
	$ ./hello
	Segmentation fault  # during execve() itself

	After applying the patch:
	$ ./hello
	Trace trap  # user-mode execution after execve() finishes

If the ELF headers are actually self-inconsistent, then dying is fine.
But having no PROT_WRITE segment is perfectly normal and correct if
there is no segment with p_memsz > p_filesz (i.e. bss).  John Reiser
suggested checking for PROT_WRITE in the bss logic.  I think it makes
most sense to simply apply the bss logic only when there is bss.

This patch looks less trivial than it is due to some reindentation.
It just moves the "if (last_bss > elf_bss) {" test up to include the
partial-page bss logic as well as the more-pages bss logic.

Reported-by: John Reiser <jreiser@bitwagon.com>
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-09 20:03:47 -07:00
Frank Mayhar
91ac6f4331 ext4: Make non-journal fsync work properly
Teach ext4_write_inode() and ext4_do_update_inode() about non-journal
mode:  If we're not using a journal, ext4_write_inode() now calls
ext4_do_update_inode() (after getting the iloc via ext4_get_inode_loc())
with a new "do_sync" parameter.  If that parameter is nonzero _and_ we're
not using a journal, ext4_do_update_inode() calls sync_dirty_buffer()
instead of ext4_handle_dirty_metadata().

This problem was found in power-fail testing, checking the amount of
loss of files and blocks after a power failure when using fsync() and
when not using fsync().  It turned out that using fsync() was actually
worse than not doing so, possibly because it increased the likelihood
that the inodes would remain unflushed and would therefore be lost at
the power failure.

Signed-off-by: Frank Mayhar <fmayhar@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-09 22:33:47 -04:00
Theodore Ts'o
fe188c0e08 ext4: Assure that metadata blocks are written during fsync in no journal mode
When there is no journal present, we must attach buffer heads
associated with extent tree and indirect blocks to the inode's
mapping->private_list via mark_buffer_dirty_inode() so that
ext4_sync_file() --- which is called to service fsync() and
fdatasync() system calls --- can write out the inode's metadata blocks
by calling sync_mapping_buffers().

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-12 13:41:55 -04:00
Theodore Ts'o
c7acb4c166 ext4: Use bforget() in no journal mode for ext4_journal_{forget,revoke}()
When ext4 is using a journal, a metadata block which is deallocated
must be passed into the journal layer so it can be dropped from the
current transaction and/or revoked.  This is done by calling the
functions ext4_journal_forget() and ext4_journal_revoke(), which call
jbd2_journal_forget(), and jbd2_journal_revoke(), respectively.

Since the jbd2_journal_forget() and jbd2_journal_revoke() call
bforget(), if ext4 is not using a journal, ext4_journal_forget() and
ext4_journal_revoke() must call bforget() to avoid a dirty metadata
block overwriting a block after it has been reallocated and reused for
another inode's data block.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-09 21:32:41 -04:00
David P. Quigley
ddd29ec659 sysfs: Add labeling support for sysfs
This patch adds a setxattr handler to the file, directory, and symlink
inode_operations structures for sysfs. The patch uses hooks introduced in the
previous patch to handle the getting and setting of security information for
the sysfs inodes. As was suggested by Eric Biederman the struct iattr in the
sysfs_dirent structure has been replaced by a structure which contains the
iattr, secdata and secdata length to allow the changes to persist in the event
that the inode representing the sysfs_dirent is evicted. Because sysfs only
stores this information when a change is made all the optional data is moved
into one dynamically allocated field.

This patch addresses an issue where SELinux was denying virtd access to the PCI
configuration entries in sysfs. The lack of setxattr handlers for sysfs
required that a single label be assigned to all entries in sysfs. Granting virtd
access to every entry in sysfs is not an acceptable solution so fine grained
labeling of sysfs is required such that individual entries can be labeled
appropriately.

[sds:  Fixed compile-time warnings, coding style, and setting of inode security init flags.]

Signed-off-by: David P. Quigley <dpquigl@tycho.nsa.gov>
Signed-off-by: Stephen D. Smalley <sds@tycho.nsa.gov>
Signed-off-by: James Morris <jmorris@namei.org>
2009-09-10 10:11:29 +10:00
David P. Quigley
b1ab7e4b2a VFS: Factor out part of vfs_setxattr so it can be called from the SELinux hook for inode_setsecctx.
This factors out the part of the vfs_setxattr function that performs the
setting of the xattr and its notification. This is needed so the SELinux
implementation of inode_setsecctx can handle the setting of the xattr while
maintaining the proper separation of layers.

Signed-off-by: David P. Quigley <dpquigl@tycho.nsa.gov>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
2009-09-10 10:11:22 +10:00
Christoph Hellwig
4734d401d4 xfs: use correct log reservation when handling ENOSPC in xfs_create
We added the ENOSPC handling patch in xfs_create just after it got mered
with xfs_mkdir.  Change the log reservation to the variable for either
the create or mkdir value so it does the right thing if get here for creating
a directory.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2009-09-09 18:19:02 -05:00
Steven Whitehouse
2b88f7c535 GFS2: Remove unused sysfs file
The /sys/fs/gfs2/<fsname>/lock_module/id file has been unused for
some time now, so we can remove it. We still accept the mount option
though, as userspace still sends that.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-09-09 15:59:35 +01:00
Trond Myklebust
2ecda72b49 NFSv4: Disallow 'mount -t nfs4 -overs=2' and 'mount -t nfs4 -overs=3'
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-09-08 19:50:07 -04:00
Chuck Lever
764302ccb8 NFS: Allow the "nfs" file system type to support NFSv4
When mounting an "nfs" type file system, recognize "v4," "vers=4," or
"nfsvers=4" mount options, and convert the file system to "nfs4" under
the covers.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
[trondmy: fixed up binary mount code so it sets the 'version' field too]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-09-08 19:50:03 -04:00
Chuck Lever
a6fe23be90 NFS: Move details of nfs4_get_sb() to a helper
Clean up: Refactor nfs4_get_sb() to allow its guts to be invoked by
nfs_get_sb().

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-09-08 19:50:00 -04:00
Chuck Lever
7630c852e1 NFS: Refactor NFSv4 text-based mount option validation
Clean up: Refactor the part of nfs4_validate_mount_options() that
handles text-based options, so we can call it from the NFSv2/v3
option validation function.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-09-08 19:49:57 -04:00
Chuck Lever
4cfd74fc99 NFS: Mount option parser should detect missing "port="
The meaning of not specifying the "port=" mount option is different
for "-t nfs" and "-t nfs4" mounts.  The default port value for
NFSv2/v3 mounts is 0, but the default for NFSv4 mounts is 2049.

To support "-t nfs -o vers=4", the mount option parser must detect
when "port=" is missing so that the correct default port value can be
set depending on which NFS version is requested.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-09-08 19:49:47 -04:00
Harshula Jayasuriya
dbab8360ed NFS: out of date comment regarding O_EXCL above nfs3_proc_create()
Hi Trond,

Recently we were observing the behaviour difference between a 2.4.x and
2.6.x kernel with respect to O_EXCL. A comment from 2.4.x era, "For now,
we don't implement O_EXCL." seems inaccurate in TOT.

If so, here's a patch to remove the comment.

This patch is against:
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6

Signed-off-by: Harshula Jayasuriya <harshula@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-09-08 19:49:33 -04:00
Linus Torvalds
18f4c64477 jffs2/jfs/xfs: switch over to 'check_acl' rather than 'permission()'
This avoids an indirect call in the VFS for each path component lookup.

Well, at least as long as you own the directory in question, and the ACL
check is unnecessary.

Reviewed-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-08 11:09:04 -07:00
Linus Torvalds
1d5ccd1c42 ext[234]: move over to 'check_acl' permission model
Don't implement per-filesystem 'extX_permission()' functions that have
to be called for every path component operation, and instead just expose
the actual ACL checking so that the VFS layer can now do it for us.

Reviewed-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-08 11:09:04 -07:00
Linus Torvalds
5909ccaa30 Make 'check_acl()' a first-class filesystem op
This is stage one in flattening out the callchains for the common
permission testing.  Rather than have most filesystem implement their
own inode->i_op->permission function that just calls back down to the
VFS layers 'generic_permission()' with the per-filesystem ACL checking
function, the filesystem can just expose its 'check_acl' function
directly, and let the VFS layer do everything for it.

This is all just preparatory - no filesystem actually enables this yet.

Reviewed-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-08 11:07:44 -07:00
Linus Torvalds
cb9179ead0 Simplify exec_permission_lite(), part 3
Don't call down to the generic inode_permission() function just to
call the inode-specific permission function - just do it directly.

The generic inode_permission() code does things like checking MAY_WRITE
and devcgroup_inode_permission(), neither of which are relevant for the
light pathname walk permission checks (we always do just MAY_EXEC, and
the inode is never a special device).

Reviewed-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-08 11:07:44 -07:00