Commit graph

30039 commits

Author SHA1 Message Date
Stefan Behrens
7a9e998768 Btrfs: make the scrub page array dynamically allocated
With the modified design (in order to support the devive replace
procedure) it is necessary to alloc the page array dynamically.
The reason is that pages are reused. At first a page is used for
the bio to read the data from the filesystem, then the same page
is reused for the bio that writes the data to the target disk.
Since the read process and the write process are completely
decoupled, this requires a new concept of refcounts and get/put
functions for pages, and it requires to use newly created pages
for each read bio which are freed after the write operation
is finished.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-12 17:15:30 -05:00
Stefan Behrens
a36cf8b893 Btrfs: remove the block device pointer from the scrub context struct
The block device is removed from the scrub context state structure.
The scrub code as it is used for the device replace procedure reads
the source data from whereever it is optimal. The source device might
even be gone (disconnected, for instance due to a hardware failure).
Or the drive can be so faulty so that the device replace procedure
tries to avoid access to the faulty source drive as much as possible,
and only if all other mirrors are damaged, as a last resort, the
source disk is accessed.
The modified scrub code operates as if it would handle the source
drive and thereby generates an exact copy of the source disk on the
target disk, even if the source disk is not present at all. Therefore
the block device pointer to the source disk is removed in the scrub
context struct and moved into the lower level scope of scrub_bio,
fixup and page structures where the block device context is known.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-12 17:15:30 -05:00
Stefan Behrens
d9d181c1ba Btrfs: rename the scrub context structure
The device replace procedure makes use of the scrub code. The scrub
code is the most efficient code to read the allocated data of a disk,
i.e. it reads sequentially in order to avoid disk head movements, it
skips unallocated blocks, it uses read ahead mechanisms, and it
contains all the code to detect and repair defects.
This commit is a first preparation step to adapt the scrub code to
be shareable for the device replace procedure.
The block device will be removed from the scrub context state
structure in a later step. It used to be the source block device.
The scrub code as it is used for the device replace procedure reads
the source data from whereever it is optimal. The source device might
even be gone (disconnected, for instance due to a hardware failure).
Or the drive can be so faulty so that the device replace procedure
tries to avoid access to the faulty source drive as much as possible,
and only if all other mirrors are damaged, as a last resort, the
source disk is accessed.
The modified scrub code operates as if it would handle the source
drive and thereby generates an exact copy of the source disk on the
target disk, even if the source disk is not present at all. Therefore
the block device pointer to the source disk is removed in a later
patch, and therefore the context structure is renamed (this is the
goal of the current patch) to reflect that no source block device
scope is there anymore.

Summary:
This first preparation step consists of a textual substitution of the
term "dev" to the term "ctx" whereever the scrub context is used.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-12 17:15:29 -05:00
Liu Bo
d25628bdd6 Btrfs: protect devices list with its mutex
Since we've kill the bigger one volume_mutex, we need to add devices
list mutex back.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-12 17:15:28 -05:00
Liu Bo
b53d3f5db2 Btrfs: cleanup for btrfs_btree_balance_dirty
- 'nr' is no more used.
- btrfs_btree_balance_dirty() and __btrfs_btree_balance_dirty() can share
  a bunch of code.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-12 17:15:28 -05:00
Alexander Block
3ef5969cd8 Btrfs: merge inode_list in __merge_refs
When __merge_refs merges two refs, it is also needed to merge the
inode_list of both refs. Otherwise we have missed backrefs and memory
leaks. This happens for example if two inodes share an extent and
both lie in the same leaf and thus also have the same parent.

Signed-off-by: Alexander Block <ablock84@googlemail.com>
Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-12 17:15:27 -05:00
Tsutomu Itoh
e1f5790e05 Btrfs: set hole punching time properly
Even if the hole punching is executed, the modification time of the
file is not updated.
So, current time is set to inode.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-12 17:15:26 -05:00
Stefan Behrens
d03f918ab9 Btrfs: Don't trust the superblock label and simply printk("%s") it
Someone who is root or capable(CAP_SYS_ADMIN) could corrupt the
superblock and make Btrfs printk("%s") crash while holding the
uuid_mutex since nobody forces a limit on the string. Since the
uuid_mutex is significant, the system would be unusable
afterwards.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-12 17:15:26 -05:00
Liu Bo
109f2365f1 Btrfs: fix a double free on pending snapshots in error handling
When creating a snapshot, failing to commit a transaction can end up
with aborting the transaction, following by doing a cleanup for it, where
we'll free all snapshots pending to disk.

So we check it and avoid double free on pending snapshots.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-12 17:15:25 -05:00
Liu Bo
37c4146d22 Btrfs: fix a deadlock in aborting transaction due to ENOSPC
When committing a transaction, we may bail out of running delayed refs
due to ENOSPC, and then abort the current transaction to flip into readonly.

But we'll hit a deadlock on ref head's lock since we forget to release
its lock and other cleanup stuff.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-12 17:15:25 -05:00
Julia Lawall
6c1500f22a fs/btrfs: drop if around WARN_ON
Just use WARN_ON rather than an if containing only WARN_ON(1).

A simplified version of the semantic patch that makes this transformation
is as follows: (http://coccinelle.lip6.fr/)

// <smpl>
@@
expression e;
@@
- if (e) WARN_ON(1);
+ WARN_ON(e);
// </smpl>

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-12 17:15:24 -05:00
Julia Lawall
31b1a2bd75 fs/btrfs: use WARN
Use WARN rather than printk followed by WARN_ON(1), for conciseness.

A simplified version of the semantic patch that makes this transformation
is as follows: (http://coccinelle.lip6.fr/)

// <smpl>
@@
expression list es;
@@

-printk(
+WARN(1,
  es);
-WARN_ON(1);
// </smpl>

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-12 17:15:23 -05:00
Miao Xie
5269b67e3d Btrfs: fix missing log when BTRFS_INODE_NEEDS_FULL_SYNC is set
If we set BTRFS_INODE_NEEDS_FULL_SYNC, we should log all the extent,
but now we forget to take it into account, and set a wrong max key,
if so, we will skip the file extent metadata when doing logging. Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-12 17:15:22 -05:00
Miao Xie
bbe1426764 Btrfs: fix unprotected extent map operation when logging file extents
We forget to protect the modified_extents list, fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-12 17:15:22 -05:00
Miao Xie
315a9850da Btrfs: fix wrong file extent length
There are two types of the file extent - inline extent and regular extent,
When we log file extents, we didn't take inline extent into account, fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-12 17:15:21 -05:00
Miao Xie
ca46963718 Btrfs: fix missing flush when committing a transaction
Consider the following case:
	Task1				Task2
	start_transaction
					commit_transaction
					  check pending snapshots list and the
					  list is empty.
	add pending snapshot into list
					  skip the delalloc flush
	end_transaction
					  ...

And then the problem that the snapshot is different with the source subvolume
happen.

This patch fixes the above problem by flush all pending stuffs when all the
other tasks end the transaction.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-12 17:15:21 -05:00
Miao Xie
b7d5b0a819 Btrfs: fix joining the same transaction handler more than 2 times
If we flush inodes with pending delalloc in a transaction, we may join
the same transaction handler more than 2 times.

The reason is:
  Task						use_count of trans handle
  commit_transaction				1
    |-> btrfs_start_delalloc_inodes		1
	  |-> run_delalloc_nocow		1
		|-> join_transaction		2
		|-> cow_file_range		2
			|-> join_transaction	3

In fact, cow_file_range needn't join the transaction again because the caller
have joined the transaction, so we fix this problem by this way.

Reported-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-12 17:15:20 -05:00
Liu Bo
4fde183d8c Btrfs: cleanup for btrfs_wait_order_range
Variable 'found' is no more used.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-12 17:15:19 -05:00
Liu Bo
9f3959c53d Btrfs: get right arguments for btrfs_wait_ordered_range
btrfs_wait_ordered_range expects for 'len' instead of 'end'.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-12 17:15:19 -05:00
Liu Bo
183f37fa35 Btrfs: do not log extents when we only log new names
When we log new names, we need to log just enough to recreate the inode
during log replay, and there is no need to log extents along with it.

This actually fixes a bug revealed by xfstests 241, where it shows
that we're logging some extents that have not updated metadata,
so we don't get proper EXTENT_DATA items to be copied to log tree.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-12 17:15:18 -05:00
Stefan Behrens
292fd7fc39 Btrfs: don't allow degraded mount if too many devices are missing
The current behavior is to allow mounting or remounting a filesystem
writeable in degraded mode if at least one writeable device is
present.
The next failed write access to a missing device which is above
the tolerance of the configured level of redundancy results in an
read-only enforcement. Even without this, the next time
barrier_all_devices() is called and more devices are missing than
tolerable, the switch to read-only mode takes place.

In order to behave predictably and to provide proper feedback to
the user at mount time, this patch compares the number of missing
devices with the number of devices that are tolerated to be missing
according to the configured RAID level. If more devices are missing
than tolerated, e.g. if two devices are missing in case of RAID1,
only a read-only mount and remount is allowed.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-12 17:15:18 -05:00
Masanari Iida
d142324873 Btrfs: Fix typo in fs/btrfs
Correct spelling typo in btrfs.

Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-12 17:15:17 -05:00
jeff.liu
0253f40ef9 Btrfs: Remove the invalid shrink size check up from btrfs_shrink_dev()
Remove an invalid size check up from btrfs_shrink_dev().

The new size should not larger than the device->total_bytes as it was
already verified before coming to here(i.e. new_size < old_size).

Remove invalid check up for btrfs_shrink_dev().

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-12 17:15:16 -05:00
Andy Adamson
eb96d5c97b SUNRPC handle EKEYEXPIRED in call_refreshresult
Currently, when an RPCSEC_GSS context has expired or is non-existent
and the users (Kerberos) credentials have also expired or are non-existent,
the client receives the -EKEYEXPIRED error and tries to refresh the context
forever.  If an application is performing I/O, or other work against the share,
the application hangs, and the user is not prompted to refresh/establish their
credentials. This can result in a denial of service for other users.

Users are expected to manage their Kerberos credential lifetimes to mitigate
this issue.

Move the -EKEYEXPIRED handling into the RPC layer. Try tk_cred_retry number
of times to refresh the gss_context, and then return -EACCES to the application.

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-12-12 15:36:02 -05:00
Linus Torvalds
9977d9b379 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal
Pull big execve/kernel_thread/fork unification series from Al Viro:
 "All architectures are converted to new model.  Quite a bit of that
  stuff is actually shared with architecture trees; in such cases it's
  literally shared branch pulled by both, not a cherry-pick.

  A lot of ugliness and black magic is gone (-3KLoC total in this one):

   - kernel_thread()/kernel_execve()/sys_execve() redesign.

     We don't do syscalls from kernel anymore for either kernel_thread()
     or kernel_execve():

     kernel_thread() is essentially clone(2) with callback run before we
     return to userland, the callbacks either never return or do
     successful do_execve() before returning.

     kernel_execve() is a wrapper for do_execve() - it doesn't need to
     do transition to user mode anymore.

     As a result kernel_thread() and kernel_execve() are
     arch-independent now - they live in kernel/fork.c and fs/exec.c
     resp.  sys_execve() is also in fs/exec.c and it's completely
     architecture-independent.

   - daemonize() is gone, along with its parts in fs/*.c

   - struct pt_regs * is no longer passed to do_fork/copy_process/
     copy_thread/do_execve/search_binary_handler/->load_binary/do_coredump.

   - sys_fork()/sys_vfork()/sys_clone() unified; some architectures
     still need wrappers (ones with callee-saved registers not saved in
     pt_regs on syscall entry), but the main part of those suckers is in
     kernel/fork.c now."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal: (113 commits)
  do_coredump(): get rid of pt_regs argument
  print_fatal_signal(): get rid of pt_regs argument
  ptrace_signal(): get rid of unused arguments
  get rid of ptrace_signal_deliver() arguments
  new helper: signal_pt_regs()
  unify default ptrace_signal_deliver
  flagday: kill pt_regs argument of do_fork()
  death to idle_regs()
  don't pass regs to copy_process()
  flagday: don't pass regs to copy_thread()
  bfin: switch to generic vfork, get rid of pointless wrappers
  xtensa: switch to generic clone()
  openrisc: switch to use of generic fork and clone
  unicore32: switch to generic clone(2)
  score: switch to generic fork/vfork/clone
  c6x: sanitize copy_thread(), get rid of clone(2) wrapper, switch to generic clone()
  take sys_fork/sys_vfork/sys_clone prototypes to linux/syscalls.h
  mn10300: switch to generic fork/vfork/clone
  h8300: switch to generic fork/vfork/clone
  tile: switch to generic clone()
  ...

Conflicts:
	arch/microblaze/include/asm/Kbuild
2012-12-12 12:22:13 -08:00
Jeff Layton
be7e985804 nfs: fix page dirtying in NFS DIO read codepath
The NFS DIO code will dirty pages that catch read responses in order to
handle the case where someone is doing DIO reads into an mmapped buffer.
The existing code doesn't really do the right thing though since it
doesn't take into account the case where we might be attempting to read
past the EOF.

Fix the logic in that code to only dirty pages that ended up receiving
data from the read. Note too that it really doesn't matter if
NFS_IOHDR_ERROR is set or not. All that matters is if the page was
altered by the read.

Cc: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-12-12 12:56:19 -05:00
Jeff Layton
67fad106a2 nfs: don't zero out the rest of the page if we hit the EOF on a DIO READ
Eryu provided a test program that would segfault when attempting to read
past the EOF on file that was opened O_DIRECT. The buffer given to the
read() call was on the stack, and when he attempted to read past it it
would scribble over the rest of the stack page.

If we hit the end of the file on a DIO READ request, then we don't want
to zero out the rest of the buffer. These aren't pagecache pages after
all, and there's no guarantee that the buffers that were passed in
represent entire pages.

Cc: <stable@vger.kernel.org> # v3.5+
Cc: Fred Isaman <iisaman@netapp.com>
Reported-by: Eryu Guan <eguan@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-12-12 12:56:09 -05:00
Linus Torvalds
6facac1ab6 Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6
Pull CIFS fixes from Steve French:
 "This includes a set of misc.  cifs fixes (most importantly some byte
  range lock related write fixes from Pavel, and some ACL and idmap
  related fixes from Jeff) but also includes the SMB2.02 dialect
  enablement, and a key fix for SMB3 mounts.

  Default authentication upgraded to ntlmv2 for cifs (it was already
  ntlmv2 for smb2)"

* 'for-next' of git://git.samba.org/sfrench/cifs-2.6: (43 commits)
  CIFS: Fix write after setting a read lock for read oplock files
  cifs: parse the device name into UNC and prepath
  cifs: fix up handling of prefixpath= option
  cifs: clean up handling of unc= option
  cifs: fix SID binary to string conversion
  fix "disabling echoes and oplocks" on SMB2 mounts
  Do not send SMB2 signatures for SMB3 frames
  cifs: deal with id_to_sid embedded sid reply corner case
  cifs: fix hardcoded default security descriptor length
  cifs: extra sanity checking for cifs.idmap keys
  cifs: avoid extra allocation for small cifs.idmap keys
  cifs: simplify id_to_sid and sid_to_id mapping code
  CIFS: Fix possible data coherency problem after oplock break to None
  CIFS: Do not permit write to a range mandatory locked with a read lock
  cifs: rename cifs_readdir_lookup to cifs_prime_dcache and make it void return
  cifs: Add CONFIG_CIFS_DEBUG and rename use of CIFS_DEBUG
  cifs: Make CIFS_DEBUG possible to undefine
  SMB3 mounts fail with access denied to some servers
  cifs: Remove unused cEVENT macro
  cifs: always zero out smb_vol before parsing options
  ...
2012-12-12 09:24:13 -08:00
Linus Torvalds
3f1c64f410 xfs: update for 3.8-rc1
- remove the xfssyncd mess
 - only update the last_sync_lsn when a transaction completes
 - zero allocation_args on the kernel stack
 - fix AGF/alloc workqueue deadlock
 - silence uninitialised f.file warning
 - Update inode alloc comments
 - Update mount options documentation
 - report projid32bit feature in geometry call
 - speculative preallocation inode tracking
 - fix attr tree double split corruption
 - fix broken error handling in xfs_vm_writepage
 - drop buffer io reference when a bad bio is built
 - add more attribute tree trace points
 - growfs infrastructure changes for 3.8
 - fs/xfs/xfs_fs_subr.c die die die
 - add CRC infrastructure
 - add CRC checks to the log
 - Remove description of nodelaylog mount option from xfs.txt
 - inode allocation should use unmapped buffers
 - byte range granularity for XFS_IOC_ZERO_RANGE
 - fix direct IO nested transaction deadlock
 - fix stray dquot unlock when reclaiming dquots
 - fix sparse reported log CRC endian issue
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.10 (GNU/Linux)
 
 iQIcBAABAgAGBQJQx3MFAAoJENaLyazVq6ZO3MgP/Aux0lrRv8H3j7Bj0zGbx6Wc
 BUVVUTEifMDCvzyQJNqhX7SeINb1IhlYtoH6FDICENA/yWhIzAs7z/xhDbJLFKDq
 xDNwt2WMqIN+DVX/X3SFxevIKMjufKjI4nM9qzcJn7sFmsxreaZLI1Xw5jEab2ou
 kkx//YTzYzk56tUzd6xOI1QFXqU7N2wGylx+eyPcNtFrgVOMUDCIhlVE0WT8sUVv
 jfGrLMVG6g6RLVHqpExzmuXJS04kv+R6WK4J4ZkHZ/shspYRs1lFO2OKUP1gw5/W
 8acDhJPIwKJb5mnPoxvYTXTqUv0jYvBQ+UWv6OWSK4EprN0ePVOFZOCQ9YIIirrU
 jiITiK1c0t9z6O1jenYoGvqptcfb+xodgSpa7lvlMquZCqaoX3mimg/01ii+aZu5
 6b4W4dMZ6UMI8WiCS44GcLdsfC7wuUL1Kwdoh95xzPyT7nUPFiikWkMmYVF1btLQ
 5kftZhOImlhU+js/nVcRfAEiLr1eS1g/nF+3+zNLYGSO0ZaIuJ/8Zg297mYS0pQx
 yWaeblX3idgpnPZxUHDvrPNnJsbnchujNer0V2fCGRcX3nmF5rvaeclsz//eRF/b
 TrpXYxgKr/oXbKmB5KaTHpG+wKSJYKHzMV7fzUwPdLM106OEt98Ke1ptAmTZU9gp
 ugXEwjMEsB3qDwD95JVz
 =Cl7A
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-v3.8-rc1' of git://oss.sgi.com/xfs/xfs

Pull xfs update from Ben Myers:
 "There is plenty going on, including the cleanup of xfssyncd, metadata
  verifiers, CRC infrastructure for the log, tracking of inodes with
  speculative allocation, a cleanup of xfs_fs_subr.c, fixes for
  XFS_IOC_ZERO_RANGE, and important fix related to log replay (only
  update the last_sync_lsn when a transaction completes), a fix for
  deadlock on AGF buffers, documentation and comment updates, and a few
  more cleanups and fixes.

  Details:
   - remove the xfssyncd mess
   - only update the last_sync_lsn when a transaction completes
   - zero allocation_args on the kernel stack
   - fix AGF/alloc workqueue deadlock
   - silence uninitialised f.file warning
   - Update inode alloc comments
   - Update mount options documentation
   - report projid32bit feature in geometry call
   - speculative preallocation inode tracking
   - fix attr tree double split corruption
   - fix broken error handling in xfs_vm_writepage
   - drop buffer io reference when a bad bio is built
   - add more attribute tree trace points
   - growfs infrastructure changes for 3.8
   - fs/xfs/xfs_fs_subr.c die die die
   - add CRC infrastructure
   - add CRC checks to the log
   - Remove description of nodelaylog mount option from xfs.txt
   - inode allocation should use unmapped buffers
   - byte range granularity for XFS_IOC_ZERO_RANGE
   - fix direct IO nested transaction deadlock
   - fix stray dquot unlock when reclaiming dquots
   - fix sparse reported log CRC endian issue"

Fix up trivial conflict in fs/xfs/xfs_fsops.c due to the same patch
having been applied twice (commits eaef854335 and 1375cb65e8: "xfs:
growfs: don't read garbage for new secondary superblocks") with later
updates to the affected code in the XFS tree.

* tag 'for-linus-v3.8-rc1' of git://oss.sgi.com/xfs/xfs: (78 commits)
  xfs: fix sparse reported log CRC endian issue
  xfs: fix stray dquot unlock when reclaiming dquots
  xfs: fix direct IO nested transaction deadlock.
  xfs: byte range granularity for XFS_IOC_ZERO_RANGE
  xfs: inode allocation should use unmapped buffers.
  xfs: Remove the description of nodelaylog mount option from xfs.txt
  xfs: add CRC checks to the log
  xfs: add CRC infrastructure
  xfs: convert buffer verifiers to an ops structure.
  xfs: connect up write verifiers to new buffers
  xfs: add pre-write metadata buffer verifier callbacks
  xfs: add buffer pre-write callback
  xfs: Add verifiers to dir2 data readahead.
  xfs: add xfs_da_node verification
  xfs: factor and verify attr leaf reads
  xfs: factor dir2 leaf read
  xfs: factor out dir2 data block reading
  xfs: factor dir2 free block reading
  xfs: verify dir2 block format buffers
  xfs: factor dir2 block read operations
  ...
2012-12-12 09:19:45 -08:00
Linus Torvalds
22a40fd9a6 dlm for 3.8
This set fixes some conditions in which value blocks are invalidated,
 and includes two trivial cleanups.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJQx3nAAAoJEDgbc8f8gGmqfBgQAIw46bVtZxxK/DcyGMR5UCtV
 ARLHieFbse3fJNkIXOD96G7Psk/dDgjulclewccXcdgu+VGyXQ1g1YJG9/L0Sv17
 mRL+WlceXWWZ9LuUwRicBNerwd3MBLGndEV78fFweopV7FNYBF7qTWzywTLHCGmf
 iB5/jdSLhPzj4ele+BA1XUHqQYOiDEmbLlw8sSNU3kxiOCO/lqWlQLd1t+YoOUI/
 8YPOfmPFxu0SbBmvoTlr53w+gvDpoTV1AdVJO2Pe7yuIAAUWcMN1NHTfyL3ua3K9
 LPh/eSltcKLdS7wjcNoufL5CEPsaTnmO28MZdHO+S3JG2T7glhBo6j/c3v1JO4rV
 MpBFu1Blm2CeWzmC8tTzwyK/mLRrfjue/4PV11rYUcaBIl/NwaapnUXF8doS81EX
 jDgTX7flZa4ykv6f/yca8aJJdIRQtpS4AsmKMigL8TN1JQd22e3LOr30JFkLy7D5
 fzidJbhusbeD2kDsskXwMfyF5kUYXLdVQQqwM3BK8+YwjqyM9ReI5XHQWJrdJyzH
 u3q6HjO8Wb0e3al2Ay1BhYhkARBm+1vBjxc9fdNXuXdESUvrB5GBB2xrWVgv0Uu8
 Dj/ml9hiLr2PcZ0yo+kqkLOpRVrTxQ03IBAaAjsraVjX7exFlEv5gsecP/3Ps7P3
 yWxfHLapr/8Vf4itF7hn
 =Pb3H
 -----END PGP SIGNATURE-----

Merge tag 'dlm-3.8' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm

Pull dlm updates from David Teigland:
 "This set fixes some conditions in which value blocks are invalidated,
  and includes two trivial cleanups."

* tag 'dlm-3.8' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
  dlm: fix lvb invalidation conditions
  fs/dlm: remove CONFIG_EXPERIMENTAL
  dlm: remove unused variable in *dlm_lowcomms_get_buffer()
2012-12-12 09:15:45 -08:00
Linus Torvalds
251a8cfeda Patch series to allow EFI variable backend to pstore
to hold multiple records.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQIcBAABAgAGBQJQxjFJAAoJEKurIx+X31iBNssQAKCjiXe7dT8DigoKSPj3w1sJ
 8xcuJKmlZ5ykgElfQ1Lcfuz0KuMT8SZfBHrRCL389Yvtxq71NL9pMX1OyCqyfVyO
 uuKWD/+jVc5d11F15hC2l8zIU7v2S3TXQ64rZEgDDFXIPlTbEcEBy92arpq6Avj3
 6ioR+rf4jvOeC66keIcDOIl6r4tVn2f16B9Z6ffSOit0dVeRAOc7bB9LLaohETSY
 rCSKHgZLHBZRQv2yGKS3LB6KSL2DVm+lgNhzAVw4kEJNTJl04clK854L8EM7WfGj
 qREsvSUB4H4I2XMBW+HCU/6+4c4alSEr/2c8+xBeynL4+I2y+VLwySkwE+8DPgEV
 BjCuyjSmpZVuVz6R28hcJIbkK0sN1gntXitb2V157+89f3gVEMYwKaY787TuNl9g
 oWYjjeOk66r5xshdVvCnFIgZxgFItznn+kgRkNq0clGH+nWVsZfuYw/q2PLm9xoM
 TJR+ZZwMJC25TE99rh9CqVc0kuQi5SXfqjpd+fW8oJ2kSxj4/7SqmOpmK7oWLFNi
 kf0xS1yVsQB5mRx3nvERvA36lzNSiQRFXnMjh84be4NX8dYqiGoJQjOh/laemjxd
 8hpoMVuUMtXLC4roejyyLwddtTLshWm2ByULAHJRJpuUKzichkaqhuvvkMUjKfDj
 NunMg/OfNs78sTlNSATf
 =NAdA
 -----END PGP SIGNATURE-----

Merge tag 'please-pull-pstore_mevent' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux

Pull pstore fixes from Tony Luck:
 "Patch series to allow EFI variable backend to pstore to hold multiple
  records."

* tag 'please-pull-pstore_mevent' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux:
  efi_pstore: Add a format check for an existing variable name at erasing time
  efi_pstore: Add a format check for an existing variable name at reading time
  efi_pstore: Add a sequence counter to a variable name
  efi_pstore: Add ctime to argument of erase callback
  efi_pstore: Remove a logic erasing entries from a write callback to hold multiple logs
  efi_pstore: Add a logic erasing entries to an erase callback
  efi_pstore: Check remaining space with QueryVariableInfo() before writing data
2012-12-12 07:50:59 -08:00
Linus Torvalds
f57d54bab6 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
 "The biggest change affects group scheduling: we now track the runnable
  average on a per-task entity basis, allowing a smoother, exponential
  decay average based load/weight estimation instead of the previous
  binary on-the-runqueue/off-the-runqueue load weight method.

  This will inevitably disturb workloads that were in some sort of
  borderline balancing state or unstable equilibrium, so an eye has to
  be kept on regressions.

  For that reason the new load average is only limited to group
  scheduling (shares distribution) at the moment (which was also hurting
  the most from the prior, crude weight calculation and whose scheduling
  quality wins most from this change) - but we plan to extend this to
  regular SMP balancing as well in the future, which will simplify and
  speed up things a bit.

  Other changes involve ongoing preparatory work to extend NOHZ to the
  scheduler as well, eventually allowing completely irq-free user-space
  execution."

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
  Revert "sched/autogroup: Fix crash on reboot when autogroup is disabled"
  cputime: Comment cputime's adjusting code
  cputime: Consolidate cputime adjustment code
  cputime: Rename thread_group_times to thread_group_cputime_adjusted
  cputime: Move thread_group_cputime() to sched code
  vtime: Warn if irqs aren't disabled on system time accounting APIs
  vtime: No need to disable irqs on vtime_account()
  vtime: Consolidate a bit the ctx switch code
  vtime: Explicitly account pending user time on process tick
  vtime: Remove the underscore prefix invasion
  sched/autogroup: Fix crash on reboot when autogroup is disabled
  cputime: Separate irqtime accounting from generic vtime
  cputime: Specialize irq vtime hooks
  kvm: Directly account vtime to system on guest switch
  vtime: Make vtime_account_system() irqsafe
  vtime: Gather vtime declarations to their own header file
  sched: Describe CFS load-balancer
  sched: Introduce temporary FAIR_GROUP_SCHED dependency for load-tracking
  sched: Make __update_entity_runnable_avg() fast
  sched: Update_cfs_shares at period edge
  ...
2012-12-11 18:21:38 -08:00
Linus Torvalds
608ff1a210 Merge branch 'akpm' (Andrew's patchbomb)
Merge misc updates from Andrew Morton:
 "About half of most of MM.  Going very early this time due to
  uncertainty over the coreautounifiednumasched things.  I'll send the
  other half of most of MM tomorrow.  The rest of MM awaits a slab merge
  from Pekka."

* emailed patches from Andrew Morton: (71 commits)
  memory_hotplug: ensure every online node has NORMAL memory
  memory_hotplug: handle empty zone when online_movable/online_kernel
  mm, memory-hotplug: dynamic configure movable memory and portion memory
  drivers/base/node.c: cleanup node_state_attr[]
  bootmem: fix wrong call parameter for free_bootmem()
  avr32, kconfig: remove HAVE_ARCH_BOOTMEM
  mm: cma: remove watermark hacks
  mm: cma: skip watermarks check for already isolated blocks in split_free_page()
  mm, oom: fix race when specifying a thread as the oom origin
  mm, oom: change type of oom_score_adj to short
  mm: cleanup register_node()
  mm, mempolicy: remove duplicate code
  mm/vmscan.c: try_to_freeze() returns boolean
  mm: introduce putback_movable_pages()
  virtio_balloon: introduce migration primitives to balloon pages
  mm: introduce compaction and migration for ballooned pages
  mm: introduce a common interface for balloon pages mobility
  mm: redefine address_space.assoc_mapping
  mm: adjust address_space_operations.migratepage() return code
  arch/sparc/kernel/sys_sparc_64.c: s/COLOUR/COLOR/
  ...
2012-12-11 18:05:37 -08:00
David Rientjes
a9c58b907d mm, oom: change type of oom_score_adj to short
The maximum oom_score_adj is 1000 and the minimum oom_score_adj is -1000,
so this range can be represented by the signed short type with no
functional change.  The extra space this frees up in struct signal_struct
will be used for per-thread oom kill flags in the next patch.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Anton Vorontsov <anton.vorontsov@linaro.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-11 17:22:27 -08:00
Rafael Aquini
252aa6f5be mm: redefine address_space.assoc_mapping
Overhaul struct address_space.assoc_mapping renaming it to
address_space.private_data and its type is redefined to void*.  By this
approach we consistently name the .private_* elements from struct
address_space as well as allow extended usage for address_space
association with other data structures through ->private_data.

Also, all users of old ->assoc_mapping element are converted to reflect
its new name and type change (->private_data).

Signed-off-by: Rafael Aquini <aquini@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-11 17:22:26 -08:00
Rafael Aquini
78bd52097d mm: adjust address_space_operations.migratepage() return code
Memory fragmentation introduced by ballooning might reduce significantly
the number of 2MB contiguous memory blocks that can be used within a
guest, thus imposing performance penalties associated with the reduced
number of transparent huge pages that could be used by the guest workload.

This patch-set follows the main idea discussed at 2012 LSFMMS session:
"Ballooning for transparent huge pages" -- http://lwn.net/Articles/490114/
to introduce the required changes to the virtio_balloon driver, as well as
the changes to the core compaction & migration bits, in order to make
those subsystems aware of ballooned pages and allow memory balloon pages
become movable within a guest, thus avoiding the aforementioned
fragmentation issue

Following are numbers that prove this patch benefits on allowing
compaction to be more effective at memory ballooned guests.

Results for STRESS-HIGHALLOC benchmark, from Mel Gorman's mmtests suite,
running on a 4gB RAM KVM guest which was ballooning 512mB RAM in 64mB
chunks, at every minute (inflating/deflating), while test was running:

===BEGIN stress-highalloc

STRESS-HIGHALLOC
                 highalloc-3.7     highalloc-3.7
                     rc4-clean         rc4-patch
Pass 1          55.00 ( 0.00%)    62.00 ( 7.00%)
Pass 2          54.00 ( 0.00%)    62.00 ( 8.00%)
while Rested    75.00 ( 0.00%)    80.00 ( 5.00%)

MMTests Statistics: duration
                 3.7         3.7
           rc4-clean   rc4-patch
User         1207.59     1207.46
System       1300.55     1299.61
Elapsed      2273.72     2157.06

MMTests Statistics: vmstat
                                3.7         3.7
                          rc4-clean   rc4-patch
Page Ins                    3581516     2374368
Page Outs                  11148692    10410332
Swap Ins                         80          47
Swap Outs                      3641         476
Direct pages scanned          37978       33826
Kswapd pages scanned        1828245     1342869
Kswapd pages reclaimed      1710236     1304099
Direct pages reclaimed        32207       31005
Kswapd efficiency               93%         97%
Kswapd velocity             804.077     622.546
Direct efficiency               84%         91%
Direct velocity              16.703      15.682
Percentage direct scans          2%          2%
Page writes by reclaim        79252        9704
Page writes file              75611        9228
Page writes anon               3641         476
Page reclaim immediate        16764       11014
Page rescued immediate            0           0
Slabs scanned               2171904     2152448
Direct inode steals             385        2261
Kswapd inode steals          659137      609670
Kswapd skipped wait               1          69
THP fault alloc                 546         631
THP collapse alloc              361         339
THP splits                      259         263
THP fault fallback               98          50
THP collapse fail                20          17
Compaction stalls               747         499
Compaction success              244         145
Compaction failures             503         354
Compaction pages moved       370888      474837
Compaction move failure       77378       65259

===END stress-highalloc

This patch:

Introduce MIGRATEPAGE_SUCCESS as the default return code for
address_space_operations.migratepage() method and documents the expected
return code for the same method in failure cases.

Signed-off-by: Rafael Aquini <aquini@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-11 17:22:26 -08:00
Michel Lespinasse
0865935598 mm: use vm_unmapped_area() in hugetlbfs
Update the hugetlb_get_unmapped_area function to make use of
vm_unmapped_area() instead of implementing a brute force search.

Signed-off-by: Michel Lespinasse <walken@google.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-11 17:22:25 -08:00
Andi Kleen
42d7395feb mm: support more pagesizes for MAP_HUGETLB/SHM_HUGETLB
There was some desire in large applications using MAP_HUGETLB or
SHM_HUGETLB to use 1GB huge pages on some mappings, and stay with 2MB on
others.  This is useful together with NUMA policy: use 2MB interleaving
on some mappings, but 1GB on local mappings.

This patch extends the IPC/SHM syscall interfaces slightly to allow
specifying the page size.

It borrows some upper bits in the existing flag arguments and allows
encoding the log of the desired page size in addition to the *_HUGETLB
flag.  When 0 is specified the default size is used, this makes the
change fully compatible.

Extending the internal hugetlb code to handle this is straight forward.
Instead of a single mount it just keeps an array of them and selects the
right mount based on the specified page size.  When no page size is
specified it uses the mount of the default page size.

The change is not visible in /proc/mounts because internal mounts don't
appear there.  It also has very little overhead: the additional mounts
just consume a super block, but not more memory when not used.

I also exported the new flags to the user headers (they were previously
under __KERNEL__).  Right now only symbols for x86 and some other
architecture for 1GB and 2MB are defined.  The interface should already
work for all other architectures though.  Only architectures that define
multiple hugetlb sizes actually need it (that is currently x86, tile,
powerpc).  However tile and powerpc have user configurable hugetlb
sizes, so it's not easy to add defines.  A program on those
architectures would need to query sysfs and use the appropiate log2.

[akpm@linux-foundation.org: cleanups]
[rientjes@google.com: fix build]
[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hillf Danton <dhillf@gmail.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-11 17:22:25 -08:00
Namjae Jeon
d0e1d66b5a writeback: remove nr_pages_dirtied arg from balance_dirty_pages_ratelimited_nr()
There is no reason to pass the nr_pages_dirtied argument, because
nr_pages_dirtied value from the caller is unused in
balance_dirty_pages_ratelimited_nr().

Signed-off-by: Namjae Jeon <linkinjeon@gmail.com>
Signed-off-by: Vivek Trivedi <vtrivedi018@gmail.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-11 17:22:21 -08:00
Linus Torvalds
c6bd5bcc49 TTY/Serial merge for 3.8-rc1
Here's the big tty/serial tree set of changes for 3.8-rc1.
 
 Contained in here is a bunch more reworks of the tty port layer from Jiri and
 bugfixes from Alan, along with a number of other tty and serial driver updates
 by the various driver authors.
 
 Also, Jiri has been coerced^Wconvinced to be the co-maintainer of the TTY
 layer, which is much appreciated by me.
 
 All of these have been in the linux-next tree for a while.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.19 (GNU/Linux)
 
 iEYEABECAAYFAlDHhgwACgkQMUfUDdst+ynI6wCcC+YeBwncnoWHvwLAJOwAZpUL
 bysAn28o780/lOsTzp3P1Qcjvo69nldo
 =hN/g
 -----END PGP SIGNATURE-----

Merge tag 'tty-3.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty

Pull TTY/Serial merge from Greg Kroah-Hartman:
 "Here's the big tty/serial tree set of changes for 3.8-rc1.

  Contained in here is a bunch more reworks of the tty port layer from
  Jiri and bugfixes from Alan, along with a number of other tty and
  serial driver updates by the various driver authors.

  Also, Jiri has been coerced^Wconvinced to be the co-maintainer of the
  TTY layer, which is much appreciated by me.

  All of these have been in the linux-next tree for a while.

  Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>"

Fixed up some trivial conflicts in the staging tree, due to the fwserial
driver having come in both ways (but fixed up a bit in the serial tree),
and the ioctl handling in the dgrp driver having been done slightly
differently (staging tree got that one right, and removed both
TIOCGSOFTCAR and TIOCSSOFTCAR).

* tag 'tty-3.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (146 commits)
  staging: sb105x: fix potential NULL pointer dereference in mp_chars_in_buffer()
  staging/fwserial: Remove superfluous free
  staging/fwserial: Use WARN_ONCE when port table is corrupted
  staging/fwserial: Destruct embedded tty_port on teardown
  staging/fwserial: Fix build breakage when !CONFIG_BUG
  staging: fwserial: Add TTY-over-Firewire serial driver
  drivers/tty/serial/serial_core.c: clean up HIGH_BITS_OFFSET usage
  staging: dgrp: dgrp_tty.c: Audit the return values of get/put_user()
  staging: dgrp: dgrp_tty.c: Remove the TIOCSSOFTCAR ioctl handler from dgrp driver
  serial: ifx6x60: Add modem power off function in the platform reboot process
  serial: mxs-auart: unmap the scatter list before we copy the data
  serial: mxs-auart: disable the Receive Timeout Interrupt when DMA is enabled
  serial: max310x: Setup missing "can_sleep" field for GPIO
  tty/serial: fix ifx6x60.c declaration warning
  serial: samsung: add devicetree properties for non-Exynos SoCs
  serial: samsung: fix potential soft lockup during uart write
  tty: vt: Remove redundant null check before kfree.
  tty/8250 Add check for pci_ioremap_bar failure
  tty/8250 Add support for Commtech's Fastcom Async-335 and Fastcom Async-PCIe cards
  tty/8250 Add XR17D15x devices to the exar_handle_irq override
  ...
2012-12-11 14:08:47 -08:00
Linus Torvalds
cff2f741b8 Driver core updates for 3.8-rc1
Here's the large driver core updates for 3.8-rc1.
 
 The biggest thing here is the various __dev* marking removals.  This is
 going to be a pain for the merge with different subsystem trees, I know,
 but all of the patches included here have been ACKed by their various
 subsystem maintainers, as they wanted them to go through here.
 
 If this is too much of a pain, I can pull all of them out of this tree
 and just send you one with the other fixes/updates and then, after
 3.8-rc1 is out, do the rest of the removals to ensure we catch them all,
 it's up to you.  The merges should all be trivial, and Stephen has been
 doing them all in linux-next for a few weeks now quite easily.
 
 Other than the __dev* marking removals, there's nothing major here, some
 firmware loading updates and other minor things in the driver core.
 
 All of these have (much to Stephen's annoyance), been in linux-next for
 a while.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.19 (GNU/Linux)
 
 iEYEABECAAYFAlDHkPkACgkQMUfUDdst+ykaWgCfW7AM30cv0nzoVO08ax6KjlG1
 KVYAn3z/KYazvp4B6LMvrW9y0G34Wmad
 =yvVr
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-3.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core updates from Greg Kroah-Hartman:
 "Here's the large driver core updates for 3.8-rc1.

  The biggest thing here is the various __dev* marking removals.  This
  is going to be a pain for the merge with different subsystem trees, I
  know, but all of the patches included here have been ACKed by their
  various subsystem maintainers, as they wanted them to go through here.

  If this is too much of a pain, I can pull all of them out of this tree
  and just send you one with the other fixes/updates and then, after
  3.8-rc1 is out, do the rest of the removals to ensure we catch them
  all, it's up to you.  The merges should all be trivial, and Stephen
  has been doing them all in linux-next for a few weeks now quite
  easily.

  Other than the __dev* marking removals, there's nothing major here,
  some firmware loading updates and other minor things in the driver
  core.

  All of these have (much to Stephen's annoyance), been in linux-next
  for a while.

  Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>"

Fixed up trivial conflicts in drivers/gpio/gpio-{em,stmpe}.c due to gpio
update.

* tag 'driver-core-3.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (93 commits)
  modpost.c: Stop checking __dev* section mismatches
  init.h: Remove __dev* sections from the kernel
  acpi: remove use of __devinit
  PCI: Remove __dev* markings
  PCI: Always build setup-bus when PCI is enabled
  PCI: Move pci_uevent into pci-driver.c
  PCI: Remove CONFIG_HOTPLUG ifdefs
  unicore32/PCI: Remove CONFIG_HOTPLUG ifdefs
  sh/PCI: Remove CONFIG_HOTPLUG ifdefs
  powerpc/PCI: Remove CONFIG_HOTPLUG ifdefs
  mips/PCI: Remove CONFIG_HOTPLUG ifdefs
  microblaze/PCI: Remove CONFIG_HOTPLUG ifdefs
  dma: remove use of __devinit
  dma: remove use of __devexit_p
  firewire: remove use of __devinitdata
  firewire: remove use of __devinit
  leds: remove use of __devexit
  leds: remove use of __devinit
  leds: remove use of __devexit_p
  mmc: remove use of __devexit
  ...
2012-12-11 13:13:55 -08:00
Eric Paris
1ca39ab9d2 inotify: automatically restart syscalls
We were mistakenly returning EINTR when we found an outstanding signal.
Instead we should returen ERESTARTSYS and allow the kernel to handle
things the right way.

Patch-from: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
2012-12-11 13:44:37 -05:00
Lino Sanfilippo
8b99c3ccf7 inotify: dont skip removal of watch descriptor if creation of ignored event failed
In inotify_ignored_and_remove_idr() the removal of a watch descriptor is skipped
if the allocation of an ignored event failed and we are leaking memory (the
watch descriptor and the mark linked to it).
This patch ensures that the watch descriptor is removed regardless of whether
event creation failed or not.

Signed-off-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Signed-off-by: Eric Paris <eparis@redhat.com>
2012-12-11 13:44:37 -05:00
Lino Sanfilippo
03a1cec1f1 fanotify: dont merge permission events
Boyd Yang reported a problem for the case that multiple threads of the same
thread group are waiting for a reponse for a permission event.
In this case it is possible that some of the threads are never woken up, even
if the response for the event has been received
(see http://marc.info/?l=linux-kernel&m=131822913806350&w=2).

The reason is that we are currently merging permission events if they belong to
the same thread group. But we are not prepared to wake up more than one waiter
for each event. We do

wait_event(group->fanotify_data.access_waitq, event->response ||
			atomic_read(&group->fanotify_data.bypass_perm));
and after that
  event->response = 0;

which is the reason that even if we woke up all waiters for the same event
some of them may see event->response being already set 0 again, then go back to
sleep and block forever.

With this patch we avoid that more than one thread is waiting for a response
by not merging permission events for the same thread group any more.

Reported-by: Boyd Yang <boyd.yang@gmail.com>
Signed-off-by: Lino Sanfilippo <LinoSanfilipp@gmx.de>
Signed-off-by: Eric Paris <eparis@redhat.com>
2012-12-11 13:44:37 -05:00
Eric Paris
0a6b6bd591 fsnotify: make fasync generic for both inotify and fanotify
inotify is supposed to support async signal notification when information
is available on the inotify fd.  This patch moves that support to generic
fsnotify functions so it can be used by all notification mechanisms.

Signed-off-by: Eric Paris <eparis@redhat.com>
2012-12-11 13:44:36 -05:00
Lino Sanfilippo
6960b0d909 fsnotify: change locking order
On Mon, Aug 01, 2011 at 04:38:22PM -0400, Eric Paris wrote:
>
> I finally built and tested a v3.0 kernel with these patches (I know I'm
> SOOOOOO far behind).  Not what I hoped for:
>
> > [  150.937798] VFS: Busy inodes after unmount of tmpfs. Self-destruct in 5 seconds.  Have a nice day...
> > [  150.945290] BUG: unable to handle kernel NULL pointer dereference at 0000000000000070
> > [  150.946012] IP: [<ffffffff810ffd58>] shmem_free_inode+0x18/0x50
> > [  150.946012] PGD 2bf9e067 PUD 2bf9f067 PMD 0
> > [  150.946012] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
> > [  150.946012] CPU 0
> > [  150.946012] Modules linked in: nfs lockd fscache auth_rpcgss nfs_acl sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables ext4 jbd2 crc16 joydev ata_piix i2c_piix4 pcspkr uinput ipv6 autofs4 usbhid [last unloaded: scsi_wait_scan]
> > [  150.946012]
> > [  150.946012] Pid: 2764, comm: syscall_thrash Not tainted 3.0.0+ #1 Red Hat KVM
> > [  150.946012] RIP: 0010:[<ffffffff810ffd58>]  [<ffffffff810ffd58>] shmem_free_inode+0x18/0x50
> > [  150.946012] RSP: 0018:ffff88002c2e5df8  EFLAGS: 00010282
> > [  150.946012] RAX: 000000004e370d9f RBX: 0000000000000000 RCX: ffff88003a029438
> > [  150.946012] RDX: 0000000033630a5f RSI: 0000000000000000 RDI: ffff88003491c240
> > [  150.946012] RBP: ffff88002c2e5e08 R08: 0000000000000000 R09: 0000000000000000
> > [  150.946012] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88003a029428
> > [  150.946012] R13: ffff88003a029428 R14: ffff88003a029428 R15: ffff88003499a610
> > [  150.946012] FS:  00007f5a05420700(0000) GS:ffff88003f600000(0000) knlGS:0000000000000000
> > [  150.946012] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> > [  150.946012] CR2: 0000000000000070 CR3: 000000002a662000 CR4: 00000000000006f0
> > [  150.946012] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [  150.946012] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > [  150.946012] Process syscall_thrash (pid: 2764, threadinfo ffff88002c2e4000, task ffff88002bfbc760)
> > [  150.946012] Stack:
> > [  150.946012]  ffff88003a029438 ffff88003a029428 ffff88002c2e5e38 ffffffff81102f76
> > [  150.946012]  ffff88003a029438 ffff88003a029598 ffffffff8160f9c0 ffff88002c221250
> > [  150.946012]  ffff88002c2e5e68 ffffffff8115e9be ffff88002c2e5e68 ffff88003a029438
> > [  150.946012] Call Trace:
> > [  150.946012]  [<ffffffff81102f76>] shmem_evict_inode+0x76/0x130
> > [  150.946012]  [<ffffffff8115e9be>] evict+0x7e/0x170
> > [  150.946012]  [<ffffffff8115ee40>] iput_final+0xd0/0x190
> > [  150.946012]  [<ffffffff8115ef33>] iput+0x33/0x40
> > [  150.946012]  [<ffffffff81180205>] fsnotify_destroy_mark_locked+0x145/0x160
> > [  150.946012]  [<ffffffff81180316>] fsnotify_destroy_mark+0x36/0x50
> > [  150.946012]  [<ffffffff81181937>] sys_inotify_rm_watch+0x77/0xd0
> > [  150.946012]  [<ffffffff815aca52>] system_call_fastpath+0x16/0x1b
> > [  150.946012] Code: 67 4a 00 b8 e4 ff ff ff eb aa 66 0f 1f 84 00 00 00 00 00 55 48 89 e5 48 83 ec 10 48 89 1c 24 4c 89 64 24 08 48 8b 9f 40 05 00 00
> > [  150.946012]  83 7b 70 00 74 1c 4c 8d a3 80 00 00 00 4c 89 e7 e8 d2 5d 4a
> > [  150.946012] RIP  [<ffffffff810ffd58>] shmem_free_inode+0x18/0x50
> > [  150.946012]  RSP <ffff88002c2e5df8>
> > [  150.946012] CR2: 0000000000000070
>
> Looks at aweful lot like the problem from:
> http://www.spinics.net/lists/linux-fsdevel/msg46101.html
>

I tried to reproduce this bug with your test program, but without success.
However, if I understand correctly, this occurs since we dont hold any locks when
we call iput() in mark_destroy(), right?
With the patches you tested, iput() is also not called within any lock, since the
groups mark_mutex is released temporarily before iput() is called.  This is, since
the original codes behaviour is similar.
However since we now have a mutex as the biggest lock, we can do what you
suggested (http://www.spinics.net/lists/linux-fsdevel/msg46107.html) and
call iput() with the mutex held to avoid the race.
The patch below implements this. It uses nested locking to avoid deadlock in case
we do the final iput() on an inode which still holds marks and thus would take
the mutex again when calling fsnotify_inode_delete() in destroy_inode().

Signed-off-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Signed-off-by: Eric Paris <eparis@redhat.com>
2012-12-11 13:44:36 -05:00
Lino Sanfilippo
64c20d2a20 fsnotify: dont put marks on temporary list when clearing marks by group
In clear_marks_by_group_flags() the mark list of a group is iterated and the
marks are put on a temporary list.
Since we introduced fsnotify_destroy_mark_locked() we dont need the temp list
any more and are able to remove the marks while the mark list is iterated and
the mark list mutex is held.

Signed-off-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Signed-off-by: Eric Paris <eparis@redhat.com>
2012-12-11 13:44:36 -05:00
Lino Sanfilippo
d5a335b845 fsnotify: introduce locked versions of fsnotify_add_mark() and fsnotify_remove_mark()
This patch introduces fsnotify_add_mark_locked() and fsnotify_remove_mark_locked()
which are essentially the same as fsnotify_add_mark() and fsnotify_remove_mark() but
assume that the caller has already taken the groups mark mutex.

Signed-off-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Signed-off-by: Eric Paris <eparis@redhat.com>
2012-12-11 13:44:36 -05:00
Lino Sanfilippo
e2a29943e9 fsnotify: pass group to fsnotify_destroy_mark()
In fsnotify_destroy_mark() dont get the group from the passed mark anymore,
but pass the group itself as an additional parameter to the function.

Signed-off-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Signed-off-by: Eric Paris <eparis@redhat.com>
2012-12-11 13:44:36 -05:00
Miao Xie
9afab8820b Btrfs: make ordered extent be flushed by multi-task
Though the process of the ordered extents is a bit different with the delalloc inode
flush, but we can see it as a subset of the delalloc inode flush, so we also handle
them by flush workers.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-11 13:31:38 -05:00