Commit graph

34236 commits

Author SHA1 Message Date
Yan, Zheng
81c6aea527 ceph: handle frag mismatch between readdir request and reply
If client has outdated directory fragments information, it may request
readdir an non-existent directory fragment. In this case, the MDS finds
an approximate directory fragment and sends its contents back to the
client. When receiving a reply with fragment that is different than the
requested one, the client need to reset the 'readdir offset'.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-09-30 14:49:53 -07:00
Yan, Zheng
53e879a485 ceph: remove outdated frag information
If directory fragments change, fill_inode() inserts new frags into
the fragtree, but it does not remove outdated frags from the fragtree.
This patch fixes it.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-09-30 14:49:28 -07:00
Linus Torvalds
522d6d38f8 Merge branch 'akpm' (fixes from Andrew Morton)
Merge misc fixes from Andrew Morton.

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (22 commits)
  pidns: fix free_pid() to handle the first fork failure
  ipc,msg: prevent race with rmid in msgsnd,msgrcv
  ipc/sem.c: update sem_otime for all operations
  mm/hwpoison: fix the lack of one reference count against poisoned page
  mm/hwpoison: fix false report on 2nd attempt at page recovery
  mm/hwpoison: fix test for a transparent huge page
  mm/hwpoison: fix traversal of hugetlbfs pages to avoid printk flood
  block: change config option name for cmdline partition parsing
  mm/mlock.c: prevent walking off the end of a pagetable in no-pmd configuration
  mm: avoid reinserting isolated balloon pages into LRU lists
  arch/parisc/mm/fault.c: fix uninitialized variable usage
  include/asm-generic/vtime.h: avoid zero-length file
  nilfs2: fix issue with race condition of competition between segments for dirty blocks
  Documentation/kernel-parameters.txt: replace kernelcore with Movable
  mm/bounce.c: fix a regression where MS_SNAP_STABLE (stable pages snapshotting) was ignored
  kernel/kmod.c: check for NULL in call_usermodehelper_exec()
  ipc/sem.c: synchronize the proc interface
  ipc/sem.c: optimize sem_lock()
  ipc/sem.c: fix race in sem_lock()
  mm/compaction.c: periodically schedule when freeing pages
  ...
2013-09-30 14:32:32 -07:00
Vyacheslav Dubeyko
7f42ec3941 nilfs2: fix issue with race condition of competition between segments for dirty blocks
Many NILFS2 users were reported about strange file system corruption
(for example):

   NILFS: bad btree node (blocknr=185027): level = 0, flags = 0x0, nchildren = 768
   NILFS error (device sda4): nilfs_bmap_last_key: broken bmap (inode number=11540)

But such error messages are consequence of file system's issue that takes
place more earlier.  Fortunately, Jerome Poulin <jeromepoulin@gmail.com>
and Anton Eliasson <devel@antoneliasson.se> were reported about another
issue not so recently.  These reports describe the issue with segctor
thread's crash:

  BUG: unable to handle kernel paging request at 0000000000004c83
  IP: nilfs_end_page_io+0x12/0xd0 [nilfs2]

  Call Trace:
   nilfs_segctor_do_construct+0xf25/0x1b20 [nilfs2]
   nilfs_segctor_construct+0x17b/0x290 [nilfs2]
   nilfs_segctor_thread+0x122/0x3b0 [nilfs2]
   kthread+0xc0/0xd0
   ret_from_fork+0x7c/0xb0

These two issues have one reason.  This reason can raise third issue
too.  Third issue results in hanging of segctor thread with eating of
100% CPU.

REPRODUCING PATH:

One of the possible way or the issue reproducing was described by
Jermoe me Poulin <jeromepoulin@gmail.com>:

1. init S to get to single user mode.
2. sysrq+E to make sure only my shell is running
3. start network-manager to get my wifi connection up
4. login as root and launch "screen"
5. cd /boot/log/nilfs which is a ext3 mount point and can log when NILFS dies.
6. lscp | xz -9e > lscp.txt.xz
7. mount my snapshot using mount -o cp=3360839,ro /dev/vgUbuntu/root /mnt/nilfs
8. start a screen to dump /proc/kmsg to text file since rsyslog is killed
9. start a screen and launch strace -f -o find-cat.log -t find
/mnt/nilfs -type f -exec cat {} > /dev/null \;
10. start a screen and launch strace -f -o apt-get.log -t apt-get update
11. launch the last command again as it did not crash the first time
12. apt-get crashes
13. ps aux > ps-aux-crashed.log
13. sysrq+W
14. sysrq+E  wait for everything to terminate
15. sysrq+SUSB

Simplified way of the issue reproducing is starting kernel compilation
task and "apt-get update" in parallel.

REPRODUCIBILITY:

The issue is reproduced not stable [60% - 80%].  It is very important to
have proper environment for the issue reproducing.  The critical
conditions for successful reproducing:

(1) It should have big modified file by mmap() way.

(2) This file should have the count of dirty blocks are greater that
    several segments in size (for example, two or three) from time to time
    during processing.

(3) It should be intensive background activity of files modification
    in another thread.

INVESTIGATION:

First of all, it is possible to see that the reason of crash is not valid
page address:

  NILFS [nilfs_segctor_complete_write]:2100 bh->b_count 0, bh->b_blocknr 13895680, bh->b_size 13897727, bh->b_page 0000000000001a82
  NILFS [nilfs_segctor_complete_write]:2101 segbuf->sb_segnum 6783

Moreover, value of b_page (0x1a82) is 6786.  This value looks like segment
number.  And b_blocknr with b_size values look like block numbers.  So,
buffer_head's pointer points on not proper address value.

Detailed investigation of the issue is discovered such picture:

  [-----------------------------SEGMENT 6783-------------------------------]
  NILFS [nilfs_segctor_do_construct]:2310 nilfs_segctor_begin_construction
  NILFS [nilfs_segctor_do_construct]:2321 nilfs_segctor_collect
  NILFS [nilfs_segctor_do_construct]:2336 nilfs_segctor_assign
  NILFS [nilfs_segctor_do_construct]:2367 nilfs_segctor_update_segusage
  NILFS [nilfs_segctor_do_construct]:2371 nilfs_segctor_prepare_write
  NILFS [nilfs_segctor_do_construct]:2376 nilfs_add_checksums_on_logs
  NILFS [nilfs_segctor_do_construct]:2381 nilfs_segctor_write
  NILFS [nilfs_segbuf_submit_bio]:464 bio->bi_sector 111149024, segbuf->sb_segnum 6783

  [-----------------------------SEGMENT 6784-------------------------------]
  NILFS [nilfs_segctor_do_construct]:2310 nilfs_segctor_begin_construction
  NILFS [nilfs_segctor_do_construct]:2321 nilfs_segctor_collect
  NILFS [nilfs_lookup_dirty_data_buffers]:782 bh->b_count 1, bh->b_page ffffea000709b000, page->index 0, i_ino 1033103, i_size 25165824
  NILFS [nilfs_lookup_dirty_data_buffers]:783 bh->b_assoc_buffers.next ffff8802174a6798, bh->b_assoc_buffers.prev ffff880221cffee8
  NILFS [nilfs_segctor_do_construct]:2336 nilfs_segctor_assign
  NILFS [nilfs_segctor_do_construct]:2367 nilfs_segctor_update_segusage
  NILFS [nilfs_segctor_do_construct]:2371 nilfs_segctor_prepare_write
  NILFS [nilfs_segctor_do_construct]:2376 nilfs_add_checksums_on_logs
  NILFS [nilfs_segctor_do_construct]:2381 nilfs_segctor_write
  NILFS [nilfs_segbuf_submit_bh]:575 bh->b_count 1, bh->b_page ffffea000709b000, page->index 0, i_ino 1033103, i_size 25165824
  NILFS [nilfs_segbuf_submit_bh]:576 segbuf->sb_segnum 6784
  NILFS [nilfs_segbuf_submit_bh]:577 bh->b_assoc_buffers.next ffff880218a0d5f8, bh->b_assoc_buffers.prev ffff880218bcdf50
  NILFS [nilfs_segbuf_submit_bio]:464 bio->bi_sector 111150080, segbuf->sb_segnum 6784, segbuf->sb_nbio 0
  [----------] ditto
  NILFS [nilfs_segbuf_submit_bio]:464 bio->bi_sector 111164416, segbuf->sb_segnum 6784, segbuf->sb_nbio 15

  [-----------------------------SEGMENT 6785-------------------------------]
  NILFS [nilfs_segctor_do_construct]:2310 nilfs_segctor_begin_construction
  NILFS [nilfs_segctor_do_construct]:2321 nilfs_segctor_collect
  NILFS [nilfs_lookup_dirty_data_buffers]:782 bh->b_count 2, bh->b_page ffffea000709b000, page->index 0, i_ino 1033103, i_size 25165824
  NILFS [nilfs_lookup_dirty_data_buffers]:783 bh->b_assoc_buffers.next ffff880219277e80, bh->b_assoc_buffers.prev ffff880221cffc88
  NILFS [nilfs_segctor_do_construct]:2367 nilfs_segctor_update_segusage
  NILFS [nilfs_segctor_do_construct]:2371 nilfs_segctor_prepare_write
  NILFS [nilfs_segctor_do_construct]:2376 nilfs_add_checksums_on_logs
  NILFS [nilfs_segctor_do_construct]:2381 nilfs_segctor_write
  NILFS [nilfs_segbuf_submit_bh]:575 bh->b_count 2, bh->b_page ffffea000709b000, page->index 0, i_ino 1033103, i_size 25165824
  NILFS [nilfs_segbuf_submit_bh]:576 segbuf->sb_segnum 6785
  NILFS [nilfs_segbuf_submit_bh]:577 bh->b_assoc_buffers.next ffff880218a0d5f8, bh->b_assoc_buffers.prev ffff880222cc7ee8
  NILFS [nilfs_segbuf_submit_bio]:464 bio->bi_sector 111165440, segbuf->sb_segnum 6785, segbuf->sb_nbio 0
  [----------] ditto
  NILFS [nilfs_segbuf_submit_bio]:464 bio->bi_sector 111177728, segbuf->sb_segnum 6785, segbuf->sb_nbio 12

  NILFS [nilfs_segctor_do_construct]:2399 nilfs_segctor_wait
  NILFS [nilfs_segbuf_wait]:676 segbuf->sb_segnum 6783
  NILFS [nilfs_segbuf_wait]:676 segbuf->sb_segnum 6784
  NILFS [nilfs_segbuf_wait]:676 segbuf->sb_segnum 6785

  NILFS [nilfs_segctor_complete_write]:2100 bh->b_count 0, bh->b_blocknr 13895680, bh->b_size 13897727, bh->b_page 0000000000001a82

  BUG: unable to handle kernel paging request at 0000000000001a82
  IP: [<ffffffffa024d0f2>] nilfs_end_page_io+0x12/0xd0 [nilfs2]

Usually, for every segment we collect dirty files in list.  Then, dirty
blocks are gathered for every dirty file, prepared for write and
submitted by means of nilfs_segbuf_submit_bh() call.  Finally, it takes
place complete write phase after calling nilfs_end_bio_write() on the
block layer.  Buffers/pages are marked as not dirty on final phase and
processed files removed from the list of dirty files.

It is possible to see that we had three prepare_write and submit_bio
phases before segbuf_wait and complete_write phase.  Moreover, segments
compete between each other for dirty blocks because on every iteration
of segments processing dirty buffer_heads are added in several lists of
payload_buffers:

  [SEGMENT 6784]: bh->b_assoc_buffers.next ffff880218a0d5f8, bh->b_assoc_buffers.prev ffff880218bcdf50
  [SEGMENT 6785]: bh->b_assoc_buffers.next ffff880218a0d5f8, bh->b_assoc_buffers.prev ffff880222cc7ee8

The next pointer is the same but prev pointer has changed.  It means
that buffer_head has next pointer from one list but prev pointer from
another.  Such modification can be made several times.  And, finally, it
can be resulted in various issues: (1) segctor hanging, (2) segctor
crashing, (3) file system metadata corruption.

FIX:
This patch adds:

(1) setting of BH_Async_Write flag in nilfs_segctor_prepare_write()
    for every proccessed dirty block;

(2) checking of BH_Async_Write flag in
    nilfs_lookup_dirty_data_buffers() and
    nilfs_lookup_dirty_node_buffers();

(3) clearing of BH_Async_Write flag in nilfs_segctor_complete_write(),
    nilfs_abort_logs(), nilfs_forget_buffer(), nilfs_clear_dirty_page().

Reported-by: Jerome Poulin <jeromepoulin@gmail.com>
Reported-by: Anton Eliasson <devel@antoneliasson.se>
Cc: Paul Fertser <fercerpav@gmail.com>
Cc: ARAI Shun-ichi <hermes@ceres.dti.ne.jp>
Cc: Piotr Szymaniak <szarpaj@grubelek.pl>
Cc: Juan Barry Manuel Canham <Linux@riotingpacifist.net>
Cc: Zahid Chowdhury <zahid.chowdhury@starsolutions.com>
Cc: Elmer Zhang <freeboy6716@gmail.com>
Cc: Kenneth Langga <klangga@gmail.com>
Signed-off-by: Vyacheslav Dubeyko <slava@dubeyko.com>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-30 14:31:02 -07:00
Dan Aloni
7202365696 fs/binfmt_elf.c: prevent a coredump with a large vm_map_count from Oopsing
A high setting of max_map_count, and a process core-dumping with a large
enough vm_map_count could result in an NT_FILE note not being written,
and the kernel crashing immediately later because it has assumed
otherwise.

Reproduction of the oops-causing bug described here:

    https://lkml.org/lkml/2013/8/30/50

Rge ussue originated in commit 2aa362c49c ("coredump: extend core dump
note section to contain file names of mapped file") from Oct 4, 2012.

This patch make that section optional in that case.  fill_files_note()
should signify the error, and also let the info struct in
elf_core_dump() be zero-initialized so that we can check for the
optionally written note.

[akpm@linux-foundation.org: avoid abusing E2BIG, remove a couple of not-really-needed local variables]
[akpm@linux-foundation.org: fix sparse warning]
Signed-off-by: Dan Aloni <alonid@stratoscale.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Denys Vlasenko <vda.linux@googlemail.com>
Reported-by: Martin MOKREJS <mmokrejs@gmail.com>
Tested-by: Martin MOKREJS <mmokrejs@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-30 14:31:01 -07:00
Al Viro
13f3583892 afs: dget_parent() can't return a negative dentry
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-29 22:02:24 -04:00
Al Viro
7b9a2378b4 ocfs2: needs ->d_lock to poke in ->d_parent->d_inode from ->d_revalidate()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-29 22:02:20 -04:00
Lubomir Rintel
4947555584 sysv: Add forgotten superblock lock init for v7 fs
Superblock lock was replaced with (un)lock_super() removal, but left
uninitialized for Seventh Edition UNIX filesystem in the following commit (3.7):
c07cb01 sysv: drop lock/unlock super

Signed-off-by: Lubomir Rintel <lkundrak@v3.sk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-29 22:02:02 -04:00
Greg Kroah-Hartman
88502b9c0a Merge 3.12-rc3 into driver-core-next
We want the driver core and sysfs fixes in here to make merges and
development easier.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-09-29 18:29:23 -07:00
Anna Schumaker
367156d9a8 NFS: Give "flavor" an initial value to fix a compile warning
The previous patch introduces a compile warning by not assigning an initial
value to the "flavor" variable.  This could only be a problem if the server
returns a supported secflavor list of length zero, but it's better to
fix this before it's ever hit.

Signed-off-by: Anna Schumaker <bjschuma@netapp.com>
Acked-by: Weston Andros Adamson <dros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-09-29 16:03:34 -04:00
Weston Andros Adamson
58a8cf1212 NFSv4.1: try SECINFO_NO_NAME flavs until one works
Call nfs4_lookup_root_sec for each flavor returned by SECINFO_NO_NAME until
one works.

One example of a situation this fixes:

 - server configured for krb5
 - server principal somehow gets deleted from KDC
 - server still thinking krb is good, sends krb5 as first entry in
    SECINFO_NO_NAME response
 - client tries krb5, but this fails without even sending an RPC because
    gssd's requests to the KDC can't find the server's principal

Signed-off-by: Weston Andros Adamson <dros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-09-29 16:03:34 -04:00
Trond Myklebust
acd65e5bc1 NFSv4.1: Ensure memory ordering between nfs4_ds_connect and nfs4_fl_prepare_ds
We need to ensure that the initialisation of the data server nfs_client
structure in nfs4_ds_connect is correctly ordered w.r.t. the read of
ds->ds_clp in nfs4_fl_prepare_ds.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-09-29 15:58:35 -04:00
Trond Myklebust
52b26a3e1b NFSv4.1: nfs4_fl_prepare_ds - fix bugs when the connect attempt fails
- Fix an Oops when nfs4_ds_connect() returns an error.
- Always check the device status after waiting for a connect to complete.

Reported-by: Andy Adamson <andros@netapp.com>
Reported-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: <stable@vger.kernel.org> # v3.10+
2013-09-29 15:56:35 -04:00
Linus Torvalds
ddd23eb182 xfs: bugfixes for 3.12-rc3
- fix for directory node collapse regression
 - fix for recovery over stale on disk structures
 - fix for eofblocks ioctl
 - fix asserts in xfs_inode_free
 - lock the ail before removing an item from it
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.10 (GNU/Linux)
 
 iQIcBAABAgAGBQJSRvPvAAoJENaLyazVq6ZOoXAP/3/AD1iuqGWBy2wIjISNJupu
 ST4gW5FXgBlG/sr1zGOA/L6VCdAaQMFSnlOnGpOjAyCH8VJ+XVb+4WCammzQ9CEu
 YJsbjlra52V3cOhGxDsuE9uEDIAqxnyiZndF/Pk1gnyGzpt5da3CxwI9UVuR8Dlt
 9O/dJXuKGdKgBtQKzOfrzIDXQZ2zE5NPvueHsDU6i6Hd7YECwG5j4fRqS8jf49jY
 sleZo0CVkYx1IcIB//oVXa+JbyelhiS5D8Ro3dUbW8lJTToHq9RaIiE/xrz2nz2c
 lUcaQdecxAnN4NKtHu/QR18u5HauA7FH26cV+PUGWWmbjHTP3oybsnHJWZr+08cX
 +NfC9dpLl3VTapVNCxVeuToE2DhgTQuWOODiIFSi+Ljt3P6zfIKKBMqwJ6FceaJD
 4afVT2GDEtvZuFbfhyvXUzP9bm0PLoZRI68bLLi772W2Zc8aiprJyKAcm/GtlxTk
 biluUwoGn+zD60u0GwAtlQ2g+/jTReoeAjez+LW1dWNrUwfVdnZh8xlFGEXEd/dS
 scx6BHSlJtwVKEWWYMO+JLHJ077yNR5RPoRFFokS1XGOSGiEuuSkAkS0eAcC5WCX
 egfGHzkymOOQhcXb/qPjpCVBbHEnCuYw6b24eNyceyqfE8o6iZmsUcF3kOzsa8Up
 fOVpOP7UovMR5teKQxAa
 =AUDD
 -----END PGP SIGNATURE-----

Merge tag 'xfs-for-linus-v3.12-rc3' of git://oss.sgi.com/xfs/xfs

Pull xfs bugfixes from Ben Myers:
 - fix for directory node collapse regression
 - fix for recovery over stale on disk structures
 - fix for eofblocks ioctl
 - fix asserts in xfs_inode_free
 - lock the ail before removing an item from it

* tag 'xfs-for-linus-v3.12-rc3' of git://oss.sgi.com/xfs/xfs:
  xfs: fix node forward in xfs_node_toosmall
  xfs: log recovery lsn ordering needs uuid check
  xfs: fix XFS_IOC_FREE_EOFBLOCKS definition
  xfs: asserting lock not held during freeing not valid
  xfs: lock the AIL before removing the buffer item
2013-09-28 13:52:05 -07:00
David Howells
f1fe29b4a0 NFS: Use i_writecount to control whether to get an fscache cookie in nfs_open()
Use i_writecount to control whether to get an fscache cookie in nfs_open() as
NFS does not do write caching yet.  I *think* this is the cause of a problem
encountered by Mark Moseley whereby __fscache_uncache_page() gets a NULL
pointer dereference because cookie->def is NULL:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
IP: [<ffffffff812a1903>] __fscache_uncache_page+0x23/0x160
PGD 0
Thread overran stack, or stack corrupted
Oops: 0000 [#1] SMP
Modules linked in: ...
CPU: 7 PID: 18993 Comm: php Not tainted 3.11.1 #1
Hardware name: Dell Inc. PowerEdge R420/072XWF, BIOS 1.3.5 08/21/2012
task: ffff8804203460c0 ti: ffff880420346640
RIP: 0010:[<ffffffff812a1903>] __fscache_uncache_page+0x23/0x160
RSP: 0018:ffff8801053af878 EFLAGS: 00210286
RAX: 0000000000000000 RBX: ffff8800be2f8780 RCX: ffff88022ffae5e8
RDX: 0000000000004c66 RSI: ffffea00055ff440 RDI: ffff8800be2f8780
RBP: ffff8801053af898 R08: 0000000000000001 R09: 0000000000000003
R10: 0000000000000000 R11: 0000000000000000 R12: ffffea00055ff440
R13: 0000000000001000 R14: ffff8800c50be538 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff88042fc60000(0063) knlGS:00000000e439c700
CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
CR2: 0000000000000010 CR3: 0000000001d8f000 CR4: 00000000000607f0
Stack:
...
Call Trace:
[<ffffffff81365a72>] __nfs_fscache_invalidate_page+0x42/0x70
[<ffffffff813553d5>] nfs_invalidate_page+0x75/0x90
[<ffffffff811b8f5e>] truncate_inode_page+0x8e/0x90
[<ffffffff811b90ad>] truncate_inode_pages_range.part.12+0x14d/0x620
[<ffffffff81d6387d>] ? __mutex_lock_slowpath+0x1fd/0x2e0
[<ffffffff811b95d3>] truncate_inode_pages_range+0x53/0x70
[<ffffffff811b969d>] truncate_inode_pages+0x2d/0x40
[<ffffffff811b96ff>] truncate_pagecache+0x4f/0x70
[<ffffffff81356840>] nfs_setattr_update_inode+0xa0/0x120
[<ffffffff81368de4>] nfs3_proc_setattr+0xc4/0xe0
[<ffffffff81357f78>] nfs_setattr+0xc8/0x150
[<ffffffff8122d95b>] notify_change+0x1cb/0x390
[<ffffffff8120a55b>] do_truncate+0x7b/0xc0
[<ffffffff8121f96c>] do_last+0xa4c/0xfd0
[<ffffffff8121ffbc>] path_openat+0xcc/0x670
[<ffffffff81220a0e>] do_filp_open+0x4e/0xb0
[<ffffffff8120ba1f>] do_sys_open+0x13f/0x2b0
[<ffffffff8126aaf6>] compat_SyS_open+0x36/0x50
[<ffffffff81d7204c>] sysenter_dispatch+0x7/0x24

The code at the instruction pointer was disassembled:

> (gdb) disas __fscache_uncache_page
> Dump of assembler code for function __fscache_uncache_page:
> ...
> 0xffffffff812a18ff <+31>: mov 0x48(%rbx),%rax
> 0xffffffff812a1903 <+35>: cmpb $0x0,0x10(%rax)
> 0xffffffff812a1907 <+39>: je 0xffffffff812a19cd <__fscache_uncache_page+237>

These instructions make up:

	ASSERTCMP(cookie->def->type, !=, FSCACHE_COOKIE_TYPE_INDEX);

That cmpb is the faulting instruction (%rax is 0).  So cookie->def is NULL -
which presumably means that the cookie has already been at least partway
through __fscache_relinquish_cookie().

What I think may be happening is something like a three-way race on the same
file:

	PROCESS 1	PROCESS 2	PROCESS 3
	===============	===============	===============
	open(O_TRUNC|O_WRONLY)
			open(O_RDONLY)
					open(O_WRONLY)
	-->nfs_open()
	-->nfs_fscache_set_inode_cookie()
	nfs_fscache_inode_lock()
	nfs_fscache_disable_inode_cookie()
	__fscache_relinquish_cookie()
	nfs_inode->fscache = NULL
	<--nfs_fscache_set_inode_cookie()

			-->nfs_open()
			-->nfs_fscache_set_inode_cookie()
			nfs_fscache_inode_lock()
			nfs_fscache_enable_inode_cookie()
			__fscache_acquire_cookie()
			nfs_inode->fscache = cookie
			<--nfs_fscache_set_inode_cookie()
	<--nfs_open()
	-->nfs_setattr()
	...
	...
	-->nfs_invalidate_page()
	-->__nfs_fscache_invalidate_page()
	cookie = nfsi->fscache
					-->nfs_open()
					-->nfs_fscache_set_inode_cookie()
					nfs_fscache_inode_lock()
					nfs_fscache_disable_inode_cookie()
					-->__fscache_relinquish_cookie()
	-->__fscache_uncache_page(cookie)
	<crash>
					<--__fscache_relinquish_cookie()
					nfs_inode->fscache = NULL
					<--nfs_fscache_set_inode_cookie()

What is needed is something to prevent process #2 from reacquiring the cookie
- and I think checking i_writecount should do the trick.

It's also possible to have a two-way race on this if the file is opened
O_TRUNC|O_RDONLY instead.

Reported-by: Mark Moseley <moseleymark@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2013-09-27 18:40:25 +01:00
David Howells
94d30ae90a FS-Cache: Provide the ability to enable/disable cookies
Provide the ability to enable and disable fscache cookies.  A disabled cookie
will reject or ignore further requests to:

	Acquire a child cookie
	Invalidate and update backing objects
	Check the consistency of a backing object
	Allocate storage for backing page
	Read backing pages
	Write to backing pages

but still allows:

	Checks/waits on the completion of already in-progress objects
	Uncaching of pages
	Relinquishment of cookies

Two new operations are provided:

 (1) Disable a cookie:

	void fscache_disable_cookie(struct fscache_cookie *cookie,
				    bool invalidate);

     If the cookie is not already disabled, this locks the cookie against other
     dis/enablement ops, marks the cookie as being disabled, discards or
     invalidates any backing objects and waits for cessation of activity on any
     associated object.

     This is a wrapper around a chunk split out of fscache_relinquish_cookie(),
     but it reinitialises the cookie such that it can be reenabled.

     All possible failures are handled internally.  The caller should consider
     calling fscache_uncache_all_inode_pages() afterwards to make sure all page
     markings are cleared up.

 (2) Enable a cookie:

	void fscache_enable_cookie(struct fscache_cookie *cookie,
				   bool (*can_enable)(void *data),
				   void *data)

     If the cookie is not already enabled, this locks the cookie against other
     dis/enablement ops, invokes can_enable() and, if the cookie is not an
     index cookie, will begin the procedure of acquiring backing objects.

     The optional can_enable() function is passed the data argument and returns
     a ruling as to whether or not enablement should actually be permitted to
     begin.

     All possible failures are handled internally.  The cookie will only be
     marked as enabled if provisional backing objects are allocated.

A later patch will introduce these to NFS.  Cookie enablement during nfs_open()
is then contingent on i_writecount <= 0.  can_enable() checks for a race
between open(O_RDONLY) and open(O_WRONLY/O_RDWR).  This simplifies NFS's cookie
handling and allows us to get rid of open(O_RDONLY) accidentally introducing
caching to an inode that's open for writing already.

One operation has its API modified:

 (3) Acquire a cookie.

	struct fscache_cookie *fscache_acquire_cookie(
		struct fscache_cookie *parent,
		const struct fscache_cookie_def *def,
		void *netfs_data,
		bool enable);

     This now has an additional argument that indicates whether the requested
     cookie should be enabled by default.  It doesn't need the can_enable()
     function because the caller must prevent multiple calls for the same netfs
     object and it doesn't need to take the enablement lock because no one else
     can get at the cookie before this returns.

Signed-off-by: David Howells <dhowells@redhat.com
2013-09-27 18:40:25 +01:00
David Howells
8fb883f3e3 FS-Cache: Add use/unuse/wake cookie wrappers
Add wrapper functions for dealing with cookie->n_active:

 (*) __fscache_use_cookie() to increment it.

 (*) __fscache_unuse_cookie() to decrement and test against zero.

 (*) __fscache_wake_unused_cookie() to wake up anyone waiting for it to reach
     zero.

The second and third are split so that the third can be done after cookie->lock
has been released in case the waiter wakes up whilst we're still holding it and
tries to get it.

We will need to wake-on-zero once the cookie disablement patch is applied
because it will then be possible to see n_active become zero without the cookie
being relinquished.

Also move the cookie usement out of fscache_attr_changed_op() and into
fscache_attr_changed() and the operation struct so that cookie disablement
will be able to track it.

Whilst we're at it, only increment n_active if we're about to do
fscache_submit_op() so that we don't have to deal with undoing it if anything
earlier fails.  Possibly this should be moved into fscache_submit_op() which
could look at FSCACHE_OP_UNUSE_COOKIE.

Signed-off-by: David Howells <dhowells@redhat.com>
2013-09-27 18:40:25 +01:00
Linus Torvalds
e1f8826f51 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull reiserfs and UDF fixes from Jan Kara:
 "The contains fix of an UDF oops when mounting corrupted media and a
  fix of a race in reiserfs leading to oops"

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  reiserfs: fix race with flush_used_journal_lists and flush_journal_list
  reiserfs: remove useless flush_old_journal_lists
  udf: Fortify LVID loading
2013-09-27 09:31:09 -07:00
Steven Whitehouse
af5c269799 GFS2: Clean up reservation removal
The reservation for an inode should be cleared when it is truncated so
that we can start again at a different offset for future allocations.
We could try and do better than that, by resetting the search based on
where the truncation started from, but this is only a first step.

In addition, there are three callers of gfs2_rs_delete() but only one
of those should really be testing the value of i_writecount. While
we get away with that in the other cases currently, I think it would
be better if we made that test specific to the one case which
requires it.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-09-27 12:49:33 +01:00
Benjamin LaHaise
5e9ae2e5da aio: fix use-after-free in aio_migratepage
Dmitry Vyukov managed to trigger a case where aio_migratepage can cause a
use-after-free during teardown of the aio ring buffer's mapping.  This turns
out to be caused by access to the ioctx's ring_pages via the migratepage
operation which was not being protected by any locks during ioctx freeing.
Use the address_space's private_lock to protect use and updates of the mapping's
private_data, and make ioctx teardown unlink the ioctx from the address space.

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Tested-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
2013-09-26 20:34:51 -04:00
Tejun Heo
cfec0bc835 sysfs: @name comes before @ns
Some internal sysfs functions which take explicit namespace argument
are weird in that they place the optional @ns in front of @name which
is contrary to the established convention.  This is confusing and
error-prone especially as @ns and @name may be interchanged without
causing compilation warning.

Swap the positions of @name and @ns in the following internal
functions.

 sysfs_find_dirent()
 sysfs_rename()
 sysfs_hash_and_remove()
 sysfs_name_hash()
 sysfs_name_compare()
 create_dir()

This patch doesn't introduce any functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Kay Sievers <kay@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-09-26 15:34:38 -07:00
Tejun Heo
388975ccca sysfs: clean up sysfs_get_dirent()
The pre-existing sysfs interfaces which take explicit namespace
argument are weird in that they place the optional @ns in front of
@name which is contrary to the established convention.  For example,
we end up forcing vast majority of sysfs_get_dirent() users to do
sysfs_get_dirent(parent, NULL, name), which is silly and error-prone
especially as @ns and @name may be interchanged without causing
compilation warning.

This renames sysfs_get_dirent() to sysfs_get_dirent_ns() and swap the
positions of @name and @ns, and sysfs_get_dirent() is now a wrapper
around sysfs_get_dirent_ns().  This makes confusions a lot less
likely.

There are other interfaces which take @ns before @name.  They'll be
updated by following patches.

This patch doesn't introduce any functional changes.

v2: EXPORT_SYMBOL_GPL() wasn't updated leading to undefined symbol
    error on module builds.  Reported by build test robot.  Fixed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Kay Sievers <kay@vrfy.org>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-09-26 15:33:18 -07:00
Tejun Heo
cb26a31157 sysfs: drop kobj_ns_type handling
The way namespace tags are implemented in sysfs is more complicated
than necessary.  As each tag is a pointer value and required to be
non-NULL under a namespace enabled parent, there's no need to record
separately what type each tag is or where namespace is enabled.

If multiple namespace types are needed, which currently aren't, we can
simply compare the tag to a set of allowed tags in the superblock
assuming that the tags, being pointers, won't have the same value
across multiple types.  Also, whether to filter by namespace tag or
not can be trivially determined by whether the node has any tagged
children or not.

This patch rips out kobj_ns_type handling from sysfs.  sysfs no longer
cares whether specific type of namespace is enabled or not.  If a
sysfs_dirent has a non-NULL tag, the parent is marked as needing
namespace filtering and the value is tested against the allowed set of
tags for the superblock (currently only one but increasing this number
isn't difficult) and the sysfs_dirent is ignored if it doesn't match.

This removes most kobject namespace knowledge from sysfs proper which
will enable proper separation and layering of sysfs.  The namespace
sanity checks in fs/sysfs/dir.c are replaced by the new sanity check
in kobject_namespace().  As this is the only place ktype->namespace()
is called for sysfs, this doesn't weaken the sanity check
significantly.  I omitted converting the sanity check in
sysfs_do_create_link_sd().  While the check can be shifted to upper
layer, mistakes there are well contained and should be easily visible
anyway.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Kay Sievers <kay@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-09-26 15:30:22 -07:00
Tejun Heo
4b30ee58ee sysfs: remove ktype->namespace() invocations in symlink code
There's no reason for sysfs to be calling ktype->namespace().  It is
backwards, obfuscates what's going on and unnecessarily tangles two
separate layers.

There are two places where symlink code calls ktype->namespace().

* sysfs_do_create_link_sd() calls it to find out the namespace tag of
  the target directory.  Unless symlinking races with cross-namespace
  renaming, this equals @target_sd->s_ns.

* sysfs_rename_link() uses it to find out the new namespace to rename
  to and the new namespace can be different from the existing one.
  The function is renamed to sysfs_rename_link_ns() with an explicit
  @ns argument and the ktype->namespace() invocation is shifted to the
  device layer.

While this patch replaces ktype->namespace() invocation with the
recorded result in @target_sd, this shouldn't result in any behvior
difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Kay Sievers <kay@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-09-26 15:30:22 -07:00
Tejun Heo
e34ff49061 sysfs: remove ktype->namespace() invocations in directory code
For some unrecognizable reason, namespace information is communicated
to sysfs through ktype->namespace() callback when there's *nothing*
which needs the use of a callback.  The whole sequence of operations
is completely synchronous and sysfs operations simply end up calling
back into the layer which just invoked it in order to find out the
namespace information, which is completely backwards, obfuscates
what's going on and unnecessarily tangles two separate layers.

This patch doesn't remove ktype->namespace() but shifts its handling
to kobject layer.  We probably want to get rid of the callback in the
long term.

This patch adds an explicit param to sysfs_{create|rename|move}_dir()
and renames them to sysfs_{create|rename|move}_dir_ns(), respectively.
ktype->namespace() invocations are moved to the calling sites of the
above functions.  A new helper kboject_namespace() is introduced which
directly tests kobj_ns_type_operations->type which should give the
same result as testing sysfs_fs_type(parent_sd) and returns @kobj's
namespace tag as necessary.  kobject_namespace() is extern as it will
be used from another file in the following patches.

This patch should be an equivalent conversion without any functional
difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Kay Sievers <kay@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-09-26 15:30:22 -07:00
Tejun Heo
58292cbe66 sysfs: make attr namespace interface less convoluted
sysfs ns (namespace) implementation became more convoluted than
necessary while trying to hide ns information from visible interface.
The relatively recent attr ns support is a good example.

* attr ns tag is determined by sysfs_ops->namespace() callback while
  dir tag is determined by kobj_type->namespace().  The placement is
  arbitrary.

* Instead of performing operations with explicit ns tag, the namespace
  callback is routed through sysfs_attr_ns(), sysfs_ops->namespace(),
  class_attr_namespace(), class_attr->namespace().  It's not simpler
  in any sense.  The only thing this convolution does is traversing
  the whole stack backwards.

The namespace callbacks are unncessary because the operations involved
are inherently synchronous.  The information can be provided in in
straight-forward top-down direction and reversing that direction is
unnecessary and against basic design principles.

This backward interface is unnecessarily convoluted and hinders
properly separating out sysfs from driver model / kobject for proper
layering.  This patch updates attr ns support such that

* sysfs_ops->namespace() and class_attr->namespace() are dropped.

* sysfs_{create|remove}_file_ns(), which take explicit @ns param, are
  added and sysfs_{create|remove}_file() are now simple wrappers
  around the ns aware functions.

* ns handling is dropped from sysfs_chmod_file().  Nobody uses it at
  this point.  sysfs_chmod_file_ns() can be added later if necessary.

* Explicit @ns is propagated through class_{create|remove}_file_ns()
  and netdev_class_{create|remove}_file_ns().

* driver/net/bonding which is currently the only user of attr
  namespace is updated to use netdev_class_{create|remove}_file_ns()
  with @bh->net as the ns tag instead of using the namespace callback.

This patch should be an equivalent conversion without any functional
difference.  It makes the code easier to follow, reduces lines of code
a bit and helps proper separation and layering.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Kay Sievers <kay@vrfy.org>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-09-26 14:50:01 -07:00
Tejun Heo
bcac3769ca sysfs: drop semicolon from to_sysfs_dirent() definition
The expansion of to_sysfs_dirent() contains an unncessary trailing
semicolon making it impossible to use in the middle of statements.
Drop it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-09-26 14:48:28 -07:00
Mark Tinguely
997def25e4 xfs: fix node forward in xfs_node_toosmall
Commit f5ea1100 cleans up the disk to host conversions for
node directory entries, but because a variable is reused in
xfs_node_toosmall() the next node is not correctly found.
If the original node is small enough (<= 3/8 of the node size),
this change may incorrectly cause a node collapse when it should
not. That will cause an assert in xfstest generic/319:

   Assertion failed: first <= last && last < BBTOB(bp->b_length),
   file: /root/newest/xfs/fs/xfs/xfs_trans_buf.c, line: 569

Keep the original node header to get the correct forward node.

(When a node is considered for a merge with a sibling, it overwrites the
 sibling pointers of the original incore nodehdr with the sibling's
 pointers.  This leads to loop considering the original node as a merge
 candidate with itself in the second pass, and so it incorrectly
 determines a merge should occur.)

Signed-off-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

[v3: added Dave Chinner's (slightly modified) suggestion to the commit header,
	cleaned up whitespace.  -bpm]
2013-09-26 10:38:17 -05:00
Trond Myklebust
5bc2afc2b5 NFSv4: Honour the 'opened' parameter in the atomic_open() filesystem method
Determine if we've created a new file by examining the directory change
attribute and/or the O_EXCL flag.

This fixes a regression when doing a non-exclusive create of a new file.
If the FILE_CREATED flag is not set, the atomic_open() command will
perform full file access permissions checks instead of just checking
for MAY_OPEN.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-09-26 10:20:18 -04:00
Milosz Tanski
ffc79664d1 ceph: hung on ceph fscache invalidate in some cases
In some cases I'm on my ceph client cluster I'm seeing hunk kernel tasks in
the invalidate page code path. This is due to the fact that we don't check if
the page is marked as cache before calling fscache_wait_on_page_write().

This is the log from the hang

INFO: task XXXXXX:12034 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 ...
Call Trace:
[<ffffffff81568d09>] schedule+0x29/0x70
[<ffffffffa01d4cbd>] __fscache_wait_on_page_write+0x6d/0xb0 [fscache]
[<ffffffff81083520>] ? add_wait_queue+0x60/0x60
[<ffffffffa029a3e9>] ceph_invalidate_fscache_page+0x29/0x50 [ceph]
[<ffffffffa027df00>] ceph_invalidatepage+0x70/0x190 [ceph]
[<ffffffff8112656f>] ? delete_from_page_cache+0x5f/0x70
[<ffffffff81133cab>] truncate_inode_page+0x8b/0x90
[<ffffffff81133ded>] truncate_inode_pages_range.part.12+0x13d/0x620
[<ffffffff8113431d>] truncate_inode_pages_range+0x4d/0x60
[<ffffffff811343b5>] truncate_inode_pages+0x15/0x20
[<ffffffff8119bbf6>] evict+0x1a6/0x1b0
[<ffffffff8119c3f3>] iput+0x103/0x190
 ...

Signed-off-by: Milosz Tanski <milosz@adfin.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-09-25 18:20:14 -07:00
Steve French
ffe67b5859 [CIFS] update cifs.ko version
To 2.02

Signed-off-by: Steve French <smfrench@gmail.com>
2013-09-25 19:01:27 -05:00
Steve French
05c715f2a9 [CIFS] Remove ext2 flags that have been moved to fs.h
These flags were unused by cifs and since the EXT flags have
been moved to common code in uapi/linux/fs.h we won't need
to have a cifs specific copy.

Signed-off-by: Steve French <smfrench@gmail.com>
2013-09-25 18:58:13 -05:00
Russ W. Knize
2e5558f4a5 f2fs: account for orphan inodes during recovery
During recovery, orphan inodes are deleted via truncate_hole().
These orphans are added by recover_dentry() via f2fs_delete_entry().
However, f2fs_delete_entry() adds them via add_orphan_inode()
without calling acquire_orphan_inode() first.  This prevents the
counters from being incremented properly, which causes them to
underflow when remove_orphan_inode() is called later on.

Signed-off-by: Russ Knize <rknize@motorola.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-09-25 17:59:32 +09:00
Russ Knize
52ab956000 f2fs: don't GC or take an fs_lock from f2fs_initxattrs()
f2fs_initxattrs() is called internally from within F2FS and should
not call functions that are used by VFS handlers.  This avoids
certain deadlocks:

- vfs_create()
 - f2fs_create() <-- takes an fs_lock
  - f2fs_add_link()
   - __f2fs_add_link()
    - init_inode_metadata()
     - f2fs_init_security()
      - security_inode_init_security()
       - f2fs_initxattrs()
        - f2fs_setxattr() <-- also takes an fs_lock

If the caller happens to grab the same fs_lock from the pool in both
places, they will deadlock.  There are also deadlocks involving
multiple threads and mutexes:

- f2fs_write_begin()
 - f2fs_balance_fs() <-- takes gc_mutex
  - f2fs_gc()
   - write_checkpoint()
    - block_operations()
     - mutex_lock_all() <-- blocks trying to grab all fs_locks

- f2fs_mkdir() <-- takes an fs_lock
 - __f2fs_add_link()
  - f2fs_init_security()
   - security_inode_init_security()
    - f2fs_initxattrs()
     - f2fs_setxattr()
      - f2fs_balance_fs() <-- blocks trying to take gc_mutex

Signed-off-by: Russ Knize <Russ.Knize@motorola.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-09-25 17:51:24 +09:00
Russ W. Knize
885166c03c f2fs: don't let the orphan inode counter underflow
Accounting errors from buggy code calling the acquire/release/remove
orphan inode interfaces can cause n_orphans to underflow, which will
then cause acquire_orphan_inode() to return -ENOSPC on the next
operation.  This commit guards against that condition.

Signed-off-by: Russ Knize <rknize@motorola.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-09-25 17:49:12 +09:00
Chao Yu
691c6fd2a2 f2fs: remove unneeded write checkpoint in recover_fsync_data
Previously, recover_fsync_data still to write checkpoint when there is
nothing to recover with normal umount image.
It may reduce mount performance and flash memory lifetime, so let's remove
it.

Signed-off-by: Tan Shu <shu.tan@samsung.com>
Signed-off-by: Yu Chao <chao2.yu@samsung.com>
Reviewed-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-09-25 17:18:16 +09:00
Linus Torvalds
a153e67bda Merge branch 'akpm' (patches from Andrew Morton)
Merge fixes from Andrew Morton:
 "Bunch of fixes.

  And a reversion of mhocko's "Soft limit rework" patch series.  This is
  actually your fault for opening the merge window when I was off racing ;)

  I didn't read the email thread before sending everything off.
  Johannes Weiner raised significant issues:

    http://www.spinics.net/lists/cgroups/msg08813.html

  and we agreed to back it all out"

I clearly need to be more aware of Andrew's racing schedule.

* akpm:
  MAINTAINERS: update mach-bcm related email address
  checkpatch: make extern in .h prototypes quieter
  cciss: fix info leak in cciss_ioctl32_passthru()
  cpqarray: fix info leak in ida_locked_ioctl()
  kernel/reboot.c: re-enable the function of variable reboot_default
  audit: fix endless wait in audit_log_start()
  revert "memcg, vmscan: integrate soft reclaim tighter with zone shrinking code"
  revert "memcg: get rid of soft-limit tree infrastructure"
  revert "vmscan, memcg: do softlimit reclaim also for targeted reclaim"
  revert "memcg: enhance memcg iterator to support predicates"
  revert "memcg: track children in soft limit excess to improve soft limit"
  revert "memcg, vmscan: do not attempt soft limit reclaim if it would not scan anything"
  revert "memcg: track all children over limit in the root"
  revert "memcg, vmscan: do not fall into reclaim-all pass too quickly"
  fs/ocfs2/super.c: use a bigger nodestr in ocfs2_dismount_volume
  watchdog: update watchdog_thresh properly
  watchdog: update watchdog attributes atomically
2013-09-24 17:00:35 -07:00
Goldwyn Rodrigues
99d7a8824a fs/ocfs2/super.c: use a bigger nodestr in ocfs2_dismount_volume
While printing 32-bit node numbers, an 8-byte string is not enough.
Increase the size of the string to 12 chars.

This got left out in commit 49fa8140e4 ("fs/ocfs2/super.c: Use bigger
nodestr to accomodate 32-bit node numbers").

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-24 17:00:25 -07:00
Kent Overstreet
2f6cf0de02 block: Fix bio_copy_data()
The memcpy() in bio_copy_data() was using the wrong offset vars, leading
to data corruption in weird unusual setups.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.9
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-24 14:41:42 -07:00
Dave Chinner
566055d33a xfs: log recovery lsn ordering needs uuid check
After a fair number of xfstests runs, xfs/182 started to fail
regularly with a corrupted directory - a directory read verifier was
failing after recovery because it found a block with a XARM magic
number (remote attribute block) rather than a directory data block.

The first time I saw this repeated failure I did /something/ and the
problem went away, so I was never able to find the underlying
problem. Test xfs/182 failed again today, and I found the root
cause before I did /something else/ that made it go away.

Tracing indicated that the block in question was being correctly
logged, the log was being flushed by sync, but the buffer was not
being written back before the shutdown occurred. Tracing also
indicated that log recovery was also reading the block, but then
never writing it before log recovery invalidated the cache,
indicating that it was not modified by log recovery.

More detailed analysis of the corpse indicated that the filesystem
had a uuid of "a4131074-1872-4cac-9323-2229adbcb886" but the XARM
block had a uuid of "8f32f043-c3c9-e7f8-f947-4e7f989c05d3", which
indicated it was a block from an older filesystem. The reason that
log recovery didn't replay it was that the LSN in the XARM block was
larger than the LSN of the transaction being replayed, and so the
block was not overwritten by log recovery.

Hence, log recovery cant blindly trust the magic number and LSN in
the block - it must verify that it belongs to the filesystem being
recovered before using the LSN. i.e. if the UUIDs don't match, we
need to unconditionally recovery the change held in the log.

This patch was first tested on a block device that was repeatedly
causing xfs/182 to fail with the same failure on the same block with
the same directory read corruption signature (i.e. XARM block). It
did not fail, and hasn't failed since.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-09-24 12:35:57 -05:00
Dave Chinner
b771af2fcb xfs: fix XFS_IOC_FREE_EOFBLOCKS definition
It uses a kernel internal structure in it's definition rather than
the user visible structure that is passed to the ioctl.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-09-24 12:35:08 -05:00
Dave Chinner
b313a5f1cb xfs: asserting lock not held during freeing not valid
When we free an inode, we do so via RCU. As an RCU lookup can occur
at any time before we free an inode, and that lookup takes the inode
flags lock, we cannot safely assert that the flags lock is not held
just before marking it dead and running call_rcu() to free the
inode.

We check on allocation of a new inode structre that the lock is not
held, so we still have protection against locks being leaked and
hence not correctly initialised when allocated out of the slab.
Hence just remove the assert...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-09-24 12:32:57 -05:00
Dave Chinner
4885235806 xfs: lock the AIL before removing the buffer item
Regression introduced by commit 46f9d2e ("xfs: aborted buf items can
be in the AIL") which fails to lock the AIL before removing the
item. Spinlock debugging throws a warning about this.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-09-24 12:31:41 -05:00
Jeff Mahoney
721a769c03 reiserfs: fix race with flush_used_journal_lists and flush_journal_list
There are two locks involved in managing the journal lists. The general
reiserfs_write_lock and the journal->j_flush_mutex.

While flush_journal_list is sleeping to acquire the j_flush_mutex or to
submit a block for write, it will drop the write lock. This allows
another thread to acquire the write lock and ultimately call
flush_used_journal_lists to traverse the list of journal lists and
select one for flushing. It can select the journal_list that has just
had flush_journal_list called on it in the original thread and call it
again with the same journal_list.

The second thread then drops the write lock to acquire j_flush_mutex and
the first thread reacquires it and continues execution and eventually
clears and frees the journal list before dropping j_flush_mutex and
returning.

The second thread acquires j_flush_mutex and ends up operating on a
journal_list that has already been released. If the memory hasn't
been reused, we'll soon after hit a BUG_ON because the transaction id
has already been cleared. If it's been reused, we'll crash in other
fun ways.

Since flush_journal_list will synchronize on j_flush_mutex, we can fix
the race by taking a proper reference in flush_used_journal_lists
and checking to see if it's still valid after the mutex is taken. It's
safe to iterate the list of journal lists and pick a list with
just the write lock as long as a reference is taken on the journal list
before we drop the lock. We already have code to handle whether a
transaction has been flushed already so we can use that to handle the
race and get rid of the trans_id BUG_ON.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2013-09-24 11:24:21 +02:00
Jeff Mahoney
7bc9cc07ee reiserfs: remove useless flush_old_journal_lists
Commit a3172027 introduced test_transaction as a requirement for
flushing old lists -- but it can never return 1 unless the transaction
has already been flushed.

As a result, we have a routine that iterates the j_realblocks list but
doesn't actually do anything. Since it's been this way since 2006 and
the latency numbers were what Chris expected, let's just rip it out.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2013-09-24 11:24:21 +02:00
Jan Kara
69d75671d9 udf: Fortify LVID loading
A user has reported an oops in udf_statfs() that was caused by
numOfPartitions entry in LVID structure being corrupted. Fix the problem
by verifying whether numOfPartitions makes sense at least to the extent
that LVID fits into a single block as it should.

Reported-by: Juergen Weigert <jw@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2013-09-24 11:23:33 +02:00
Chao Yu
cc7b1bb173 f2fs: avoid allocating failure in bio_alloc
This patch add macro MAX_BIO_BLOCKS to limit value of npages in
f2fs_bio_alloc, it can avoid allocating failure in bio_alloc caused by
npages is larger than BIO_MAX_PAGES.

Signed-off-by: Yu Chao <chao2.yu@samsung.com>
Reviewed-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-09-24 17:45:48 +09:00
Jin Xu
a57e564d14 f2fs: optimize the victim searching loop slightly
Since the MAX_VICTIM_SEARCH has been enlarged from 20 to 4096,
the victim searching overhead will be increased much than before,
especially for SSR that searches victim for use quiet often.
This patch intends to reduce the overhead a little bit by:
- make the get_gc_cost a inline routine to reduce function call
  overhead
- reduce multiplication and division operations
- reduce unnecessary comparison operation

Signed-off-by: Jin Xu <jinuxstyle@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-09-24 17:45:48 +09:00
Yu Chao
e76eebee70 f2fs: optimize fs_lock for better performance
There is a performance problem: when all sbi->fs_lock are holded, then
all the following threads may get the same next_lock value from sbi->next_lock_num
in function mutex_lock_op, and wait for the same lock(fs_lock[next_lock]),
it may cause performance reduce.
So we move the sbi->next_lock_num++ before getting lock, this will average the
following threads if all sbi->fs_lock are holded.

v1-->v2:
	Drop the needless spin_lock as Jaegeuk suggested.

Suggested-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Signed-off-by: Yu Chao <chao2.yu@samsung.com>
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-09-24 17:45:48 +09:00
Miklos Szeredi
5ca1db41ec GFS2: fix dentry leaks
We need to dput() the result of d_splice_alias(), unless it is passed to
finish_no_open().

Edited by Steven Whitehouse in order to make it apply to the current
GFS2 git tree, and taking account of a prerequisite patch which hasn't
been applied.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: stable@vger.kernel.org
2013-09-23 13:30:57 +01:00