Commit graph

30852 commits

Author SHA1 Message Date
Linus Torvalds
850cb82b75 dlm for 3.9
This includes a single patch to avoid excessive and
 unnecessary scanning of rsbs to free.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJREX6ZAAoJEDgbc8f8gGmqDJ4QAKFuq+vwBb+21lNVDk1egBTf
 nDcpuntzZfnqgg+VKC5LJm0bBAvgexH2GrQMIPDFPPMgAOFjTxXDvm93wVGeAK6d
 JpkiZxW1a0lyjR+7NMvDFnoIdVJvKw+GezNITiVo/lMghl51NWOQTH8QmSBKdEdJ
 cxwzm/ZV9gBS5bv5hfeq2hCDVD1q5jWohMjYHb+xyx2E/c4z8L8enEMR1yBBrB/4
 qw3DgGRMc0eoWzZrXKt1F6MYtM8QnU/H+WxxmLUYc4SMbClIonivAzzOa0PHO5nV
 YvewWwpo2VVT50EIxARXi0XQzJ0aYbTwo+E7KoG+MzK5sCmLJaP8inH5UnlltvEu
 OyhZUkh7qPNjhUQokP3FlNWidZkZNM+gJW6hmZwXtSGfGH6pe620QBrbghcV3nCh
 QXyENcKUcyDNJQSBQFKV/s4ql4RI+iDS3PfovyOdNqkEDWurWKx0AYcrn2dSNYt1
 wjFPH94P3NOwz4nIabpYiICrEPXLdXw+CCvHz56DgBL7EFzrWAzC97v5arYpouhZ
 NmsnzV8+PGHDWA+NM73fKGokW259pvk/zOPtM13zoyXP18X6adGKUqMpZA7jZge9
 RHzO/+A+3ScMXaXdb3c2P7GYH/iSSJXBmcTg8BceOYxICTq1wh4HEzvm39vEwMop
 3BMloPmzKxzfis69swIu
 =WtiV
 -----END PGP SIGNATURE-----

Merge tag 'dlm-3.9' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm

Pull dlm update from David Teigland:
 "This includes a single patch to avoid excessive and unnecessary
  scanning of rsbs to free."

* tag 'dlm-3.9' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
  dlm: avoid scanning unchanged toss lists
2013-02-21 09:25:23 -08:00
Linus Torvalds
2171ee8f43 NFS client bugfixes for Linux 3.9
- Fix an Oops in the pNFS layoutget code
 - Fix a number of NFSv4 and v4.1 state recovery deadlocks and hangs
   due to the interaction of the session drain lock and state management
   locks.
 - Remove task->tk_xprt, which was hiding a lot of RCU dereferencing bugs
 - Fix a long standing NFSv3 posix lock recovery bug.
 - Revert commit 324d003b0c. It turned out
   that the root cause of the deadlock was due to interactions with the
   workqueues that have now been resolved.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.13 (GNU/Linux)
 
 iQIcBAABAgAGBQJRJUfVAAoJEGcL54qWCgDySIoP/2QC2qNKqfcoIwOqOAR0pu5m
 wHybR67WXt/R6EK17l273apHfnJQb/IFFKRBfAIAW0vy4bJLwHMtutIeUweFWht2
 hmluYqVWAa9lPwaY1YMolo8Mfizaj0cY3QHv2ogF4poY0SLO6L8O/xo8n8nrn/pk
 QJW9tcvz43A2tbW92t833sQMFQSSjs9L2tvxM/BEahLrSEE4dl6WMWscmHQ/IXt/
 Hs3ADc4//dDF36k5lu7tU0qaVNDacjMXCV6Bw0HUCWCwjWzqmPxSlTDKAcer2vpm
 UawKDFTF4X1QYwbtGlNju0xhsSGRg7ZNRVEeRxqSROT280nzpp7C3CjM9sECtM2l
 SR9RqoYfBsEAyWZtsALApuSBZZZHOvUfVJEL1Cm0ZSeEbWSOO/DT07KRVgfb97h4
 hmE7pn5htgXdiGN2YCphMjrqjaoP4aKmULenj1h1jHbDOgrlqp2QLWqg/5NxMsTw
 M3vmYqQWkW60JN6e1pWJbsoyUkSMTVycRNrZ6WgShtd1lrz3tg66rDXYWdIhU6pT
 6qAKrkXnM5JkLoMoImNQod00Vr1ZXpVc8DLoFtug8rGohCdFJhWkYHefqVJ9rFhv
 2S2NpKDsFiA0q3aK0/YAkUnykYDAnLn7BM5SvhJjq8mKru4bP70PrN9LO8aqhE4E
 fFgj2MGG5QwrL0/GMuls
 =dthb
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-3.9-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client bugfixes from Trond Myklebust:

 - Fix an Oops in the pNFS layoutget code

 - Fix a number of NFSv4 and v4.1 state recovery deadlocks and hangs due
   to the interaction of the session drain lock and state management
   locks.

 - Remove task->tk_xprt, which was hiding a lot of RCU dereferencing
   bugs

 - Fix a long standing NFSv3 posix lock recovery bug.

 - Revert commit 324d003b0c ("NFS: add nfs_sb_deactive_async to avoid
   deadlock").  It turned out that the root cause of the deadlock was
   due to interactions with the workqueues that have now been resolved.

* tag 'nfs-for-3.9-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (22 commits)
  NLM: Ensure that we resend all pending blocking locks after a reclaim
  umount oops when remove blocklayoutdriver first
  sunrpc: silence build warning in gss_fill_context
  nfs: remove kfree() redundant null checks
  NFSv4.1: Don't decode skipped layoutgets
  NFSv4.1: Fix bulk recall and destroy of layouts
  NFSv4.1: Fix an ABBA locking issue with session and state serialisation
  NFSv4: Fix a reboot recovery race when opening a file
  NFSv4: Ensure delegation recall and byte range lock removal don't conflict
  NFSv4: Fix up the return values of nfs4_open_delegation_recall
  NFSv4.1: Don't lose locks when a server reboots during delegation return
  NFSv4.1: Prevent deadlocks between state recovery and file locking
  NFSv4: Allow the state manager to mark an open_owner as being recovered
  SUNRPC: Add missing static declaration to _gss_mech_get_by_name
  Revert "NFS: add nfs_sb_deactive_async to avoid deadlock"
  SUNRPC: Nuke the tk_xprt macro
  SUNRPC: Avoid RCU dereferences in the transport bind and connect code
  SUNRPC: Fix an RCU dereference in xprt_reserve
  SUNRPC: Pass pointers to struct rpc_xprt to the congestion window
  SUNRPC: Fix an RCU dereference in xs_local_rpcbind
  ...
2013-02-21 09:23:01 -08:00
Linus Torvalds
9b9a72a8a3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw
Pull GFS2 updates from Steven Whitehouse:
 "This is one of the smallest collections of patches for the merge
  window for some time.  There are some clean ups relating to the
  transaction code and the shrinker, which are mostly in preparation for
  further development, but also make the code much easier to follow in
  these areas.

  There is a patch which allows the use of ->writepages even in the
  default ordered write mode for all writebacks.  This results in
  sending larger i/os to the block layer, and a subsequent increase in
  performance.  It also reduces the number of different i/o paths by
  one.

  There is also a bug fix reinstating the withdraw ack system which
  somehow got lost when the lock modules were merged into GFS2."

* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw:
  GFS2: Reinstate withdraw ack system
  GFS2: Get a block reservation before resizing a file
  GFS2: Split glock lru processing into two parts
  GFS2: Use ->writepages for ordered writes
  GFS2: Clean up freeze code
  GFS2: Merge gfs2_attach_bufdata() into trans.c
  GFS2: Copy gfs2_trans_add_bh into new data/meta functions
  GFS2: Split gfs2_trans_add_bh() into two
  GFS2: Merge revoke adding functions
  GFS2: Separate LRU scanning from shrinker
2013-02-21 09:21:23 -08:00
Linus Torvalds
736a4c1177 xfs: update for 3.9-rc1
For 3.9-rc1 there are primarily bugfixes and a few cleanups.
 
 - fix(es) for compound buffers
 - remove unused XFS_TRANS_DEBUG routines
 - fix for dquot soft timer asserts due to overflow of d_blk_softlimit
 - don't zero allocation args structure members after they are memset(0)
 - fix for regression in dir v2 code introduced in commit 20f7e9f3
 - remove obsolete simple_strto<foo>
 - fix return value when filesystem probe finds no XFS magic, a
   regression introduced in 9802182.
 - remove boolean_t typedef completely
 - fix stack switch in __xfs_bmapi_allocate by moving the check for stack
   switch up into xfs_bmapi_write.
 - fix build error due to incomplete boolean_t removal
 - fix oops in _xfs_buf_find by validating that the requested block is
   within the filesystem bounds.
 - limit speculative preallocation near ENOSPC.
 - fix an unmount hang in xfs_wait_buftarg by freeing the
   xfs_buf_log_item in xfs_buf_item_unlock.
 - fix a possible use after free with AIO.
 - fix xfs_swap_extents after removal of xfs_flushinval_pages, a
   regression introduced in fb59581404.
 - replace hardcoded 128 with log header size
 - add memory barrier before wake_up_bit in xfs_ifunlock
 - limit speculative preallocation on sparse files
 - fix xa_lock recursion bug introduced in 90810b9e82
 - fix write verifier for symlinks
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.10 (GNU/Linux)
 
 iQIcBAABAgAGBQJRI/pkAAoJENaLyazVq6ZO8F0QAMUgw+edrht4x/vr0stNpLvL
 /htIhJVhRjWthQaSiUkmlHShNZCgvfKSRPG9OiaMo24QuqardB3Ieh90JTW8b5Iu
 63tQ1kKB++n1PnW7OYSxvUj2WNBHwHlI0ueNUiEFXoD9tQOWcyO63CxA/qnTJB1P
 gulI+R7gWafqD1UpH6kP/xCMn3V3PnCYVdX2LJHuEjQoCydsKgtc9dgvRi4e51kf
 OrxnKof6tAlkpNc1+ZmoQch4NS+jRWv90HGdrxOy/BESC0QYSyfrAwxXIVvkwBfu
 NsQg5oDCp4jlDQt9gjgABhPcWvSGc7JhSjqbXTnweRS4t7SOGGvb46HVrLMKZRID
 MFT8tbc1HXzAl2z2k0S0bRy9HNC4QZg5GJZ/MhXMKUQWZLZf/XPALgQ5q95hEjsc
 MUTudN7Xa8Ay+2BPTI34M3f1Mlzmm68N07rzjw8jmJzjVq8v7LJ1wbmpiswjkHiW
 6u6MSwY3NDhjNI0sA+h6ByX3KwYfBW9Y63QI/8Dld7jDFQ5onxYmkIP+fPe1vhBJ
 51gNm7bKXImEFpykAP5YddCnxMWtFUdIH/Wyu7quMpwv5YSIR+zVd3l4edqAKIgh
 YlJ9bLewWP+4B2L4raotPjdaOxDZfrW2nP0vmVcuQWEbMdANZ8XQTNPCULcoKGtY
 iBCDlWS7wAfH8PYnXlbp
 =/Zui
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-v3.9-rc1' of git://oss.sgi.com/xfs/xfs

Pull xfs update from Ben Myers:
 "Primarily bugfixes and a few cleanups:

   - fix(es) for compound buffers

   - remove unused XFS_TRANS_DEBUG routines

   - fix for dquot soft timer asserts due to overflow of d_blk_softlimit

   - don't zero allocation args structure members after they are memset(0)

   - fix for regression in dir v2 code introduced in commit 20f7e9f3

   - remove obsolete simple_strto<foo>

   - fix return value when filesystem probe finds no XFS magic, a
     regression introduced in 9802182.

   - remove boolean_t typedef completely

   - fix stack switch in __xfs_bmapi_allocate by moving the check for
     stack switch up into xfs_bmapi_write.

   - fix build error due to incomplete boolean_t removal

   - fix oops in _xfs_buf_find by validating that the requested block is
     within the filesystem bounds.

   - limit speculative preallocation near ENOSPC.

   - fix an unmount hang in xfs_wait_buftarg by freeing the
     xfs_buf_log_item in xfs_buf_item_unlock.

   - fix a possible use after free with AIO.

   - fix xfs_swap_extents after removal of xfs_flushinval_pages, a
     regression introduced in fb59581404.

   - replace hardcoded 128 with log header size

   - add memory barrier before wake_up_bit in xfs_ifunlock

   - limit speculative preallocation on sparse files

   - fix xa_lock recursion bug introduced in 90810b9e82

   - fix write verifier for symlinks"

Fixed up conflicts in fs/xfs/xfs_buf_item.c (due to bli_format rename in
commit 0f22f9d0cd affecting the removed XFS_TRANS_DEBUG routines in
commit ec47eb6b0b).

* tag 'for-linus-v3.9-rc1' of git://oss.sgi.com/xfs/xfs: (36 commits)
  xfs: xfs_bmap_add_attrfork_local is too generic
  xfs: remove log force from xfs_buf_trylock()
  xfs: recheck buffer pinned status after push trylock failure
  xfs: limit speculative prealloc size on sparse files
  xfs: memory barrier before wake_up_bit()
  xfs: refactor space log reservation for XFS_TRANS_ATTR_SET
  xfs: make use of XFS_SB_LOG_RES() at xfs_fs_log_dummy()
  xfs: make use of XFS_SB_LOG_RES() at xfs_mount_log_sb()
  xfs: make use of XFS_SB_LOG_RES() at xfs_log_sbcount()
  xfs: introduce XFS_SB_LOG_RES() for transactions that modify sb on disk
  xfs: calculate XFS_TRANS_QM_QUOTAOFF_END space log reservation at mount time
  xfs: calculate XFS_TRANS_QM_QUOTAOFF space log reservation at mount time
  xfs: calculate XFS_TRANS_QM_DQALLOC space log reservation at mount time
  xfs: calcuate XFS_TRANS_QM_SETQLIM space log reservation at mount time
  xfs: calculate xfs_qm_write_sb_changes() space log reservation at mount time
  xfs: calculate XFS_TRANS_QM_SBCHANGE space log reservation at mount time
  xfs: make use of xfs_calc_buf_res() in xfs_trans.c
  xfs: add a helper to figure out the space log reservation per item
  xfs: Fix xfs_swap_extents() after removal of xfs_flushinval_pages()
  xfs: Fix possible use-after-free with AIO
  ...
2013-02-21 09:08:45 -08:00
Linus Torvalds
c4bc705e45 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse updates from Miklos Szeredi:
 "The biggest part of this pull request is a patch series from Maxim
  Patlasov to optimize scatter-gather direct IO.  There's also the
  addition of a "readdirplus" API, poll events and various fixes and
  cleanups.

  There's a one line change outside of fuse to mm/filemap.c which makes
  the argument of iov_iter_single_seg_count() const, required by Maxim's
  patches."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: (22 commits)
  fuse: allow control of adaptive readdirplus use
  Synchronize fuse header with one used in library
  fuse: send poll events
  fuse: don't WARN when nlink is zero
  fuse: avoid out-of-scope stack access
  fuse: bump version for READDIRPLUS
  FUSE: Adapt readdirplus to application usage patterns
  Do not use RCU for current process credentials
  fuse: cleanup fuse_direct_io()
  fuse: optimize __fuse_direct_io()
  fuse: optimize fuse_get_user_pages()
  fuse: pass iov[] to fuse_get_user_pages()
  mm: minor cleanup of iov_iter_single_seg_count()
  fuse: use req->page_descs[] for argpages cases
  fuse: add per-page descriptor <offset, length> to fuse_req
  fuse: rework fuse_do_ioctl()
  fuse: rework fuse_perform_write()
  fuse: rework fuse_readpages()
  fuse: rework fuse_retrieve()
  fuse: categorize fuse_get_req()
  ...
2013-02-21 09:03:54 -08:00
Linus Torvalds
2608e3d0fa Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs
Pull v9fs updates from Eric Van Hensbergen:
 "Just fixes and simplifications"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs:
  fs/9p: Fix atomic_open
  fs/9p: Don't use O_TRUNC flag in TOPEN and TLOPEN request
  locking in fs/9p ->readdir()
2013-02-21 09:02:13 -08:00
Miao Xie
dc81cdc58a Btrfs: fix remount vs autodefrag
If we remount the fs to close the auto defragment or make the fs R/O,
we should stop the auto defragment.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-02-21 08:11:43 -05:00
Miao Xie
172a50497f Btrfs: fix wrong outstanding_extents when doing DIO write
When running the 083th case of xfstests on the filesystem with
"compress-force=lzo", the following WARNINGs were triggered.
  WARNING: at fs/btrfs/inode.c:7908
  WARNING: at fs/btrfs/inode.c:7909
  WARNING: at fs/btrfs/inode.c:7911
  WARNING: at fs/btrfs/extent-tree.c:4510
  WARNING: at fs/btrfs/extent-tree.c:4511

This problem was introduced by the patch "Btrfs: fix deadlock due
to unsubmitted". In this patch, there are two bugs which caused
the above problem.

The 1st one is a off-by-one bug, if the DIO write return 0, it is
also a short write, we need release the reserved space for it. But
we didn't do it in that patch. Fix it by change "ret > 0" to
"ret >= 0".

The 2nd one is ->outstanding_extents was increased twice when
a short write happened. As we know, ->outstanding_extents is
a counter to keep track of the number of extent items we may
use duo to delalloc, when we reserve the free space for a
delalloc write, we assume that the write will introduce just
one extent item, so we increase ->outstanding_extents by 1 at
that time. And then we will increase it every time we split the
write, it is done at the beginning of btrfs_get_blocks_direct().
So when a short write happens, we needn't increase
->outstanding_extents again. But this patch done.

In order to fix the 2nd problem, I re-write the logic for
->outstanding_extents operation. We don't increase it at the
beginning of btrfs_get_blocks_direct(), instead, we just
increase it when the split actually happens.

Reported-by: Mitch Harder <mitch.harder@sabayonlinux.org>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-02-21 08:11:43 -05:00
Linus Torvalds
a0b1c42951 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking update from David Miller:

 1) Checkpoint/restarted TCP sockets now can properly propagate the TCP
    timestamp offset.  From Andrey Vagin.

 2) VMWARE VM VSOCK layer, from Andy King.

 3) Much improved support for virtual functions and SR-IOV in bnx2x,
    from Ariel ELior.

 4) All protocols on ipv4 and ipv6 are now network namespace aware, and
    all the compatability checks for initial-namespace-only protocols is
    removed.  Thanks to Tom Parkin for helping deal with the last major
    holdout, L2TP.

 5) IPV6 support in netpoll and network namespace support in pktgen,
    from Cong Wang.

 6) Multiple Registration Protocol (MRP) and Multiple VLAN Registration
    Protocol (MVRP) support, from David Ward.

 7) Compute packet lengths more accurately in the packet scheduler, from
    Eric Dumazet.

 8) Use per-task page fragment allocator in skb_append_datato_frags(),
    also from Eric Dumazet.

 9) Add support for connection tracking labels in netfilter, from
    Florian Westphal.

10) Fix default multicast group joining on ipv6, and add anti-spoofing
    checks to 6to4 and 6rd.  From Hannes Frederic Sowa.

11) Make ipv4/ipv6 fragmentation memory limits more reasonable in modern
    times, rearrange inet frag datastructures for better cacheline
    locality, and move more operations outside of locking.  From Jesper
    Dangaard Brouer.

12) Instead of strict master <--> slave relationships, allow arbitrary
    scenerios with "upper device lists".  From Jiri Pirko.

13) Improve rate limiting accuracy in TBF and act_police, also from Jiri
    Pirko.

14) Add a BPF filter netfilter match target, from Willem de Bruijn.

15) Orphan and delete a bunch of pre-historic networking drivers from
    Paul Gortmaker.

16) Add TSO support for GRE tunnels, from Pravin B SHelar.  Although
    this still needs some minor bug fixing before it's %100 correct in
    all cases.

17) Handle unresolved IPSEC states like ARP, with a resolution packet
    queue.  From Steffen Klassert.

18) Remove TCP Appropriate Byte Count support (ABC), from Stephen
    Hemminger.  This was long overdue.

19) Support SO_REUSEPORT, from Tom Herbert.

20) Allow locking a socket BPF filter, so that it cannot change after a
    process drops capabilities.

21) Add VLAN filtering to bridge, from Vlad Yasevich.

22) Bring ipv6 on-par with ipv4 and do not cache neighbour entries in
    the ipv6 routes, from YOSHIFUJI Hideaki.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1538 commits)
  ipv6: fix race condition regarding dst->expires and dst->from.
  net: fix a wrong assignment in skb_split()
  ip_gre: remove an extra dst_release()
  ppp: set qdisc_tx_busylock to avoid LOCKDEP splat
  atl1c: restore buffer state
  net: fix a build failure when !CONFIG_PROC_FS
  net: ipv4: fix waring -Wunused-variable
  net: proc: fix build failed when procfs is not configured
  Revert "xen: netback: remove redundant xenvif_put"
  net: move procfs code to net/core/net-procfs.c
  qmi_wwan, cdc-ether: add ADU960S
  bonding: set sysfs device_type to 'bond'
  bonding: fix bond_release_all inconsistencies
  b44: use netdev_alloc_skb_ip_align()
  xen: netback: remove redundant xenvif_put
  net: fec: Do a sanity check on the gpio number
  ip_gre: propogate target device GSO capability to the tunnel device
  ip_gre: allow CSUM capable devices to handle packets
  bonding: Fix initialize after use for 3ad machine state spinlock
  bonding: Fix race condition between bond_enslave() and bond_3ad_update_lacp_rate()
  ...
2013-02-20 18:58:50 -08:00
Liu Bo
38c227d87c Btrfs: snapshot-aware defrag
This comes from one of btrfs's project ideas,
As we defragment files, we break any sharing from other snapshots.
The balancing code will preserve the sharing, and defrag needs to grow this
as well.

Now we're able to fill the blank with this patch, in which we make full use of
backref walking stuff.

Here is the basic idea,
o  set the writeback ranges started by defragment with flag EXTENT_DEFRAG
o  at endio, after we finish updating fs tree, we use backref walking to find
   all parents of the ranges and re-link them with the new COWed file layout by
   adding corresponding backrefs.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-02-20 18:06:09 -05:00
Chris Mason
86db25785a Btrfs: fix max chunk size on raid5/6
We try to limit the size of a chunk to 10GB, which keeps the unit of
work reasonable during balance and resize operations.  The limit checks
were taking into account the number of copies of the data we had but
what they really should be doing is comparing against the logical
size of the chunk we're creating.

This moves the code around a little to use the count of data stripes
from raid5/6.

Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-02-20 17:08:18 -05:00
Linus Torvalds
8793422fd9 ACPI and power management updates for 3.9-rc1
- Rework of the ACPI namespace scanning code from Rafael J. Wysocki
   with contributions from Bjorn Helgaas, Jiang Liu, Mika Westerberg,
   Toshi Kani, and Yinghai Lu.
 
 - ACPI power resources handling and ACPI device PM update from
   Rafael J. Wysocki.
 
 - ACPICA update to version 20130117 from Bob Moore and Lv Zheng
   with contributions from Aaron Lu, Chao Guan, Jesper Juhl, and
   Tim Gardner.
 
 - Support for Intel Lynxpoint LPSS from Mika Westerberg.
 
 - cpuidle update from Len Brown including Intel Haswell support, C1
   state for intel_idle, removal of global pm_idle.
 
 - cpuidle fixes and cleanups from Daniel Lezcano.
 
 - cpufreq fixes and cleanups from Viresh Kumar and Fabio Baltieri
   with contributions from Stratos Karafotis and Rickard Andersson.
 
 - Intel P-states driver for Sandy Bridge processors from
   Dirk Brandewie.
 
 - cpufreq driver for Marvell Kirkwood SoCs from Andrew Lunn.
 
 - cpufreq fixes related to ordering issues between acpi-cpufreq and
   powernow-k8 from Borislav Petkov and Matthew Garrett.
 
 - cpufreq support for Calxeda Highbank processors from Mark Langsdorf
   and Rob Herring.
 
 - cpufreq driver for the Freescale i.MX6Q SoC and cpufreq-cpu0 update
   from Shawn Guo.
 
 - cpufreq Exynos fixes and cleanups from Jonghwan Choi, Sachin Kamat,
   and Inderpal Singh.
 
 - Support for "lightweight suspend" from Zhang Rui.
 
 - Removal of the deprecated power trace API from Paul Gortmaker.
 
 - Assorted updates from Andreas Fleig, Colin Ian King,
   Davidlohr Bueso, Joseph Salisbury, Kees Cook, Li Fei,
   Nishanth Menon, ShuoX Liu, Srinivas Pandruvada, Tejun Heo,
   Thomas Renninger, and Yasuaki Ishimatsu.
 
 /
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.19 (GNU/Linux)
 
 iQIcBAABAgAGBQJRIsArAAoJEKhOf7ml8uNsD6MP/j7C4NA+GTq6RdwoJt+Yki0K
 9Ep8I4pEuRFoN/oskv24EyQhpGJIk6UxWcJ/DWFBc+1VhmKORta7k2Idv/wlJA77
 s7AcDveA9xcDh+TVfbh87TeuiMSXiSdDZbiaQO+wMizWJAF3F84AnjiAqqqyQcSK
 bA5/Siz/vWlt9PyYDaQtHTVE4lpvPuVcQdYewsdaH2PsmUjvIg/TUzg28CTrdyvv
 eHOdBK9R0/OLQLhzRbL0VOGJ//wEl+HJRO0QEhTKPgdQ1e/VH/4Zu5WSzF8P/x4C
 s2f8U4IKQqulDuDHXtpMpelFm7hRWgsOqZLkcyXLs+0dvSM9CTPO6P0ZaImxUctk
 5daHWEsXUnCErDQawt1mcZP8l6qnxofMQIfLXyPVzvlSnHyToTmrtXa1v2u4AuL/
 hOo4MYWsFNUmRdtGFFGlExGgEDZ4G5NwiYjRBl/6XJ3v4nhnnMbuzxP8scpoe5m1
 8tjroJHZFUUs/mFU/H+oRbHzSzXPmp1sddNaTg4OpVmTn3DDh6ljnFhiItd1Ndw0
 5ldVbSe6ETq5RoK0TbzvQOeVpa9F3JfqbrXLQPqfd2iz/No41LQYG1uShRYuXKuA
 wfEcc+c9VMd3FILu05pGwBnU8VS9VbxTYMz7xDxg6b29Ywnb7u+Q1ycCk2gFYtkS
 E2oZDuyewTJxaskzYsNr
 =wijn
 -----END PGP SIGNATURE-----

Merge tag 'pm+acpi-3.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull ACPI and power management updates from Rafael Wysocki:

 - Rework of the ACPI namespace scanning code from Rafael J.  Wysocki
   with contributions from Bjorn Helgaas, Jiang Liu, Mika Westerberg,
   Toshi Kani, and Yinghai Lu.

 - ACPI power resources handling and ACPI device PM update from Rafael
   J Wysocki.

 - ACPICA update to version 20130117 from Bob Moore and Lv Zheng with
   contributions from Aaron Lu, Chao Guan, Jesper Juhl, and Tim Gardner.

 - Support for Intel Lynxpoint LPSS from Mika Westerberg.

 - cpuidle update from Len Brown including Intel Haswell support, C1
   state for intel_idle, removal of global pm_idle.

 - cpuidle fixes and cleanups from Daniel Lezcano.

 - cpufreq fixes and cleanups from Viresh Kumar and Fabio Baltieri with
   contributions from Stratos Karafotis and Rickard Andersson.

 - Intel P-states driver for Sandy Bridge processors from Dirk
   Brandewie.

 - cpufreq driver for Marvell Kirkwood SoCs from Andrew Lunn.

 - cpufreq fixes related to ordering issues between acpi-cpufreq and
   powernow-k8 from Borislav Petkov and Matthew Garrett.

 - cpufreq support for Calxeda Highbank processors from Mark Langsdorf
   and Rob Herring.

 - cpufreq driver for the Freescale i.MX6Q SoC and cpufreq-cpu0 update
   from Shawn Guo.

 - cpufreq Exynos fixes and cleanups from Jonghwan Choi, Sachin Kamat,
   and Inderpal Singh.

 - Support for "lightweight suspend" from Zhang Rui.

 - Removal of the deprecated power trace API from Paul Gortmaker.

 - Assorted updates from Andreas Fleig, Colin Ian King, Davidlohr Bueso,
   Joseph Salisbury, Kees Cook, Li Fei, Nishanth Menon, ShuoX Liu,
   Srinivas Pandruvada, Tejun Heo, Thomas Renninger, and Yasuaki
   Ishimatsu.

* tag 'pm+acpi-3.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (267 commits)
  PM idle: remove global declaration of pm_idle
  unicore32 idle: delete stray pm_idle comment
  openrisc idle: delete pm_idle
  mn10300 idle: delete pm_idle
  microblaze idle: delete pm_idle
  m32r idle: delete pm_idle, and other dead idle code
  ia64 idle: delete pm_idle
  cris idle: delete idle and pm_idle
  ARM64 idle: delete pm_idle
  ARM idle: delete pm_idle
  blackfin idle: delete pm_idle
  sparc idle: rename pm_idle to sparc_idle
  sh idle: rename global pm_idle to static sh_idle
  x86 idle: rename global pm_idle to static x86_idle
  APM idle: register apm_cpu_idle via cpuidle
  cpufreq / intel_pstate: Add kernel command line option disable intel_pstate.
  cpufreq / intel_pstate: Change to disallow module build
  tools/power turbostat: display SMI count by default
  intel_idle: export both C1 and C1E
  ACPI / hotplug: Fix concurrency issues and memory leaks
  ...
2013-02-20 11:26:56 -08:00
Zach Brown
24542bf7ea btrfs: limit fallocate extent reservation to 256MB
Very large fallocate requests are cpu bound and result in extents with a
repeating pattern of ever decreasing size:

$ time fallocate -l 1T file
real	0m13.039s

( an excerpt of the extents from btrfs-debug-tree: )
  prealloc data disk byte 1536292564992 nr 397312
  prealloc data disk byte 1536292962304 nr 196608
  prealloc data disk byte 1536293158912 nr 98304
  prealloc data disk byte 1536293257216 nr 49152
  prealloc data disk byte 1536293306368 nr 24576
  prealloc data disk byte 1536293330944 nr 12288
  prealloc data disk byte 1536293343232 nr 8192
  prealloc data disk byte 1536293351424 nr 4096
  prealloc data disk byte 1536293355520 nr 4096
  prealloc data disk byte 1536293359616 nr 4096

The excessive cpu use comes from __btrfs_prealloc_file_range() trying to
allocate the entire remaining size after each extent is allocated.
btrfs_reserve_extent() repeatedly cuts this requested size in half until
it gets down to the size that the allocators can return.  We limit the
problem for now by capping each reservation at 256 meg.

The small extents come from a masking bug when decreasing the requested
reservation size.  The high 32bits are cleared and the remaining low
bits might happen to reserve a small size.   Fix this by using
round_down() which properly casts the mask.

After these fixes huge fallocate requests are fast and result in nice
large extents:

$ time fallocate -l 1T file
real	0m0.082s

  prealloc data disk byte 1112425889792 nr 268435456
  prealloc data disk byte 1112694325248 nr 268435456
  prealloc data disk byte 1112962760704 nr 268435456

Reported-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Zach Brown <zab@redhat.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-02-20 14:06:25 -05:00
Thomas Gleixner
1cba0cdf5e btrfs: Init io_lock after cloning btrfs device struct
__btrfs_close_devices() clones btrfs device structs with
memcpy(). Some of the fields in the clone are reinitialized, but it's
missing to init io_lock. In mainline this goes unnoticed, but on RT it
leaves the plist pointing to the original about to be freed lock
struct.

Initialize io_lock after cloning, so no references to the original
struct are left.

Reported-and-tested-by: Mike Galbraith <efault@gmx.de>
Cc: stable@vger.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-02-20 14:06:20 -05:00
Chris Mason
e942f883bc Merge branch 'raid56-experimental' into for-linus-3.9
Signed-off-by: Chris Mason <chris.mason@fusionio.com>

Conflicts:
	fs/btrfs/ctree.h
	fs/btrfs/extent-tree.c
	fs/btrfs/inode.c
	fs/btrfs/volumes.c
2013-02-20 14:06:05 -05:00
Chris Mason
b2c6b3e061 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next into for-linus-3.9
Signed-off-by: Chris Mason <chris.mason@fusionio.com>

Conflicts:
	fs/btrfs/disk-io.c
2013-02-20 14:05:45 -05:00
Miao Xie
272d26d0ad Btrfs: fix missing release of qgroup reservation in commit_transaction()
We forget to free qgroup reservation in commit_transaction(),fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Cc: Arne Jansen <sensille@gmx.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 13:00:09 -05:00
Wang Shilong
683cebda90 Btrfs: fix missing check before disabling quota
The original code forget to check whether quota has been disabled firstly,
and it will return 'EINVAL' and return error to users if quota has been
disabled,it will be unfriendly and confusing for users to see that.
So just return directly if quota has been disabled.

Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Cc: Arne Jansen <sensille@gmx.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 13:00:07 -05:00
Liu Bo
fa6ac8765c Btrfs: fix cleaner thread not working with inode cache option
Right now inode cache inode is treated as the same as space cache
inode, ie. keep inode in memory till putting super.

But this leads to an awkward situation.

If we're going to delete a snapshot/subvolume, btrfs will not
actually delete it and return free space, but will add it to dead
roots list until the last inode on this snap/subvol being destroyed.
Then we'll fetch deleted roots and cleanup them via cleaner thread.

So here is the problem, if we enable inode cache option, each
snap/subvol has a cached inode which is used to store inode allcation
information.  And this cache inode will be kept in memory, as the above
said.  So with inode cache, snap/subvol can only be added into
dead roots list during freeing roots stage in umount, so that we can
ONLY get space back after another remount(we cleanup dead roots on mount).

But the real thing is we'll no more use the snap/subvol if we mark it
deleted, so we can safely iput its cache inode when we delete snap/subvol.

Another thing is that we need to change the rules of droping inode, we
don't keep snap/subvol's cache inode in memory till end so that we can
add snap/subvol into dead roots list in time.

Reported-by: Mitch Harder <mitch.harder@sabayonlinux.org>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 13:00:06 -05:00
Miao Xie
d4edf39bd5 Btrfs: fix uncompleted transaction
In some cases, we need commit the current transaction, but don't want
to start a new one if there is no running transaction, so we introduce
the function - btrfs_attach_transaction(), which can catch the current
transaction, and return -ENOENT if there is no running transaction.

But no running transaction doesn't mean the current transction completely,
because we removed the running transaction before it completes. In some
cases, it doesn't matter. But in some special cases, such as freeze fs, we
hope the transaction is fully on disk, it will introduce some bugs, for
example, we may feeze the fs and dump the data in the disk, if the transction
doesn't complete, we would dump inconsistent data. So we need fix the above
problem for those cases.

We fixes this problem by introducing a function:
	btrfs_attach_transaction_barrier()
if we hope all the transaction is fully on the disk, even they are not
running, we can use this function.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 13:00:05 -05:00
Miao Xie
178260b2c1 Btrfs: fix the deadlock between the transaction start/attach and commit
Now btrfs_commit_transaction() does this

ret = btrfs_run_ordered_operations(root, 0)

which async flushes all inodes on the ordered operations list, it introduced
a deadlock that transaction-start task, transaction-commit task and the flush
workers waited for each other.
(See the following URL to get the detail
 http://marc.info/?l=linux-btrfs&m=136070705732646&w=2)

As we know, if ->in_commit is set, it means someone is committing the
current transaction, we should not try to join it if we are not JOIN
or JOIN_NOLOCK, wait is the best choice for it. In this way, we can avoid
the above problem. In this way, there is another benefit: there is no new
transaction handle to block the transaction which is on the way of commit,
once we set ->in_commit.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 13:00:03 -05:00
Miao Xie
4b82490649 Btrfs: fix the qgroup reserved space is released prematurely
In start_transactio(), we will try to join the transaction again after
the current transaction is committed, so we should not release the
reserved space of the qgroup. Fix it.

Cc: Arne Jansen <sensille@gmx.net>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 13:00:02 -05:00
Zach Brown
cdb4c5748c btrfs: define BTRFS_MAGIC as a u64 value
super.magic is an le64 but it's treated as an unterminated string when
compared against BTRFS_MAGIC which is defined as a string.  Instead
define BTRFS_MAGIC as a normal hex value and use endian helpers to
compare it to the super's magic.

I tested this by mounting an fs made before the change and made sure
that it didn't introduce sparse errors.  This matches a similar cleanup
that is pending in btrfs-progs.  David Sterba pointed out that we should
fix the kernel side as well :).

Signed-off-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 13:00:01 -05:00
jeff.liu
a8bfd4abea Btrfs: set/change the label of a mounted file system
With this new ioctl(2) BTRFS_IOC_SET_FSLABEL, we can set/change the label of a mounted file system.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Goffredo Baroncelli <kreijack@inwind.it>
Reviewed-by: David Sterba <dsterba@suse.cz>
Reviewed-by: Goffredo Baroncelli <kreijack@inwind.it>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 12:59:59 -05:00
jeff.liu
867ab667e7 Btrfs: Add a new ioctl to get the label of a mounted file system
Add a new ioctl(2) BTRFS_IOC_GET_FSLABLE, so that we can get the label upon a mounted filesystem.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Cc: Miao Xie <miaox@cn.fujitsu.com>
Cc: Goffredo Baroncelli <kreijack@inwind.it>
Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 12:59:58 -05:00
Josef Bacik
569e0f358c Btrfs: place ordered operations on a per transaction list
Miao made the ordered operations stuff run async, which introduced a
deadlock where we could get somebody (sync) racing in and committing the
transaction while a commit was already happening.  The new committer would
try and flush ordered operations which would hang waiting for the commit to
finish because it is done asynchronously and no longer inherits the callers
trans handle.  To fix this we need to make the ordered operations list a per
transaction list.  We can get new inodes added to the ordered operation list
by truncating them and then having another process writing to them, so this
makes it so that anybody trying to add an ordered operation _must_ start a
transaction in order to add itself to the list, which will keep new inodes
from getting added to the ordered operations list after we start committing.
This should fix the deadlock and also keeps us from doing a lot more work
than we need to during commit.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 12:59:57 -05:00
Josef Bacik
dde5740fdd Btrfs: relax the block group size limit for bitmaps
Dave pointed out that xfstests 273 will tell you that it failed to load the
space cache for a block group when it remounts.  This is because we run out
of space writing out the block group cache.  This is ok and is working as it
should, but let's try to be a bit nicer.  This happens because the block
group was 100mb, but bitmap entries cover 128mb, so we were only getting
extent entries for this block group, which ended up being too many to fit in
the free space cache.  So relax the bitmap size requirements to block groups
that are at least half the size a bitmap will cover or larger, that way we
can still keep the amount of space used in the free space cache low enough
to be able to write it out.  With this patch I no longer fail to write out
the free space cache.  Thanks,

Reported-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 12:59:55 -05:00
Ilya Dryomov
3e39cea61c Btrfs: allow for selecting only completely empty chunks
Enhance balance usage filter by making it possible to balance out only
completely empty chunks.  Today, usage filter properly acts on values
from 1 to 99 inclusive, usage=100 selects all chunks, and usage=0
selects no chunks.  This commit changes the usage=0 case: the new
meaning is to restripe only completely empty chunks and nothing else.

Suggested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 12:59:54 -05:00
Ilya Dryomov
bf023ecfca Btrfs: eliminate a use-after-free in btrfs_balance()
Commit 5af3e8cc introduced a use-after-free at volumes.c:3139: bctl is freed
above in __cancel_balance() in all cases except for balance pause.  Fix this
by moving the offending check a couple statements above, the meaning of the
check is preserved.

Reported-by: Chris Mason <chris.mason@fusionio.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 12:59:53 -05:00
Josef Bacik
c8f2f24bd5 Btrfs: remove unused extent io tree ops V2
Nobody uses these io tree ops anymore so just remove them and clean up the code
a bit.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 12:59:52 -05:00
David Sterba
210549ebe9 btrfs: add cancellation points to defrag
The defrag operation can take very long, we want to have a way how to
cancel it. The code checks for a pending signal at safe points in the
defrag loops and returns EAGAIN. This means a user can press ^C after
running 'btrfs fi defrag', woks for both defrag modes, files and root.

Returning from the command was instant in my light tests, but may take
longer depending on the aging factor of the filesystem.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 12:59:51 -05:00
David Sterba
b069e0c345 btrfs: put some enospc messages under enospc_debug
The warning in use_block_rsv is not useful for users and may fill
the logs unnecessarily.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 12:59:49 -05:00
Miao Xie
38851cc19a Btrfs: implement unlocked dio write
This idea is from ext4. By this patch, we can make the dio write parallel,
and improve the performance. But because we can not update isize without
i_mutex, the unlocked dio write just can be done in front of the EOF.

We needn't worry about the race between dio write and truncate, because the
truncate need wait untill all the dio write end.

And we also needn't worry about the race between dio write and punch hole,
because we have extent lock to protect our operation.

I ran fio to test the performance of this feature.

== Hardware ==
CPU: Intel(R) Core(TM)2 Duo CPU     E7500  @ 2.93GHz
Mem: 2GB
SSD: Intel X25-M 120GB (Test Partition: 60GB)

== config file ==
[global]
ioengine=psync
direct=1
bs=4k
size=32G
runtime=60
directory=/mnt/btrfs/
filename=testfile
group_reporting
thread

[file1]
numjobs=1 # 2 4
rw=randwrite

== result (KBps) ==
write	1	2	4
lock	24936	24738	24726
nolock	24962	30866	32101

== result (iops) ==
write	1	2	4
lock	6234	6184	6181
nolock	6240	7716	8025

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 12:59:48 -05:00
Miao Xie
2e60a51e62 Btrfs: serialize unlocked dio reads with truncate
Currently, we can do unlocked dio reads, but the following race
is possible:

dio_read_task			truncate_task
				->btrfs_setattr()
->btrfs_direct_IO
    ->__blockdev_direct_IO
      ->btrfs_get_block
				  ->btrfs_truncate()
				 #alloc truncated blocks
				 #to other inode
      ->submit_io()
     #INFORMATION LEAK

In order to avoid this problem, we must serialize unlocked dio reads with
truncate. There are two approaches:
- use extent lock to protect the extent that we truncate
- use inode_dio_wait() to make sure the truncating task will wait for
  the read DIO.

If we use the 1st one, we will meet the endless truncation problem due to
the nonlocked read DIO after we implement the nonlocked write DIO. It is
because we still need invoke inode_dio_wait() avoid the race between write
DIO and truncation. By that time, we have to introduce

  btrfs_inode_{block, resume}_nolock_dio()

again. That is we have to implement this patch again, so I choose the 2nd
way to fix the problem.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 12:59:47 -05:00
Miao Xie
0934856d46 Btrfs: fix deadlock due to unsubmitted
The deadlock problem happened when running fsstress(a test program in LTP).

Steps to reproduce:
 # mkfs.btrfs -b 100M <partition>
 # mount <partition> <mnt>
 # <Path>/fsstress -p 3 -n 10000000 -d <mnt>

The reason is:
btrfs_direct_IO()
 |->do_direct_IO()
     |->get_page()
     |->get_blocks()
     |	 |->btrfs_delalloc_resereve_space()
     |	 |->btrfs_add_ordered_extent() -------	Add a new ordered extent
     |->dio_send_cur_page(page0) --------------	We didn't submit bio here
     |->get_page()
     |->get_blocks()
	 |->btrfs_delalloc_resereve_space()
	     |->flush_space()
		 |->btrfs_start_ordered_extent()
		     |->wait_event() ----------	Wait the completion of
						the ordered extent that is
						mentioned above

But because we didn't submit the bio that is mentioned above, the ordered
extent can not complete, we would wait for its completion forever.

There are two methods which can fix this deadlock problem:
1. submit the bio before we invoke get_blocks()
2. reserve the space before we do dio

Though the 1st is the simplest way, we need modify the code of VFS, and it
is likely to break contiguous requests, and introduce performance regression
for the other filesystems.

So we have to choose the 2nd way.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Cc: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 12:59:45 -05:00
Josef Bacik
4a7d0f6854 Btrfs: cleanup orphan reservation if truncate fails
I noticed we were getting lots of warnings with xfstest 83 because we have
reservations outstanding.  This is because we moved the orphan add outside
of the truncate, but we don't actually cleanup our reservation if something
fails.  This fixes the problem and I no longer see warnings.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 12:59:44 -05:00
Josef Bacik
5d80366e9b Btrfs: steal from global reserve if we are cleaning up orphans
Sometimes xfstest 83 will fail to remount the scratch device because we've
gotten ourselves so full that we cannot cleanup the orphan items.  In this
case check to see if we're doing the orphan cleanup and if we are allow us
to steal our reservation from the global block rsv.  With this patch I've
not been able to reproduce the failed mount problem.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 12:59:42 -05:00
Miao Xie
8696c53304 Btrfs: fix memory leak of pending_snapshot->inherit
The argument "inherit" of btrfs_ioctl_snap_create_transid() was assigned
to NULL during we created the snapshots, so we didn't free it though we
called kfree() in the caller.

But since we are sure the snapshot creation is done after the function -
btrfs_ioctl_snap_create_transid() - completes, it is safe that we don't
assign the pointer "inherit" to NULL, and just free it in the caller of
btrfs_ioctl_snap_create_transid(). In this way, the code can become more
readable.

Reported-by: Alex Lyakas <alex.btrfs@zadarastorage.com>
Cc: Arne Jansen <sensille@gmx.net>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 12:59:41 -05:00
Miao Xie
2b8195bb57 Btrfs: fix the race between bio and btrfs_stop_workers
open_ctree() need read the metadata to initialize the global information
of btrfs. But it may fail after it submit some bio, and then it will jump
to the error path. Unfortunately, it doesn't check if there are some bios
in flight, and just stop all the worker threads. As a result, when the
submitted bios end, they can not find any worker thread which can deal with
subsequent work, then oops happen.

kernel BUG at fs/btrfs/async-thread.c:605!

Fix this problem by invoking invalidate_inode_pages2() before we stop the
worker threads. This function will wait until the bio end because it need
lock the pages which are going to be invalidated, and if a page is under
disk read IO, it must be locked. invalidate_inode_pages2() need wait until
end bio handler to unlocked it.

Reported-and-Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 12:59:40 -05:00
Mark Fasheh
cb95e7bf7b btrfs: add "no file data" flag to btrfs send ioctl
This patch adds the flag, BTRFS_SEND_FLAG_NO_FILE_DATA to the btrfs send
ioctl code. When this flag is set, the btrfs send code will never write file
data into the stream (thus also avoiding expensive reads of that data in the
first place). BTRFS_SEND_C_UPDATE_EXTENT commands will be sent (instead of
BTRFS_SEND_C_WRITE) with an offset, length pair indicating the extent in
question.

This patch does not affect the operation of BTRFS_SEND_C_CLONE commands -
they will continue to be sent when a search finds an appropriate extent to
clone from.

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 12:59:39 -05:00
Liu Bo
2f697dc6a6 Btrfs: extend the checksum item as much as possible
For write, we also reserve some space for COW blocks during updating
the checksum tree, and we calculate the number of blocks by checking
if the number of bytes outstanding that are going to need csums needs
one more block for csum.

When we add these checksum into the checksum tree, we use ordered sums
list.
Every ordered sum contains csums for each sector, and we'll first try
to look up an existing csum item,
a) if we don't yet have a proper csum item, then we need to insert one,
b) or if we find one but the csum item is not big enough, then we need
to extend it.

The point is we'll unlock the whole path and then insert or extend.
So others can hack in and update the tree.

Each insert or extend needs update the tree with COW on, and we may need
to insert/extend for many times.

That means what we've reserved for updating checksum tree is NOT enough
indeed.

The case is even more serious with having several write threads at the
same time, it can end up eating our reserved space quickly and starting
eating globle reserve pool instead.

I don't yet come up with a way to calculate the worse case for updating
csum, but extending the checksum item as much as possible can be helpful
in my test.

The idea behind is that it can reduce the times we insert/extend so that
it saves us precious reserved space.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 12:59:37 -05:00
Eric Sandeen
de78b51a28 btrfs: remove cache only arguments from defrag path
The entry point at the defrag ioctl always sets "cache only" to 0;
the codepaths haven't run for a long time as far as I can
tell.  Chris says they're dead code, so remove them.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 12:59:36 -05:00
Josef Bacik
e4a2bcaca9 Btrfs: if we aren't committing just end the transaction if we error out
I hit a deadlock where transaction commit was waiting on num_writers to be
0.  This happened because somebody came into btrfs_commit_transaction and
noticed we had aborted and it went to cleanup_transaction.  This shouldn't
happen because cleanup_transaction is really to fixup a bad commit, it
doesn't do the normal trans handle cleanup things.  So if we have an error
just do the normal btrfs_end_transaction dance and return.  Once we are in
the actual commit path we can use cleanup_transaction and be good to go.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 12:59:35 -05:00
Josef Bacik
3e04e7f10b Btrfs: handle errors in compression submission path
I noticed we would deadlock if we aborted a transaction while doing
compressed io.  This is because we don't unlock our pages if something goes
horribly wrong.  To fix this we need to make sure that we call
extent_clear_unlock_delalloc in order to unlock all the pages.  If we have
to cow in the async submission thread we need to make sure to unlock our
locked_page as the cow error path will not unlock the locked page as it
depends on the caller to unlock that page.  With this patch we no longer
deadlock on the page lock when we have an aborted transaction.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 12:59:33 -05:00
Josef Bacik
70afa3998c Btrfs: rework the overcommit logic to be based on the total size
People have been complaining about random ENOSPC errors that will clear up
after a umount or just a given amount of time.  Chris was able to reproduce
this with stress.sh and lots of processes and so was I.  Basically the
overcommit stuff would really let us get out of hand, in my tests I saw up
to 30 gigs of outstanding reservations with only 2 gigs total of metadata
space.  This usually worked out fine but with so much outstanding
reservation the flushing stuff short circuits to make sure we don't hang
forever flushing when we really need ENOSPC.  Plus we allocate chunks in
order to alleviate the pressure, but this doesn't actually help us since we
only use the non-allocated area in our over commit logic.

So instead of basing overcommit on the amount of non-allocated space,
instead just do it based on how much total space we have, and then limit it
to the non-allocated space in case we are short on space to spill over into.
This allows us to have the same performance as well as no longer giving
random ENOSPC.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 12:59:32 -05:00
Josef Bacik
925396ecf2 Btrfs: account for orphan inodes properly during cleanup
Dave sent me a panic where we were doing the orphan cleanup and panic'ed
trying to release our reservation from the orphan block rsv.  The reason for
this is because our orphan block rsv had been free'd out from underneath us
because the transaction commit found that there were no orphan inodes
according to its count and decided to free it.  This is incorrect so make
sure we inc the orphan inodes count so the accounting is all done properly.
This would also cause the warning in the orphan commit code normally if you
had any orphans to cleanup as they would only decrement the orphan count so
you'd get a negative orphan count which could cause problems during runtime.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 12:59:31 -05:00
Josef Bacik
0bec9ef525 Btrfs: unreserve space if our ordered extent fails to work
When a transaction aborts or there's an EIO on an ordered extent or any
error really we will not free up the space we reserved for this ordered
extent.  This results in warnings from the block group cache cleanup in the
case of a transaction abort, or leaking space in the case of EIO on an
ordered extent.  Fix this up by free'ing the reserved space if we have an
error at all trying to complete an ordered extent.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 12:59:29 -05:00
Josef Bacik
779880ef35 Btrfs: fix how we discard outstanding ordered extents on abort
When we abort we've been just free'ing up all the ordered extents and
hoping for the best.  This results in lots of warnings from various places,
warnings from btrfs_destroy_inode() because it's ENOSPC accounting isn't
fixed.  It will also screw up lots of pages who have been set private but
never get cleared because the ordered extents are never allowed to be
submitted.  This patch fixes those warnings.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 12:59:28 -05:00
Josef Bacik
eb12db690c Btrfs: fix freeing delayed ref head while still holding its mutex
I hit this error when reproducing a bug that would end in a transaction
abort.  We take the delayed ref head's mutex to keep anybody from processing
it while we're destroying it, but we fail to drop the mutex before we carry
on and free the damned thing.  Fix this by doing the remove logic for the
head ourselves and unlock the mutex, that way we can avoid use after free's
or hung tasks waiting on that mutex to come back so they know the delayed
ref completed.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 12:59:27 -05:00
Eric Sandeen
063d006fa0 btrfs: ensure we don't overrun devices_info[] in __btrfs_alloc_chunk
WARN_ON isn't enough, we need to stop the loop if for any reason
we would overrun the devices_info array.

I tried to track down the connection between the length of
the alloc_devices list and the rw_devices counter but
it wasn't immediately obvious, so be defensive about it.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 12:59:26 -05:00