Commit graph

32630 commits

Author SHA1 Message Date
Linus Torvalds
04bbc8e1f6 Fixes for pstore for 3.11 merge window
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJRzg7SAAoJEKurIx+X31iBYsEP/A06TUjFMx8ZmPBdya1u0thq
 4fMD8hUFv/vG90Opk1PBBCLWwAJJ6wwrQ8EQd5ieBInYSYW4bDlDFqCk7X0cZ7LE
 tlRNsUVDS7LpscWEXoYshRzquj7xyCg0nJMBlMSca3vPYQ82iUD0v9vp5/TKwOgu
 ti0mpj7kRj18FPkhokl+ja6fuUgUICPiRBvNR8DPskrnJ0lXv2Zyo0CKER47Wfa4
 +fVqXZL0Ih1tivqH/3abA1MYXjxh9k9Db+USupYpQnPwZwDTWap0z/OYU9Swni6I
 /XW5c5vrAX9LyMPxMenD0RUWMZXmRDc9ddseRBwVSzQtowQxFGChgwz1+41nFers
 KTSLlrU1kij7+axBsjgPMli0SUoPlspv7Lt3JzEoJb0rtkOZmCDXu254qbDy7zxH
 Q06Zmb86XtJRGQH43lxEUDZ4QSg+OHA4Uskp6s12Ly1KNvLpk5KOkbuM1DT2jPJb
 6KbfOJNcATeCAoax/qJky++qh4E42okG5gOeqC1zquWcfRoVuS+nWBbu4kubKlNs
 Qe3IjmybR2S6c1pidDViTnnR/FiununknZaBPavqvuqaf5wBaBwDKiBi973TYFyu
 D4g+KyaMwvicGp59Z+u9QLGTaeVxHlVv1l3oSd3x7SVhJBwVLihgz7tULS4+X+YX
 csTLhy2n8wE0aOHWH5su
 =Xcer
 -----END PGP SIGNATURE-----

Merge tag 'please-pull-pstore' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux

Pull pstore update from Tony Luck:
 "Fixes for pstore for 3.11 merge window"

* tag 'please-pull-pstore' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux:
  efivars: If pstore_register fails, free unneeded pstore buffer
  acpi: Eliminate console msg if pstore.backend excludes ERST
  pstore: Return unique error if backend registration excluded by kernel param
  pstore: Fail to unlink if a driver has not defined pstore_erase
  pstore/ram: remove the power of buffer size limitation
  pstore/ram: avoid atomic accesses for ioremapped regions
  efi, pstore: Cocci spatch "memdup.spatch"
2013-07-03 11:14:44 -07:00
Linus Torvalds
790eac5640 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull second set of VFS changes from Al Viro:
 "Assorted f_pos race fixes, making do_splice_direct() safe to call with
  i_mutex on parent, O_TMPFILE support, Jeff's locks.c series,
  ->d_hash/->d_compare calling conventions changes from Linus, misc
  stuff all over the place."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
  Document ->tmpfile()
  ext4: ->tmpfile() support
  vfs: export lseek_execute() to modules
  lseek_execute() doesn't need an inode passed to it
  block_dev: switch to fixed_size_llseek()
  cpqphp_sysfs: switch to fixed_size_llseek()
  tile-srom: switch to fixed_size_llseek()
  proc_powerpc: switch to fixed_size_llseek()
  ubi/cdev: switch to fixed_size_llseek()
  pci/proc: switch to fixed_size_llseek()
  isapnp: switch to fixed_size_llseek()
  lpfc: switch to fixed_size_llseek()
  locks: give the blocked_hash its own spinlock
  locks: add a new "lm_owner_key" lock operation
  locks: turn the blocked_list into a hashtable
  locks: convert fl_link to a hlist_node
  locks: avoid taking global lock if possible when waking up blocked waiters
  locks: protect most of the file_lock handling with i_lock
  locks: encapsulate the fl_link list handling
  locks: make "added" in __posix_lock_file a bool
  ...
2013-07-03 09:10:19 -07:00
Al Viro
af51a2ac36 ext4: ->tmpfile() support
very similar to ext3 counterpart...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-07-03 16:23:28 +04:00
Jie Liu
46a1c2c7ae vfs: export lseek_execute() to modules
For those file systems(btrfs/ext4/ocfs2/tmpfs) that support
SEEK_DATA/SEEK_HOLE functions, we end up handling the similar
matter in lseek_execute() to update the current file offset
to the desired offset if it is valid, ceph also does the
simliar things at ceph_llseek().

To reduce the duplications, this patch make lseek_execute()
public accessible so that we can call it directly from the
underlying file systems.

Thanks Dave Chinner for this suggestion.

[AV: call it vfs_setpos(), don't bring the removed 'inode' argument back]

v2->v1:
- Add kernel-doc comments for lseek_execute()
- Call lseek_execute() in ceph->llseek()

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chris Mason <chris.mason@fusionio.com>
Cc: Josef Bacik <jbacik@fusionio.com>
Cc: Ben Myers <bpm@sgi.com>
Cc: Ted Tso <tytso@mit.edu>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Sage Weil <sage@inktank.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-07-03 16:23:27 +04:00
Eliezer Tamir
1bc2774d86 net: convert lls to use time_in_range()
Time in range will fail safely if we move to a different cpu with an
extremely large clock skew.
Add time_in_range64() and convert lls to use it.

changelog:
v2
- fixed double call to sched_clock in can_poll_ll
- fixed checkpatchisms

Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-02 15:53:53 -07:00
Linus Torvalds
fc76a258d4 Driver core patches for 3.11-rc1
Here's the big driver core merge for 3.11-rc1
 
 Lots of little things, and larger firmware subsystem updates, all
 described in the shortlog.  Nice thing here is that we finally get rid
 of CONFIG_HOTPLUG, after 10+ years, thanks to Stephen Rohtwell (it had
 been always on for a number of kernel releases, now it's just removed.)
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.19 (GNU/Linux)
 
 iEYEABECAAYFAlHRsGMACgkQMUfUDdst+ylIIACfW8lLxOPVK+iYG699TWEBAkp0
 LFEAnjlpAMJ1JnoZCuWDZObNCev93zGB
 =020+
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-3.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core updates from Greg KH:
 "Here's the big driver core merge for 3.11-rc1

  Lots of little things, and larger firmware subsystem updates, all
  described in the shortlog.  Nice thing here is that we finally get rid
  of CONFIG_HOTPLUG, after 10+ years, thanks to Stephen Rohtwell (it had
  been always on for a number of kernel releases, now it's just
  removed)"

* tag 'driver-core-3.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (27 commits)
  driver core: device.h: fix doc compilation warnings
  firmware loader: fix another compile warning with PM_SLEEP unset
  build some drivers only when compile-testing
  firmware loader: fix compile warning with PM_SLEEP set
  kobject: sanitize argument for format string
  sysfs_notify is only possible on file attributes
  firmware loader: simplify holding module for request_firmware
  firmware loader: don't export cache_firmware and uncache_firmware
  drivers/base: Use attribute groups to create sysfs memory files
  firmware loader: fix compile warning
  firmware loader: fix build failure with !CONFIG_FW_LOADER_USER_HELPER
  Documentation: Updated broken link in HOWTO
  Finally eradicate CONFIG_HOTPLUG
  driver core: firmware loader: kill FW_ACTION_NOHOTPLUG requests before suspend
  driver core: firmware loader: don't cache FW_ACTION_NOHOTPLUG firmware
  Documentation: Tidy up some drivers/base/core.c kerneldoc content.
  platform_device: use a macro instead of platform_driver_register
  firmware: move EXPORT_SYMBOL annotations
  firmware: Avoid deadlock of usermodehelper lock at shutdown
  dell_rbu: Select CONFIG_FW_LOADER_USER_HELPER explicitly
  ...
2013-07-02 11:44:19 -07:00
Linus Torvalds
bcd7351e83 FS-Cache patches 2013-07-02
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.13 (GNU/Linux)
 
 iQIVAwUAUdLdUxOxKuMESys7AQK1kQ//W7fgFXCG+5XVk4ECHGN5tqRn4tU69DY0
 9nYU2/y1wbqV5cTO36XTcFPQK1qbW2ZdyvEZ2CF8OfwtQpLmcALGtpBIgJwYs+4H
 DMkgO06zdk4caxc0C4JBIGs+MDeLNk2SQObqblGl1BAQKQ5cqsCLsIZ/rxln999m
 ufuobfns1YvuHkzMtswUDmm3zWMpwqqPAbbl+fTwPU683a/AleckG2ACyFvKZAxA
 OyI8kJR4e33a3/BGo/5OFb3qI1+Z25EOWdvdnM+r4hdKJZF9ZySlyc640GZHAO2J
 wKj5lYp1nBpyNPvYvly174s2MxPju1CRHb7gxcV4LX3vtEY4/MCg7m6P46EUfC6R
 C3V7PMMCjZXEQ01MKEmGig47EJKIiecCQUZupJnP7HFKPzeJR9mQZFd68WqzswAM
 w9hcCw9hQ9y/kTDVrTVCHs0Q9iTxShfrJyfRJnQ1VcoT+1dieruTa9am9OBKiEw6
 CQrPjq9RZZfsZHYr6RlGZHGJyzjrTzrf6EhxwmgaCxWycpvCuV7z76YgAVZI7V4r
 qnJmH8dXWdoSA7nZ6sgsb5TRCLT9wu1nNId0DMpAGB1cDGga/55AZtqxdoJLnlkj
 y/4wQavIrkfHHuS8c3gzVXPtYmM19CHgcKRFydXD0uGobzfxwYKTKMH+Gviu1NnH
 /pGNNY2vVGI=
 =Wjhu
 -----END PGP SIGNATURE-----

Merge tag 'fscache-20130702' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

Pull FS-Cache updates from David Howells:
 "This contains a number of fixes for various FS-Cache issues plus some
  cleanups.  The commits are, in order:

   1) Provide a system wait_on_atomic_t() and wake_up_atomic_t() sharing
      the bit-wait table (enhancement for #8).

   2) Don't put spin_lock() in a while-condition as spin_lock() may have
      a do {} while(0) wrapper (cleanup).

   3) Symbolically name i_mutex lock classes rather than using numbers
      in CacheFiles (cleanup).

   4) Don't sleep in page release if __GFP_FS is not set (deadlock vs
      ext4).

   5) Uninline fscache_object_init() (cleanup for #7).

   6) Wrap checks on object state (cleanup for #7).

   7) Simplify the object state machine by separating work states from
      wait states.

   8) Simplify cookie retention by objects (NULL pointer deref fix).

   9) Remove unused list_to_page() macro (cleanup).

  10) Make the remaining-pages counter in the retrieval op atomic
      (assertion failure fix).

  11) Don't use spin_is_locked() in assertions (assertion failure fix)"

* tag 'fscache-20130702' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
  FS-Cache: Don't use spin_is_locked() in assertions
  FS-Cache: The retrieval remaining-pages counter needs to be atomic_t
  cachefiles: remove unused macro list_to_page()
  FS-Cache: Simplify cookie retention for fscache_objects, fixing oops
  FS-Cache: Fix object state machine to have separate work and wait states
  FS-Cache: Wrap checks on object state
  FS-Cache: Uninline fscache_object_init()
  FS-Cache: Don't sleep in page release if __GFP_FS is not set
  CacheFiles: name i_mutex lock class explicitly
  fs/fscache: remove spin_lock() from the condition in while()
  Add wait_on_atomic_t() and wake_up_atomic_t()
2013-07-02 09:52:47 -07:00
Linus Torvalds
6072a93b98 dlm for 3.11
This set includes a number of SCTP related fixes in the dlm,
 and a few other minor fixes and changes.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJR0eE4AAoJEDgbc8f8gGmqmE4QAKgEdiuaLzb1Y76ng7GU9UMz
 6+RVgjDr+fix5pHrs/KgTMNyfP8QKUYgoaspTTgH1YzyjL0Fvx7SjVt8MIs4694j
 OT1sPQhwWcCaeRxK9ss4qQxexN25N69iv/A6W1o+RPbL7VqXkvHfyCTePUEMlX4y
 55AXcSMlCUoYXKq3CMEkTG/25rK6L3fMopzLDKMyH+bbeOtQwHmTL9pHKj3VRwv2
 iV3A5U0pOdbGeWilTzhsLz+iF0oZsfjsNADrrAwuYaAbK5fcXQDDkO5hGkDwa/dd
 nDR4qs4TBIGUu8J8wgK4dz8Zhcm6mEo5A5/ORgbKvRo3dClLlhX/78xaYtMp/RGG
 8v+u2HVwiMd/Ma029m3SxgtjI6cH6XFtZ1ncnL8A72h2Ey1bPEbxmMOQFkSU48Ve
 ZZyjxLYhtaDxl0xzfqLhh8tWAk8JnOAKu2T5bgO2ShLFvTU6yMmyKwovPdFgZctf
 qV81wQbuXeqF4xi/uwNLh7UBIPnzXS3OYNhfgzTuawMSwSP2ix+rouqozv6c5804
 Ay7/cv+cVJItR0f7GKny0U/tEc5BXkngbhIAG3vWQfOjmU5WMuDMvAxzA9i4CH6h
 P1pgnoh0VoxoaX1meu0H7XnREAdBo1my9iaJE3Tzk4sjV341JjF9BAuYraMB6krK
 zHiQjNdDDBy/F8P0YCYt
 =/kqw
 -----END PGP SIGNATURE-----

Merge tag 'dlm-3.11' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm

Pull dlm updates from David Teigland:
 "This set includes a number of SCTP related fixes in the dlm, and a few
  other minor fixes and changes."

* tag 'dlm-3.11' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
  dlm: Avoid LVB truncation
  dlm: log an error for unmanaged lockspaces
  dlm: config: using strlcpy instead of strncpy
  dlm: remove duplicated include from lowcomms.c
  dlm: disable nagle for SCTP
  dlm: retry failed SCTP sends
  dlm: try other IPs when sctp init assoc fails
  dlm: clear correct bit during sctp init failure handling
  dlm: set sctp assoc id during setup
  dlm: clear correct init bit during sctp setup
2013-07-02 09:52:07 -07:00
Linus Torvalds
3f490f7f99 This patch-set includes the following major enhancement patches.
o remount_fs callback function
  o restore parent inode number to enhance the fsync performance
  o xattr security labels
  o reduce the number of redundant lock/unlock data pages
  o avoid frequent write_inode calls
 
 The other minor bug fixes are as follows.
  o endian conversion bugs
  o various bugs in the roll-forward recovery routine
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQIcBAABAgAGBQJR0h8YAAoJEEAUqH6CSFDSJNMP/1V+e/PaLa8YsHuw5eWPT3Fe
 RYFXJv7SXadHqgQInjwD7JOF8BC9DlUyknYUiXFUZgvKsgMHz+OTCNl+1hLTbUXt
 e6Rhn7vU/fMu3TEtj+v6g8JbpXdXH8TSbtvh9LpoczRS98GYpZuckP0BtQsVkTL3
 jIq6WD8JRkb2DvpDl7viTT0Eq0T61CSmtOwOIvfhhiVxggvDWcnR1mTM9Tymdi4J
 NKtFkjCsKP0Z/7IZZUJAczGEHjsT+O6JDwp8+KVWuZT4BSuchoX4MYAY5wX527Ne
 rZvkolbbfnAFCC3BtETr0DPOkpxnHmJ6dEveIxjZ9cux12CAFA/Ww1QAL4ygiDkd
 Avn8BBEJnfnuzeOUkE0by+9hdF9LOU3CVSxiDhWJegYB16z+c9pSBD9xvlKhKk5B
 QNsjptB6m0CftAq7vIDsryL60uJ2cSHFxPqfwAHEpNngiU/NohTFSZE0sUMbLUNh
 FI6NrHoT7yld1HcB6cvL1lnUqIENFbNhDSSDcTdlN49IJJap4oqtgCnmMMIwbUCO
 vR2/26k5W7NwG+K6XN2IX13AsayzQahxTv8in/+LV0bfjHo6/1VzzGcqAmXJbDQw
 dLrNAeWaaIJi8J/zJiENMbFKXTj8Wax9jxKsW+4towQuyEt4ADvyt1c5gX3VR42T
 x8+YEargsdBf6FAhtN+H
 =qFcy
 -----END PGP SIGNATURE-----

Merge tag 'for-f2fs-3.11' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs

Pull f2fs updates from Jaegeuk Kim:
 "This patch-set includes the following major enhancement patches:
   - remount_fs callback function
   - restore parent inode number to enhance the fsync performance
   - xattr security labels
   - reduce the number of redundant lock/unlock data pages
   - avoid frequent write_inode calls

  The other minor bug fixes are as follows.
   - endian conversion bugs
   - various bugs in the roll-forward recovery routine"

* tag 'for-f2fs-3.11' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (56 commits)
  f2fs: fix to recover i_size from roll-forward
  f2fs: remove the unused argument "sbi" of func destroy_fsync_dnodes()
  f2fs: remove reusing any prefree segments
  f2fs: code cleanup and simplify in func {find/add}_gc_inode
  f2fs: optimize the init_dirty_segmap function
  f2fs: fix an endian conversion bug detected by sparse
  f2fs: fix crc endian conversion
  f2fs: add remount_fs callback support
  f2fs: recover wrong pino after checkpoint during fsync
  f2fs: optimize do_write_data_page()
  f2fs: make locate_dirty_segment() as static
  f2fs: remove unnecessary parameter "offset" from __add_sum_entry()
  f2fs: avoid freqeunt write_inode calls
  f2fs: optimise the truncate_data_blocks_range() range
  f2fs: use the F2FS specific flags in f2fs_ioctl()
  f2fs: sync dir->i_size with its block allocation
  f2fs: fix i_blocks translation on various types of files
  f2fs: set sb->s_fs_info before calling parse_options()
  f2fs: support xattr security labels
  f2fs: fix iget/iput of dir during recovery
  ...
2013-07-02 09:42:38 -07:00
Linus Torvalds
c4eb1b0730 Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw
Pull GFS2 updates from Steven Whitehouse:
 "There are a few bug fixes for various, mostly very minor corner cases,
  plus some interesting new features.

  The new features include atomic_open whose main benefit will be the
  reduction in locking overhead in case of combined lookup/create and
  open operations, sorting the log buffer lists by block number to
  improve the efficiency of AIL writeback, and aggressively issuing
  revokes in gfs2_log_flush to reduce overhead when dropping glocks."

* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw:
  GFS2: Reserve journal space for quota change in do_grow
  GFS2: Fix fstrim boundary conditions
  GFS2: fix warning message
  GFS2: aggressively issue revokes in gfs2_log_flush
  GFS2: fix regression in dir_double_exhash
  GFS2: Add atomic_open support
  GFS2: Only do one directory search on create
  GFS2: fix error propagation in init_threads()
  GFS2: Remove no-op wrapper function
  GFS2: Cocci spatch "ptr_ret.spatch"
  GFS2: Eliminate gfs2_rg_lops
  GFS2: Sort buffer lists by inplace block number
2013-07-02 09:41:18 -07:00
Linus Torvalds
9e239bb939 Lots of bug fixes, cleanups and optimizations. In the bug fixes
category, of note is a fix for on-line resizing file systems where the
 block size is smaller than the page size (i.e., file systems 1k blocks
 on x86, or more interestingly file systems with 4k blocks on Power or
 ia64 systems.)
 
 In the cleanup category, the ext4's punch hole implementation was
 significantly improved by Lukas Czerner, and now supports bigalloc
 file systems.  In addition, Jan Kara significantly cleaned up the
 write submission code path.  We also improved error checking and added
 a few sanity checks.
 
 In the optimizations category, two major optimizations deserve
 mention.  The first is that ext4_writepages() is now used for
 nodelalloc and ext3 compatibility mode.  This allows writes to be
 submitted much more efficiently as a single bio request, instead of
 being sent as individual 4k writes into the block layer (which then
 relied on the elevator code to coalesce the requests in the block
 queue).  Secondly, the extent cache shrink mechanism, which was
 introduce in 3.9, no longer has a scalability bottleneck caused by the
 i_es_lru spinlock.  Other optimizations include some changes to reduce
 CPU usage and to avoid issuing empty commits unnecessarily.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABCAAGBQJR0XhgAAoJENNvdpvBGATwMXkQAJwTPk5XYLqtAwLziFLvM6wG
 0tWa1QAzTNo80tLyM9iGqI6x74X5nddLw5NMICUmPooOa9agMuA4tlYVSss5jWzV
 yyB7vLzsc/2eZJusuVqfTKrdGybE+M766OI6VO9WodOoIF1l51JXKjktKeaWegfv
 NkcLKlakD4V+ZASEDB/cOcR/lTwAs9dQ89AZzgPiW+G8Do922QbqkENJB8mhalbg
 rFGX+lu9W0f3fqdmT3Xi8KGn3EglETdVd6jU7kOZN4vb5LcF5BKHQnnUmMlpeWMT
 ksOVasb3RZgcsyf5ZOV5feXV601EsNtPBrHAmH22pWQy3rdTIvMv/il63XlVUXZ2
 AXT3cHEvNQP0/yVaOTCZ9xQVxT8sL4mI6kENP9PtNuntx7E90JBshiP5m24kzTZ/
 zkIeDa+FPhsDx1D5EKErinFLqPV8cPWONbIt/qAgo6663zeeIyMVhzxO4resTS9k
 U2QEztQH+hDDbjgABtz9M/GjSrohkTYNSkKXzhTjqr/m5huBrVMngjy/F4/7G7RD
 vSEx5aXqyagnrUcjsupx+biJ1QvbvZWOVxAE/6hNQNRGDt9gQtHAmKw1eG2mugHX
 +TFDxodNE4iWEURenkUxXW3mDx7hFbGZR0poHG3M/LVhKMAAAw0zoKrrUG5c70G7
 XrddRLGlk4Hf+2o7/D7B
 =SwaI
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 update from Ted Ts'o:
 "Lots of bug fixes, cleanups and optimizations.  In the bug fixes
  category, of note is a fix for on-line resizing file systems where the
  block size is smaller than the page size (i.e., file systems 1k blocks
  on x86, or more interestingly file systems with 4k blocks on Power or
  ia64 systems.)

  In the cleanup category, the ext4's punch hole implementation was
  significantly improved by Lukas Czerner, and now supports bigalloc
  file systems.  In addition, Jan Kara significantly cleaned up the
  write submission code path.  We also improved error checking and added
  a few sanity checks.

  In the optimizations category, two major optimizations deserve
  mention.  The first is that ext4_writepages() is now used for
  nodelalloc and ext3 compatibility mode.  This allows writes to be
  submitted much more efficiently as a single bio request, instead of
  being sent as individual 4k writes into the block layer (which then
  relied on the elevator code to coalesce the requests in the block
  queue).  Secondly, the extent cache shrink mechanism, which was
  introduce in 3.9, no longer has a scalability bottleneck caused by the
  i_es_lru spinlock.  Other optimizations include some changes to reduce
  CPU usage and to avoid issuing empty commits unnecessarily."

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (86 commits)
  ext4: optimize starting extent in ext4_ext_rm_leaf()
  jbd2: invalidate handle if jbd2_journal_restart() fails
  ext4: translate flag bits to strings in tracepoints
  ext4: fix up error handling for mpage_map_and_submit_extent()
  jbd2: fix theoretical race in jbd2__journal_restart
  ext4: only zero partial blocks in ext4_zero_partial_blocks()
  ext4: check error return from ext4_write_inline_data_end()
  ext4: delete unnecessary C statements
  ext3,ext4: don't mess with dir_file->f_pos in htree_dirblock_to_tree()
  jbd2: move superblock checksum calculation to jbd2_write_superblock()
  ext4: pass inode pointer instead of file pointer to punch hole
  ext4: improve free space calculation for inline_data
  ext4: reduce object size when !CONFIG_PRINTK
  ext4: improve extent cache shrink mechanism to avoid to burn CPU time
  ext4: implement error handling of ext4_mb_new_preallocation()
  ext4: fix corruption when online resizing a fs with 1K block size
  ext4: delete unused variables
  ext4: return FIEMAP_EXTENT_UNKNOWN for delalloc extents
  jbd2: remove debug dependency on debug_fs and update Kconfig help text
  jbd2: use a single printk for jbd_debug()
  ...
2013-07-02 09:39:34 -07:00
Linus Torvalds
63580e51bb Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull VFS patches (part 1) from Al Viro:
 "The major change in this pile is ->readdir() replacement with
  ->iterate(), dealing with ->f_pos races in ->readdir() instances for
  good.

  There's a lot more, but I'd prefer to split the pull request into
  several stages and this is the first obvious cutoff point."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (67 commits)
  [readdir] constify ->actor
  [readdir] ->readdir() is gone
  [readdir] convert ecryptfs
  [readdir] convert coda
  [readdir] convert ocfs2
  [readdir] convert fatfs
  [readdir] convert xfs
  [readdir] convert btrfs
  [readdir] convert hostfs
  [readdir] convert afs
  [readdir] convert ncpfs
  [readdir] convert hfsplus
  [readdir] convert hfs
  [readdir] convert befs
  [readdir] convert cifs
  [readdir] convert freevxfs
  [readdir] convert fuse
  [readdir] convert hpfs
  reiserfs: switch reiserfs_readdir_dentry to inode
  reiserfs: is_privroot_deh() needs only directory inode, actually
  ...
2013-07-02 09:28:37 -07:00
Dave Chinner
7747bd4bce sync: don't block the flusher thread waiting on IO
When sync does it's WB_SYNC_ALL writeback, it issues data Io and
then immediately waits for IO completion. This is done in the
context of the flusher thread, and hence completely ties up the
flusher thread for the backing device until all the dirty inodes
have been synced. On filesystems that are dirtying inodes constantly
and quickly, this means the flusher thread can be tied up for
minutes per sync call and hence badly affect system level write IO
performance as the page cache cannot be cleaned quickly.

We already have a wait loop for IO completion for sync(2), so cut
this out of the flusher thread and delegate it to wait_sb_inodes().
Hence we can do rapid IO submission, and then wait for it all to
complete.

Effect of sync on fsmark before the patch:

FSUse%        Count         Size    Files/sec     App Overhead
.....
     0       640000         4096      35154.6          1026984
     0       720000         4096      36740.3          1023844
     0       800000         4096      36184.6           916599
     0       880000         4096       1282.7          1054367
     0       960000         4096       3951.3           918773
     0      1040000         4096      40646.2           996448
     0      1120000         4096      43610.1           895647
     0      1200000         4096      40333.1           921048

And a single sync pass took:

  real    0m52.407s
  user    0m0.000s
  sys     0m0.090s

After the patch, there is no impact on fsmark results, and each
individual sync(2) operation run concurrently with the same fsmark
workload takes roughly 7s:

  real    0m6.930s
  user    0m0.000s
  sys     0m0.039s

IOWs, sync is 7-8x faster on a busy filesystem and does not have an
adverse impact on ongoing async data write operations.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-02 09:16:42 -07:00
Josef Bacik
0e267c44c3 Btrfs: wait ordered range before doing direct io
My recent truncate patch uncovered this bug, but I can reproduce it without the
truncate patch.  If you mount with -o compress-force, do a direct write to some
area, do a buffered write to some other area, and then do a direct read you will
get the wrong data for where you did the buffered write.  This is because the
generic direct io helpers only call filemap_write_and_wait once, and for
compression we need it twice.  So to be safe add the btrfs_wait_ordered_range to
the start of the direct io function to make sure any compressed writes have
truly been written.  This patch makes xfstests 130 pass when you mount with -o
compress-force=lzo.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02 11:51:49 -04:00
Josef Bacik
7fb7d76f96 Btrfs: only do the tree_mod_log_free_eb if this is our last ref
There is another bug in the tree mod log stuff in that we're calling
tree_mod_log_free_eb every single time a block is cow'ed.  The problem with this
is that if this block is shared by multiple snapshots we will call this multiple
times per block, so if we go to rewind the mod log for this block we'll BUG_ON()
in __tree_mod_log_rewind because we try to rewind a free twice.  We only want to
call tree_mod_log_free_eb if we are actually freeing the block.  With this patch
I no longer hit the panic in __tree_mod_log_rewind.  Thanks,

Cc: stable@vger.kernel.org
Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02 11:51:20 -04:00
Josef Bacik
f1ca7e98a6 Btrfs: hold the tree mod lock in __tree_mod_log_rewind
We need to hold the tree mod log lock in __tree_mod_log_rewind since we walk
forward in the tree mod entries, otherwise we'll end up with random entries and
trip the BUG_ON() at the front of __tree_mod_log_rewind.  This fixes the panics
people were seeing when running

find /whatever -type f -exec btrfs fi defrag {} \;

Thansk,

Cc: stable@vger.kernel.org
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02 11:51:18 -04:00
Josef Bacik
261c84b662 Btrfs: make backref walking code handle skinny metadata
I missed fixing the backref stuff when I introduced the skinny metadata.  If you
try and do things like snapshot aware defrag with skinny metadata you are going
to see tons of warnings related to the backref count being less than 0.  This is
because the delayed refs will be found for stuff just fine, but it won't find
the skinny metadata extent refs.  With this patch I'm not seeing warnings
anymore.  Thanks,

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02 11:51:02 -04:00
Liu Bo
35f0399db6 Btrfs: fix crash regarding to ulist_add_merge
Several users reported this crash of NULL pointer or general protection,
the story is that we add a rbtree for speedup ulist iteration, and we
use krealloc() to address ulist growth, and krealloc() use memcpy to copy
old data to new memory area, so it's OK for an array as it doesn't use
pointers while it's not OK for a rbtree as it uses pointers.

So krealloc() will mess up our rbtree and it ends up with crash.

Reviewed-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02 11:50:59 -04:00
Miao Xie
edd1400be9 Btrfs: fix several potential problems in copy_nocow_pages_for_inode
- It makes no sense that we deal with a inode in the dead tree.
- fix the race between dio and page copy by waiting the dio completion
- avoid the page copy vs truncate/punch hole
- check if the page is in the page cache or not

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02 11:50:58 -04:00
Miao Xie
826aa0a82c Btrfs: cleanup the code of copy_nocow_pages_for_inode()
- It make no sense that we continue to do something after the error
  happened, just go back with this patch.
- remove some check of copy_nocow_pages_for_inode(), such as page check
  after write, inode check in the end of the function, because we are
  sure they exist.
- remove the unnecessary goto in the return value check of the write

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02 11:50:56 -04:00
Miao Xie
26b2589190 Btrfs: fix oops when recovering the file data by scrub function
We get oops while running btrfs replace start test,
------------[ cut here ]------------
kernel BUG at mm/filemap.c:608!
[SNIP]
Call Trace:
  [<ffffffffa04b36c7>] copy_nocow_pages_for_inode+0x217/0x3f0 [btrfs]
  [<ffffffffa04b34b0>] ? scrub_print_warning_inode+0x230/0x230 [btrfs]
  [<ffffffffa04b34b0>] ? scrub_print_warning_inode+0x230/0x230 [btrfs]
  [<ffffffffa04bb8ce>] iterate_extent_inodes+0x1ae/0x300 [btrfs]
  [<ffffffffa04bbab2>] iterate_inodes_from_logical+0x92/0xb0 [btrfs]
  [<ffffffffa04b34b0>] ? scrub_print_warning_inode+0x230/0x230 [btrfs]
  [<ffffffffa04b3b07>] copy_nocow_pages_worker+0x97/0x150 [btrfs]
  [<ffffffffa048eed4>] worker_loop+0x134/0x540 [btrfs]
  [<ffffffff816274ea>] ? __schedule+0x3ca/0x7f0
  [<ffffffffa048eda0>] ? btrfs_queue_worker+0x300/0x300 [btrfs]
  [<ffffffff8106f2f0>] kthread+0xc0/0xd0
  [<ffffffff8106f230>] ? flush_kthread_worker+0x80/0x80
  [<ffffffff8163181c>] ret_from_fork+0x7c/0xb0
  [<ffffffff8106f230>] ? flush_kthread_worker+0x80/0x80
[SNIP]
 RIP  [<ffffffff8111f4c5>] unlock_page+0x35/0x40
  RSP <ffff88010316bb98>
 ---[ end trace 421e79ad0dd72c7d ]---

it is because we forgot to lock the page again after we read data to
the page. Fix it.

Signed-off-by: Lin Feng <linfeng@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02 11:50:55 -04:00
Josef Bacik
6df9a95e63 Btrfs: make the chunk allocator completely tree lockless
When adjusting the enospc rules for relocation I ran into a deadlock because we
were relocating the only system chunk and that forced us to try and allocate a
new system chunk while holding locks in the chunk tree, which caused us to
deadlock.  To fix this I've moved all of the dev extent addition and chunk
addition out to the delayed chunk completion stuff.  We still keep the in-memory
stuff which makes sure everything is consistent.

One change I had to make was to search the commit root of the device tree to
find a free dev extent, and hold onto any chunk em's that we allocated in that
transaction so we do not allocate the same dev extent twice.  This has the side
effect of fixing a bug with balance that has been there ever since balance
existed.  Basically you can free a block group and it's dev extent and then
immediately allocate that dev extent for a new block group and write stuff to
that dev extent, all within the same transaction.  So if you happen to crash
during a balance you could come back to a completely broken file system.  This
patch should keep these sort of things from happening in the future since we
won't be able to allocate free'd dev extents until after the transaction
commits.  This has passed all of the xfstests and my super annoying stress test
followed by a balance.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02 11:50:53 -04:00
Josef Bacik
68a7342c51 Btrfs: cleanup orphaned root orphan item
I hit a weird problem were my root item had been deleted but the orphan item had
not.  This isn't necessarily a problem, but it keeps the file system from being
mounted.  To fix this we just need to axe the orphan item if we can't find the
fs root when we're putting them altogether.  With this patch I was able to
successfully mount my file system.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02 11:50:52 -04:00
Miao Xie
a70c6172e7 Btrfs: fix wrong mirror number tuning
Now reading the data from the target device of the replace operation is allowed,
so the mirror number that is greater than the stripes number of a chunk is valid,
we will tune it when we find there is no target device later. Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02 11:50:50 -04:00
Miao Xie
e6da5d2ec9 Btrfs: cleanup redundant code in btrfs_submit_direct()
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02 11:50:48 -04:00
Miao Xie
f51a4a1826 Btrfs: remove btrfs_sector_sum structure
Using the structure btrfs_sector_sum to keep the checksum value is
unnecessary, because the extents that btrfs_sector_sum points to are
continuous, we can find out the expected checksums by btrfs_ordered_sum's
bytenr and the offset, so we can remove btrfs_sector_sum's bytenr. After
removing bytenr, there is only one member in the structure, so it makes
no sense to keep the structure, just remove it, and use a u32 array to
store the checksum value.

By this change, we don't use the while loop to get the checksums one by
one. Now, we can get several checksum value at one time, it improved the
performance by ~74% on my SSD (31MB/s -> 54MB/s).

test command:
 # dd if=/dev/zero of=/mnt/btrfs/file0 bs=1M count=1024 oflag=sync

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02 11:50:47 -04:00
Josef Bacik
7ee9e4405f Btrfs: check if we can nocow if we don't have data space
We always just try and reserve data space when we write, but if we are out of
space but have prealloc'ed extents we should still successfully write.  This
patch will try and see if we can write to prealloc'ed space and if we can go
ahead and allow the write to continue.  With this patch we now pass xfstests
generic/274.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02 11:50:45 -04:00
Josef Bacik
925a6efb8f Btrfs: stop using try_to_writeback_inodes_sb_nr to flush delalloc
try_to_writeback_inodes_sb_nr returns 1 if writeback is already underway, which
is completely fraking useless for us as we need to make sure pages are actually
written before we go and check if there are ordered extents.  So replace this
with an open coding of try_to_writeback_inodes_sb_nr minus the writeback
underway check so that we are sure to actually have flushed some dirty pages out
and will have ordered extents to use.  With this patch xfstests generic/273 now
passes.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02 11:50:43 -04:00
Josef Bacik
b150a4f10d Btrfs: use a percpu to keep track of possibly pinned bytes
There are all of these checks in the ENOSPC code to see if committing the
transaction would free up enough space to make the allocation.  This is because
early on we just committed the transaction and hoped and prayed, which resulted
in cases where it took _forever_ to get an ENOSPC when we really were out of
space.  So we check space_info->bytes_pinned, except this isn't completely true
because it doesn't account for space we may free but are stuck in delayed refs.
So tests like xfstests 226 would fail because we wouldn't commit the transaction
to free up the data space.  So instead add a percpu counter that will be a
little fuzzier, it will add bytes as soon as we try to free up the space, and
remove any space it doesn't actually free up when we get around to doing the
actual free.  We then 0 out this counter every transaction period so we have a
better idea of how much space we will actually free up by committing this
transaction.  With this patch we now pass xfstests 226.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02 11:50:42 -04:00
Josef Bacik
f23b5a5995 Btrfs: check for actual acls rather than just xattrs when caching no acl
We have an optimization that will go ahead and cache no acls on an inode if
there are no xattrs on the inode.  This saves us a lookup later to check the
acls for writes or any other access.  The problem is I use selinux so I always
have an xattr on inodes, so make this test a little smarter and check for the
actual acl hash on the key and if it isn't there then we still get to cache no
acl which makes everybody who uses selinux a little happier.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02 11:50:40 -04:00
Aruna Balakrishnaiah
1d8b368ab4 pstore: Add hsize argument in write_buf call of pstore_ftrace_call
Incorporate the addition of hsize argument in write_buf callback
of pstore. This was forgotten in

    6bbbca7359
    pstore: Pass header size in the pstore write callback

Causing a build failure when ftrace and pstore are enabled.

Signed-off-by: Aruna Balakrishnaiah <aruna@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-07-02 18:39:37 +10:00
Jaegeuk Kim
a1dd3c13ce f2fs: fix to recover i_size from roll-forward
If user requests many data writes and fsync together, the last updated i_size
should be stored to the inode block consistently.

But, previous write_end just marks the inode as dirty and doesn't update its
metadata into its inode block.
After that, fsync just writes the inode block with newly updated data index
excluding inode metadata updates.

So, this patch introduces write_end in which updates inode block too when the
i_size is changed.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-07-02 08:48:16 +09:00
Gu Zheng
5ebefc5b40 f2fs: remove the unused argument "sbi" of func destroy_fsync_dnodes()
As destroy_fsync_dnodes() is a simple list-cleanup func, so delete the unused
and unrelated f2fs_sb_info argument of it.

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-07-02 08:48:15 +09:00
Jaegeuk Kim
763bfe1bc5 f2fs: remove reusing any prefree segments
This patch removes check_prefree_segments initially designed to enhance the
performance by narrowing the range of LBA usage across the whole block device.

When allocating a new segment, previous f2fs tries to find proper prefree
segments, and then, if finds a segment, it reuses the segment for further
data or node block allocation.

However, I found that this was totally wrong approach since the prefree segments
have several data or node blocks that will be used by the roll-forward mechanism
operated after sudden-power-off.

Let's assume the following scenario.

/* write 8MB with fsync */
for (i = 0; i < 2048; i++) {
	offset = i * 4096;
	write(fd, offset, 4KB);
	fsync(fd);
}

In this case, naive segment allocation sequence will be like:
 data segment: x, x+1, x+2, x+3
 node segment: y, y+1, y+2, y+3.

But, if we can reuse prefree segments, the sequence can be like:
 data segment: x, x+1, y, y+1
 node segment: y, y+1, y+2, y+3.
Because, y, y+1, and y+2 became prefree segments one by one, and those are
reused by data allocation.

After conducting this workload, we should consider how to recover the latest
inode with its data.
If we reuse the prefree segments such as y or y+1, we lost the old node blocks
so that f2fs even cannot start roll-forward recovery.

Therefore, I suggest that we should remove reusing prefree segments.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-07-02 08:48:15 +09:00
Gu Zheng
6cc4af5606 f2fs: code cleanup and simplify in func {find/add}_gc_inode
This patch simplifies list operations in find_gc_inode and add_gc_inode.
Just simple code cleanup.

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
[Jaegeuk Kim: add description]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-07-02 08:48:14 +09:00
Namjae Jeon
8736fbf003 f2fs: optimize the init_dirty_segmap function
Optimize the while loop condition

Since this condition will always be true and while loop will
be terminated by the following condition in code:

if (segno >= TOTAL_SEGS(sbi))
    break;
Hence we can replace the while loop condition with while(1)
instead of always checking for segno to be less than Total segs.

Also we do not need to use TOTAL_SEGS() everytime. We can store
this value in a local variable since this value is constant.

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Pankaj Kumar <pankaj.km@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-07-02 08:48:14 +09:00
Jaegeuk Kim
060dd67b3c f2fs: fix an endian conversion bug detected by sparse
This patch should fix the following bug reported by kbuild test robot.

fs/f2fs/recovery.c:233:33: sparse: incorrect type in assignment
(different base types)

parse warnings: (new ones prefixed by >>)

>> recovery.c:233: sparse: incorrect type in assignment (different base types)
   recovery.c:233:    expected unsigned int [unsigned] [assigned] ofs_in_node
   recovery.c:233:    got restricted __le16 [assigned] [usertype] ofs_in_node
>> recovery.c:238: sparse: incorrect type in assignment (different base types)
   recovery.c:238:    expected unsigned int [unsigned] ofs_in_node
   recovery.c:238:    got restricted __le16 [assigned] [usertype] ofs_in_node

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-07-02 08:48:13 +09:00
Jaegeuk Kim
7e586fa024 f2fs: fix crc endian conversion
While calculating CRC for the checkpoint block, we use __u32, but when storing
the crc value to the disk, we use __le32.

Let's fix the inconsistency.

Reported-and-Tested-by: Oded Gabbay <ogabbay@advaoptical.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-07-02 08:47:35 +09:00
J. Bruce Fields
d08d32e6e5 nfsd4: return delegation immediately if lease fails
This case shouldn't happen--the administrator shouldn't really allow
other applications access to the export until clients have had the
chance to reclaim their state--but if it does then we should set the
"return this lease immediately" bit on the reply.  That still leaves
some small races, but it's the best the protocol allows us to do in the
case a lease is ripped out from under us....

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-07-01 17:32:07 -04:00
J. Bruce Fields
0a262ffb75 nfsd4: do not throw away 4.1 lock state on last unlock
This reverts commit eb2099f31b "nfsd4:
release lockowners on last unlock in 4.1 case".  Trond identified
language in rfc 5661 section 8.2.4 which forbids this behavior:

	Stateids associated with byte-range locks are an exception.
	They remain valid even if a LOCKU frees all remaining locks, so
	long as the open file with which they are associated remains
	open, unless the client frees the stateids via the FREE_STATEID
	operation.

And bakeathon 2013 testing found a 4.1 freebsd client was getting an
incorrect BAD_STATEID return from a FREE_STATEID in the above situation
and then failing.

The spec language honestly was probably a mistake but at this point with
implementations already following it we're probably stuck with that.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-07-01 17:32:06 -04:00
J. Bruce Fields
89f6c3362c nfsd4: delegation-based open reclaims should bypass permissions
We saw a v4.0 client's create fail as follows:

	- open create succeeds and gets a read delegation
	- client attempts to set mode on new file, gets DELAY while
	  server recalls delegation.
	- client attempts a CLAIM_DELEGATE_CUR open using the
	  delegation, gets error because of new file mode.

This probably can't happen on a recent kernel since we're no longer
giving out delegations on create opens.  Nevertheless, it's a
bug--reclaim opens should bypass permission checks.

Reported-by: Steve Dickson <steved@redhat.com>
Reported-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-07-01 17:32:05 -04:00
J. Bruce Fields
590b743143 nfsd4: minor read_buf cleanup
The code to step to the next page seems reasonably self-contained.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-07-01 17:32:03 -04:00
J. Bruce Fields
247500820e nfsd4: fix decoding of compounds across page boundaries
A freebsd NFSv4.0 client was getting rare IO errors expanding a tarball.
A network trace showed the server returning BAD_XDR on the final getattr
of a getattr+write+getattr compound.  The final getattr started on a
page boundary.

I believe the Linux client ignores errors on the post-write getattr, and
that that's why we haven't seen this before.

Cc: stable@vger.kernel.org
Reported-by: Rick Macklem <rmacklem@uoguelph.ca>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-07-01 17:29:40 -04:00
J. Bruce Fields
99c415156c nfsd4: clean up nfs4_open_delegation
The nfs4_open_delegation logic is unecessarily baroque.

Also stop pretending we support write delegations in several places.

Some day we will support write delegations, but when that happens adding
back in these flag parameters will be the easy part.  For now they're
just confusing.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-07-01 17:23:07 -04:00
Steve Dickson
9a0590aec3 NFSD: Don't give out read delegations on creates
When an exclusive create is done with the mode bits
set (aka open(testfile, O_CREAT | O_EXCL, 0777)) this
causes a OPEN op followed by a SETATTR op. When a
read delegation is given in the OPEN, it causes
the SETATTR to delay with EAGAIN until the
delegation is recalled.

This patch caused exclusive creates to give out
a write delegation (which turn into no delegation)
which allows the SETATTR seamlessly succeed.

Signed-off-by: Steve Dickson <steved@redhat.com>
[bfields: do this for any CREATE, not just exclusive; comment]
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-07-01 17:23:07 -04:00
J. Bruce Fields
57569a7070 nfsd4: allow client to send no cb_sec flavors
In testing I notice that some of the pynfs tests forget to send any
cb_sec flavors, and that we haven't necessarily errored out in that case
before.

I'll fix pynfs, but am also inclined to default to trying AUTH_NONE in
that case in case this is something clients actually do.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-07-01 17:23:07 -04:00
J. Bruce Fields
b78724b705 nfsd4: fail attempts to request gss on the backchannel
We don't support gss on the backchannel.  We should state that fact up
front rather than just letting things continue and later making the
client try to figure out why the backchannel isn't working.

Trond suggested instead returning NFS4ERR_NOENT.  I think it would be
tricky for the client to distinguish between the case "I don't support
gss on the backchannel" and "I can't find that in my cache, please
create another context and try that instead", and I'd prefer something
that currently doesn't have any other meaning for this operation, hence
the (somewhat arbitrary) NFS4ERR_ENCR_ALG_UNSUPP.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-07-01 17:23:06 -04:00
J. Bruce Fields
57266a6e91 nfsd4: implement minimal SP4_MACH_CRED
Do a minimal SP4_MACH_CRED implementation suggested by Trond, ignoring
the client-provided spo_must_* arrays and just enforcing credential
checks for the minimum required operations.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-07-01 17:23:06 -04:00
J. Bruce Fields
0dc1531aca svcrpc: store gss mech in svc_cred
Store a pointer to the gss mechanism used in the rq_cred and cl_cred.
This will make it easier to enforce SP4_MACH_CRED, which needs to
compare the mechanism used on the exchange_id with that used on
protected operations.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-07-01 17:23:06 -04:00
Eliezer Tamir
91e2fd3378 net: avoid calling sched_clock when LLS is off
Change Low Latency Sockets code for select and poll so that
when LLS is disabled sched_clock() is never called.

Also, avoid sending POLL_LL to sockets if disabled.

Reported-by: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-01 14:06:47 -07:00