Commit graph

44562 commits

Author SHA1 Message Date
Eric Biggers
89b7fb8298 fscrypt: remove stale comment from fscrypt_d_revalidate()
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-06-28 12:18:52 -07:00
Eric Biggers
d56de4e926 fscrypt: remove error messages for skcipher_request_alloc() failure
skcipher_request_alloc() can only fail due to lack of memory, and in
that case the memory allocator will have already printed a detailed
error message.  Thus, remove the redundant error messages from fscrypt.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-06-28 12:18:52 -07:00
Eric Biggers
f68d3b84ae fscrypt: remove unnecessary NULL check when allocating skcipher
crypto_alloc_skcipher() returns an ERR_PTR() on failure, not NULL.
Remove the unnecessary check for NULL.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-06-28 12:18:52 -07:00
Eric Biggers
fb10231825 fscrypt: clean up after fscrypt_prepare_lookup() conversions
Now that all filesystems have been converted to use
fscrypt_prepare_lookup(), we can remove the fscrypt_set_d_op() and
fscrypt_set_encrypted_dentry() functions as well as un-export
fscrypt_d_ops.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-06-28 12:18:52 -07:00
Eric Biggers
39b1444906 fscrypt: use unbound workqueue for decryption
Improve fscrypt read performance by switching the decryption workqueue
from bound to unbound.  With the bound workqueue, when multiple bios
completed on the same CPU, they were decrypted on that same CPU.  But
with the unbound queue, they are now decrypted in parallel on any CPU.

Although fscrypt read performance can be tough to measure due to the
many sources of variation, this change is most beneficial when
decryption is slow, e.g. on CPUs without AES instructions.  For example,
I timed tarring up encrypted directories on f2fs.  On x86 with AES-NI
instructions disabled, the unbound workqueue improved performance by
about 25-35%, using 1 to NUM_CPUs jobs with 4 or 8 CPUs available.  But
with AES-NI enabled, performance was unchanged to within ~2%.

I also did the same test on a quad-core ARM CPU using xts-speck128-neon
encryption.  There performance was usually about 10% better with the
unbound workqueue, bringing it closer to the unencrypted speed.

The unbound workqueue may be worse in some cases due to worse locality,
but I think it's still the better default.  dm-crypt uses an unbound
workqueue by default too, so this change makes fscrypt match.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-06-28 12:18:23 -07:00
Lianjun Huang
226f96b03d ANDROID: sdcardfs: fix potential crash when reserved_mb is not zero
sdcardfs_mkdir() calls check_min_free_space(). When reserved_mb is not zero, a negative dentry will be passed to
ext4_statfs() at last and ext4_statfs() will crash. The parent dentry is positive. So we use the parent dentry to
check free space.

Change-Id: I80ab9623fe59ba911f4cc9f0e029a1c6f7ee421b
Signed-off-by: Lianjun Huang <huanglianjun@vivo.com>
2018-06-18 15:08:48 -07:00
Greg Kroah-Hartman
07c01385fb This is the 4.4.138 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlskwjMACgkQONu9yGCS
 aT51vg//RP9YDjok+nD9DZuaPilTlplnPGSGyWQM9ZwlwnTle1bQa2WJ4cFTORPI
 G8I/LBNebwSLA/4IuHqM0nTydDD6Csm4hAsZR1REZWu3zB2GObHSzr2ijc/sHh7R
 dKtpa0N5WE0TQVtEn96V/o6uB2ohEWyS1RvyCB35PYfhpDzEdHnBDKq//Phvoj8q
 2nQPGq65ivCP/peUsfd9rIdT9fkRIP0hAbybjmO+Ie2aJB1vI0ieRz/mxmD6XRYC
 w0pUQtoFopsZG16VvXgtxqsOMu7Wg8P8bGM0kDruiov5vGDOHQnLc7uTNKNKsJV4
 vPzdBlLI8D4Ro1QBJ1q9oxKBKeeMwT9Y7uQNCDcLukRMol6RstYMMakEZh7GVhGp
 GYs6ATCzUT5NXRHz95wuepiMUfKvrh3G4+OBhL6rtaOshWNHrT7PlSnzJ0S1hZ+n
 jF0DT0JRmCrdjK8OXalECfODaWik9m3g6zYEpGu7R8uY/Pbsw5Uqq23rzUu0hFfN
 a8ktz6FSxCMpNXW36tm01/pvwQpmDJzsZXKW1F8V58WPo4d5RN/MQHV7LMUb8cH8
 +va6t4y1gA2mV5zBChQ7lHbqlR6xpwnIjy6y8w+/tYiIiYdN8VwrTno2XMoVcPf4
 a96qv1INo6kxuBlD1KkgGcRD28USXZXrrBWuZ+oL6Yab0waL5hk=
 =pbuo
 -----END PGP SIGNATURE-----

Merge 4.4.138 into android-4.4

Changes in 4.4.138
	x86/fpu: Fix early FPU command-line parsing
	x86: Remove unused function cpu_has_ht_siblings()
	x86/cpufeature: Remove unused and seldomly used cpu_has_xx macros
	x86/fpu: Disable MPX when eagerfpu is off
	x86/fpu: Disable AVX when eagerfpu is off
	x86/fpu: Default eagerfpu=on on all CPUs
	x86/fpu: Fix 'no387' regression
	x86/fpu: Revert ("x86/fpu: Disable AVX when eagerfpu is off")
	x86/fpu: Fix eager-FPU handling on legacy FPU machines
	x86/fpu: Hard-disable lazy FPU mode
	x86/fpu: Fix FNSAVE usage in eagerfpu mode
	x86/fpu: Fix math emulation in eager fpu mode
	af_key: Always verify length of provided sadb_key
	x86/crypto, x86/fpu: Remove X86_FEATURE_EAGER_FPU #ifdef from the crc32c code
	gpio: No NULL owner
	Clarify (and fix) MAX_LFS_FILESIZE macros
	KVM: x86: introduce linear_{read,write}_system
	KVM: x86: pass kvm_vcpu to kvm_read_guest_virt and kvm_write_guest_virt_system
	serial: samsung: fix maxburst parameter for DMA transactions
	vmw_balloon: fixing double free when batching mode is off
	kvm: x86: use correct privilege level for sgdt/sidt/fxsave/fxrstor access
	Input: goodix - add new ACPI id for GPD Win 2 touch screen
	Input: elan_i2c - add ELAN0612 (Lenovo v330 14IKB) ACPI ID
	crypto: vmx - Remove overly verbose printk from AES init routines
	Linux 4.4.138

Change-Id: I443664406b736e16c77c99d83e9bdf02f4511ee4
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2018-06-16 10:33:03 +02:00
Borislav Petkov
082efbb048 x86/cpufeature: Remove unused and seldomly used cpu_has_xx macros
commit 362f924b64ba0f4be2ee0cb697690c33d40be721 upstream.

Those are stupid and code should use static_cpu_has_safe() or
boot_cpu_has() instead. Kill the least used and unused ones.

The remaining ones need more careful inspection before a conversion can
happen. On the TODO.

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/1449481182-27541-4-git-send-email-bp@alien8.de
Cc: David Sterba <dsterba@suse.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-06-16 09:54:24 +02:00
Greg Kroah-Hartman
a2e2217bd8 This is the 4.4.137 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlshJwIACgkQONu9yGCS
 aT79XQ//S3GUXgBKG87+4HlPW0bebORgAbMTw47VFAm09jKRlaI65TXpT+EaXy+R
 iuHeyrie9SmN6eG2P5o25txl0NRL/WoZcKHBRdT3P8of19iG64g5zPd2meWgO8vO
 TtGy2fTNnmYoUEcCtDIQayRTPkeWbLUNE56grA1/LtxEeCmLTsm7tzhxXy+qzQSg
 YdlpKeTMTy9yUWc8Dt7Mt7Njmq+hj7HUVs42fvfddcW2y4jZl2rKnZ6HN2uADIJx
 sKyzeN1i8LMBdP216kg785jeBaaMnk01hhd/F+D+S+qTBrHbacO+reO9CkEEhovM
 UaneMG2j3t3THdaPs+amx/39IX3t+duiyyz+zI6kKlYJ9WrhbBuOJhG97B6jbnN2
 2QNs7Ll6cucRwcOY6pMbIzh46bIsUtUODB1/gLn4ALiB8OBWh78qH5CISBwpzGVY
 UNtBvV+nk9/aoo1BwdD2IZ6QQtafVwK3pzsb3bdGVJWZPkeGKbsQ3Y8ffGIdAbQg
 6MsBOCyF9t+dSk9ShGwdNr5vOp7VfmbOoUtHBp+czYFObWJsMzN5bG2dfvqQU0Sd
 LH0AkA4j4DWJQpl9OVhBCuUJ/SMk693b6wlO2dMnLsQg/I8j2eihtc5CTcJ6pH1V
 BjO6Uc3MwqmyR4/vF8acYKhL1zY6wtZtVkTTiKDoEuyDtZTTaqs=
 =Gmn/
 -----END PGP SIGNATURE-----

Merge 4.4.137 into android-4.4

Changes in 4.4.137
	tpm: do not suspend/resume if power stays on
	tpm: self test failure should not cause suspend to fail
	mmap: introduce sane default mmap limits
	mmap: relax file size limit for regular files
	kconfig: Avoid format overflow warning from GCC 8.1
	xfs: fix incorrect log_flushed on fsync
	drm: set FMODE_UNSIGNED_OFFSET for drm files
	brcmfmac: Fix check for ISO3166 code
	bnx2x: use the right constant
	dccp: don't free ccid2_hc_tx_sock struct in dccp_disconnect()
	enic: set DMA mask to 47 bit
	ip6mr: only set ip6mr_table from setsockopt when ip6mr_new_table succeeds
	ipv4: remove warning in ip_recv_error
	isdn: eicon: fix a missing-check bug
	netdev-FAQ: clarify DaveM's position for stable backports
	net/packet: refine check for priv area size
	net: usb: cdc_mbim: add flag FLAG_SEND_ZLP
	packet: fix reserve calculation
	qed: Fix mask for physical address in ILT entry
	net/mlx4: Fix irq-unsafe spinlock usage
	team: use netdev_features_t instead of u32
	rtnetlink: validate attributes in do_setlink()
	net: phy: broadcom: Fix bcm_write_exp()
	net: metrics: add proper netlink validation
	Linux 4.4.137

Change-Id: I247cc9905e330810546f7105bdf723bf84c3308f
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2018-06-13 16:36:26 +02:00
Amir Goldstein
66824bdf4c xfs: fix incorrect log_flushed on fsync
commit 47c7d0b19502583120c3f396c7559e7a77288a68 upstream.

When calling into _xfs_log_force{,_lsn}() with a pointer
to log_flushed variable, log_flushed will be set to 1 if:
1. xlog_sync() is called to flush the active log buffer
AND/OR
2. xlog_wait() is called to wait on a syncing log buffers

xfs_file_fsync() checks the value of log_flushed after
_xfs_log_force_lsn() call to optimize away an explicit
PREFLUSH request to the data block device after writing
out all the file's pages to disk.

This optimization is incorrect in the following sequence of events:

 Task A                    Task B
 -------------------------------------------------------
 xfs_file_fsync()
   _xfs_log_force_lsn()
     xlog_sync()
        [submit PREFLUSH]
                           xfs_file_fsync()
                             file_write_and_wait_range()
                               [submit WRITE X]
                               [endio  WRITE X]
                             _xfs_log_force_lsn()
                               xlog_wait()
        [endio  PREFLUSH]

The write X is not guarantied to be on persistent storage
when PREFLUSH request in completed, because write A was submitted
after the PREFLUSH request, but xfs_file_fsync() of task A will
be notified of log_flushed=1 and will skip explicit flush.

If the system crashes after fsync of task A, write X may not be
present on disk after reboot.

This bug was discovered and demonstrated using Josef Bacik's
dm-log-writes target, which can be used to record block io operations
and then replay a subset of these operations onto the target device.
The test goes something like this:
- Use fsx to execute ops of a file and record ops on log device
- Every now and then fsync the file, store md5 of file and mark
  the location in the log
- Then replay log onto device for each mark, mount fs and compare
  md5 of file to stored value

Cc: Christoph Hellwig <hch@lst.de>
Cc: Josef Bacik <jbacik@fb.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-06-13 16:15:27 +02:00
Greg Kroah-Hartman
fb7e319634 This is the 4.4.136 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlsX88AACgkQONu9yGCS
 aT4fEBAAygf8GZqR8ql76DdEBREkgTgGrne2+Rq56eylWZyycU2FpZVLe2ct7yjf
 rbF2XCtxdPmia++z0WvmslDbtUeqSSPOz1jZBEERmyZpjpOkDTwsMUfz75Gvpi83
 ZJS4KXseL9W/jrSyIAbHJ4Fq1ffmoWzN8mEepde26Ic2DJ/3mB2Dphgg95UjI7rw
 KGg3+Jjr21ojrEmI1BOVItgZ6iU0jTgCkwrYrP1eI+OzRjasGMMJRh/HYBfr3GEY
 N6Ggi5PyIWF/DOeTp53hajOAFbt5WTFK6hiiwLqz+6XQuhY45N1YuXgT/vszZmKz
 nngD5p5+GWKZoXtRXoLMXts8EdZ55yoyj6dkIOM5W62C3HhxjqpPrLXJMdtm5eO/
 tL8/vbB6AzniFB/hQS4IqfqQ6sizcAzGi/vP0eOW2I7K9WIsbXR9vt1BcvVaIrRF
 O/9xX4QJrceNIUzq25sdS7vv4fk7O0AUq/bZtYWWjKY+4E2LhAPoHgmB7cF/M8jJ
 K8BtMtClyDqfpIhJiH3PDYdY6jRfYKcNUhMZLBYN9uRwa/5l8cC4AIKBEY8IyhgB
 i05G8YadInSSqf2eRGZ97Qpn5MVYm2G/r2BtpNLbCfIYUfvnHD7mWfteVjVw4Yjh
 Q6ERVHkvjEFsn1BPBd34OMVJlDz0oqNT92NwiAlXiA4Sxizvvh4=
 =0oNX
 -----END PGP SIGNATURE-----

Merge 4.4.136 into android-4.4

Changes in 4.4.136
	arm64: lse: Add early clobbers to some input/output asm operands
	powerpc/64s: Clear PCR on boot
	USB: serial: cp210x: use tcflag_t to fix incompatible pointer type
	sh: New gcc support
	xfs: detect agfl count corruption and reset agfl
	Revert "ima: limit file hash setting by user to fix and log modes"
	Input: elan_i2c_smbus - fix corrupted stack
	tracing: Fix crash when freeing instances with event triggers
	selinux: KASAN: slab-out-of-bounds in xattr_getsecurity
	cfg80211: further limit wiphy names to 64 bytes
	rtlwifi: rtl8192cu: Remove variable self-assignment in rf.c
	ASoC: Intel: sst: remove redundant variable dma_dev_name
	irda: fix overly long udelay()
	tcp: avoid integer overflows in tcp_rcv_space_adjust()
	i2c: rcar: make sure clocks are on when doing clock calculation
	i2c: rcar: rework hw init
	i2c: rcar: remove unused IOERROR state
	i2c: rcar: remove spinlock
	i2c: rcar: refactor setup of a msg
	i2c: rcar: init new messages in irq
	i2c: rcar: don't issue stop when HW does it automatically
	i2c: rcar: check master irqs before slave irqs
	i2c: rcar: revoke START request early
	dmaengine: usb-dmac: fix endless loop in usb_dmac_chan_terminate_all()
	iio:kfifo_buf: check for uint overflow
	MIPS: ptrace: Fix PTRACE_PEEKUSR requests for 64-bit FGRs
	MIPS: prctl: Disallow FRE without FR with PR_SET_FP_MODE requests
	scsi: scsi_transport_srp: Fix shost to rport translation
	stm class: Use vmalloc for the master map
	hwtracing: stm: fix build error on some arches
	drm/i915: Disable LVDS on Radiant P845
	Kbuild: change CC_OPTIMIZE_FOR_SIZE definition
	fix io_destroy()/aio_complete() race
	mm: fix the NULL mapping case in __isolate_lru_page()
	sparc64: Fix build warnings with gcc 7.
	Linux 4.4.136

Change-Id: I3457f995cf22c65952271ecd517a46144ac4dc79
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2018-06-06 18:53:06 +02:00
Al Viro
914812331d fix io_destroy()/aio_complete() race
commit 4faa99965e027cc057c5145ce45fa772caa04e8d upstream.

If io_destroy() gets to cancelling everything that can be cancelled and
gets to kiocb_cancel() calling the function driver has left in ->ki_cancel,
it becomes vulnerable to a race with IO completion.  At that point req
is already taken off the list and aio_complete() does *NOT* spin until
we (in free_ioctx_users()) releases ->ctx_lock.  As the result, it proceeds
to kiocb_free(), freing req just it gets passed to ->ki_cancel().

Fix is simple - remove from the list after the call of kiocb_cancel().  All
instances of ->ki_cancel() already have to cope with the being called with
iocb still on list - that's what happens in io_cancel(2).

Cc: stable@kernel.org
Fixes: 0460fef2a9 "aio: use cancellation list lazily"
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-06-06 16:46:23 +02:00
Brian Foster
55ffb6acd2 xfs: detect agfl count corruption and reset agfl
commit a27ba2607e60312554cbcd43fc660b2c7f29dc9c upstream.

The struct xfs_agfl v5 header was originally introduced with
unexpected padding that caused the AGFL to operate with one less
slot than intended. The header has since been packed, but the fix
left an incompatibility for users who upgrade from an old kernel
with the unpacked header to a newer kernel with the packed header
while the AGFL happens to wrap around the end. The newer kernel
recognizes one extra slot at the physical end of the AGFL that the
previous kernel did not. The new kernel will eventually attempt to
allocate a block from that slot, which contains invalid data, and
cause a crash.

This condition can be detected by comparing the active range of the
AGFL to the count. While this detects a padding mismatch, it can
also trigger false positives for unrelated flcount corruption. Since
we cannot distinguish a size mismatch due to padding from unrelated
corruption, we can't trust the AGFL enough to simply repopulate the
empty slot.

Instead, avoid unnecessarily complex detection logic and and use a
solution that can handle any form of flcount corruption that slips
through read verifiers: distrust the entire AGFL and reset it to an
empty state. Any valid blocks within the AGFL are intentionally
leaked. This requires xfs_repair to rectify (which was already
necessary based on the state the AGFL was found in). The reset
mitigates the side effect of the padding mismatch problem from a
filesystem crash to a free space accounting inconsistency. The
generic approach also means that this patch can be safely backported
to kernels with or without a packed struct xfs_agfl.

Check the AGF for an invalid freelist count on initial read from
disk. If detected, set a flag on the xfs_perag to indicate that a
reset is required before the AGFL can be used. In the first
transaction that attempts to use a flagged AGFL, reset it to empty,
warn the user about the inconsistency and allow the freelist fixup
code to repopulate the AGFL with new blocks. The xfs_perag flag is
cleared to eliminate the need for repeated checks on each block
allocation operation.

This allows kernels that include the packing fix commit 96f859d52bcb
("libxfs: pack the agfl header structure so XFS_AGFL_SIZE is correct")
to handle older unpacked AGFL formats without a filesystem crash.

Suggested-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by Dave Chiluk <chiluk+linuxxfs@indeed.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chiluk <chiluk+linuxxfs@indeed.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-06-06 16:46:20 +02:00
Jaegeuk Kim
98b6097d0f FROMLIST: f2fs: run fstrim asynchronously if runtime discard is on
Cherry-picked from:
        origin/upstream-f2fs-stable-linux-4.4.y

We don't need to wait for whole bunch of discard candidates in fstrim, since
runtime discard will issue them in idle time.

Change-Id: I32602711842d603cca030765eab49b337789e8ad
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-06-05 14:00:40 -07:00
Jaegeuk Kim
73450231ff f2fs: run fstrim asynchronously if runtime discard is on
We don't need to wait for whole bunch of discard candidates in fstrim, since
runtime discard will issue them in idle time.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-06-04 21:18:51 -07:00
Jaegeuk Kim
b68f27c362 FROMLIST: f2fs: early updates queued for v4.18-rc1
Cherry-picked from:
  origin/upstream-f2fs-stable-linux-4.4.y

85d2070f60 ("f2fs: turn down IO priority of discard from background")
4738f527db ("f2fs: don't split checkpoint in fstrim")
31e2713935 ("f2fs: issue discard commands proactively in high fs utilization")
70676ef736 ("f2fs: add fsync_mode=nobarrier for non-atomic files")
bb53d06b5f ("f2fs: let fstrim issue discard commands in lower priority")

Signed-off-by: Jaegeuk Kim <jaegeuk@google.com>
2018-05-30 19:32:34 -07:00
Chao Yu
85d2070f60 f2fs: turn down IO priority of discard from background
In order to avoid interfering normal r/w IO, let's turn down IO
priority of discard issued from background.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-30 11:39:57 -07:00
Chao Yu
4738f527db f2fs: don't split checkpoint in fstrim
Now, we issue discard asynchronously in separated thread instead of in
checkpoint, after that, we won't encounter long latency in checkpoint
due to huge number of synchronous discard command handling, so, we don't
need to split checkpoint to do trim in batch, merge it and obsolete
related sysfs entry.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-30 11:39:54 -07:00
Jaegeuk Kim
31e2713935 f2fs: issue discard commands proactively in high fs utilization
In the high utilization like over 80%, we don't expect huge # of large discard
commands, but do many small pending discards which affects FTL GCs a lot.
Let's issue them in that case.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-30 11:39:51 -07:00
Jaegeuk Kim
70676ef736 f2fs: add fsync_mode=nobarrier for non-atomic files
For non-atomic files, this patch adds an option to give nobarrier which
doesn't issue flush commands to the device.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-30 11:39:48 -07:00
Jaegeuk Kim
bb53d06b5f f2fs: let fstrim issue discard commands in lower priority
The fstrim gathers huge number of large discard commands, and tries to issue
without IO awareness, which results in long user-perceive IO latencies on
READ, WRITE, and FLUSH in UFS. We've observed some of commands take several
seconds due to long discard latency.

This patch limits the maximum size to 2MB per candidate, and check IO congestion
when issuing them to disk.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-30 11:39:45 -07:00
Greg Kroah-Hartman
6e37ae0e7a This is the 4.4.134 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlsOO14ACgkQONu9yGCS
 aT4ulhAAhMVYSRa/cOFm0BHxSL/59WmJTa3Na8TJqkTrJy+LRluBiKCywyiMZknp
 4rIffv4jcxcFNCpqYTjNTSStGLWCCkBLNSzxuzFv5M89Jdx4Gz1Ww1hzMESP3gxK
 puHUewSJQm7qtVOiC2l4YcW3Q6nFK0kqbCWpSkHoGVfZoX9JS2P1V8n+KFZpUH1a
 UyhVW48ainUpXfhSKJZ5xABiWYM2hcSq52RW1edNZvwuKwulZ+2EME26HgGCK7ff
 WHzGHECE6Lem+iunR26J/QtbTo8LKEyU0F039X21E7FIxf33S0xyPx+MGjJfWBOo
 Q6A23mAEWwEhlMomNKzdd/iUzSVlWSzKe8LJa7GI5G6BxftN8Z0TGTnKzIDkw++M
 T6RfK03CP6c9rQ756d0fTPxdZh6ae9EN8WSot/Sbbc9SvGSfy6o4I8Y/uJygShmF
 j13JfMweC+t7/6fyUqc5dcgY0Xy7LUFiWqfPxQj6axDiT82Mx2AvQaczrPUAKr1K
 KQsetmyhHC+Cpy7ILrhUGYjEWlvQm11ZiFoX8BkocFLFWk736QA63iB7mOUpCOQR
 SKLK00dF163GJdQC6nb4wCtyBxnCg4pSoP/72Z1foPtaSd3ccJ4CLsIE6GY5sP/I
 sDlPnIlnzEDfDPIxtVfKC8e1JINP6awXwtoJJo6MnuCuP3LDb58=
 =ogZQ
 -----END PGP SIGNATURE-----

Merge 4.4.134 into android-4.4

Changes in 4.4.134
	MIPS: ptrace: Expose FIR register through FP regset
	MIPS: Fix ptrace(2) PTRACE_PEEKUSR and PTRACE_POKEUSR accesses to o32 FGRs
	KVM: Fix spelling mistake: "cop_unsuable" -> "cop_unusable"
	affs_lookup(): close a race with affs_remove_link()
	aio: fix io_destroy(2) vs. lookup_ioctx() race
	ALSA: timer: Fix pause event notification
	mmc: sdhci-iproc: fix 32bit writes for TRANSFER_MODE register
	libata: Blacklist some Sandisk SSDs for NCQ
	libata: blacklist Micron 500IT SSD with MU01 firmware
	xen-swiotlb: fix the check condition for xen_swiotlb_free_coherent
	Revert "ipc/shm: Fix shmat mmap nil-page protection"
	ipc/shm: fix shmat() nil address after round-down when remapping
	kasan: fix memory hotplug during boot
	kernel/sys.c: fix potential Spectre v1 issue
	kernel/signal.c: avoid undefined behaviour in kill_something_info
	xfs: remove racy hasattr check from attr ops
	do d_instantiate/unlock_new_inode combinations safely
	firewire-ohci: work around oversized DMA reads on JMicron controllers
	NFSv4: always set NFS_LOCK_LOST when a lock is lost.
	ALSA: hda - Use IS_REACHABLE() for dependency on input
	ASoC: au1x: Fix timeout tests in au1xac97c_ac97_read()
	kvm: x86: fix KVM_XEN_HVM_CONFIG ioctl
	tracing/hrtimer: Fix tracing bugs by taking all clock bases and modes into account
	PCI: Add function 1 DMA alias quirk for Marvell 9128
	tools lib traceevent: Simplify pointer print logic and fix %pF
	perf callchain: Fix attr.sample_max_stack setting
	tools lib traceevent: Fix get_field_str() for dynamic strings
	dm thin: fix documentation relative to low water mark threshold
	nfs: Do not convert nfs_idmap_cache_timeout to jiffies
	watchdog: sp5100_tco: Fix watchdog disable bit
	kconfig: Don't leak main menus during parsing
	kconfig: Fix automatic menu creation mem leak
	kconfig: Fix expr_free() E_NOT leak
	mac80211_hwsim: fix possible memory leak in hwsim_new_radio_nl()
	ipmi/powernv: Fix error return code in ipmi_powernv_probe()
	Btrfs: set plug for fsync
	btrfs: Fix out of bounds access in btrfs_search_slot
	Btrfs: fix scrub to repair raid6 corruption
	scsi: fas216: fix sense buffer initialization
	HID: roccat: prevent an out of bounds read in kovaplus_profile_activated()
	jffs2: Fix use-after-free bug in jffs2_iget()'s error handling path
	powerpc/numa: Use ibm,max-associativity-domains to discover possible nodes
	powerpc/numa: Ensure nodes initialized for hotplug
	RDMA/mlx5: Avoid memory leak in case of XRCD dealloc failure
	ntb_transport: Fix bug with max_mw_size parameter
	ocfs2: return -EROFS to mount.ocfs2 if inode block is invalid
	ocfs2/acl: use 'ip_xattr_sem' to protect getting extended attribute
	ocfs2: return error when we attempt to access a dirty bh in jbd2
	mm/mempolicy: fix the check of nodemask from user
	mm/mempolicy: add nodes_empty check in SYSC_migrate_pages
	asm-generic: provide generic_pmdp_establish()
	mm: pin address_space before dereferencing it while isolating an LRU page
	IB/ipoib: Fix for potential no-carrier state
	x86/power: Fix swsusp_arch_resume prototype
	firmware: dmi_scan: Fix handling of empty DMI strings
	ACPI: processor_perflib: Do not send _PPC change notification if not ready
	bpf: fix selftests/bpf test_kmod.sh failure when CONFIG_BPF_JIT_ALWAYS_ON=y
	MIPS: TXx9: use IS_BUILTIN() for CONFIG_LEDS_CLASS
	xen-netfront: Fix race between device setup and open
	xen/grant-table: Use put_page instead of free_page
	RDS: IB: Fix null pointer issue
	arm64: spinlock: Fix theoretical trylock() A-B-A with LSE atomics
	proc: fix /proc/*/map_files lookup
	cifs: silence compiler warnings showing up with gcc-8.0.0
	bcache: properly set task state in bch_writeback_thread()
	bcache: fix for allocator and register thread race
	bcache: fix for data collapse after re-attaching an attached device
	bcache: return attach error when no cache set exist
	tools/libbpf: handle issues with bpf ELF objects containing .eh_frames
	locking/qspinlock: Ensure node->count is updated before initialising node
	irqchip/gic-v3: Change pr_debug message to pr_devel
	scsi: ufs: Enable quirk to ignore sending WRITE_SAME command
	scsi: bnx2fc: Fix check in SCSI completion handler for timed out request
	scsi: sym53c8xx_2: iterator underflow in sym_getsync()
	scsi: mptfusion: Add bounds check in mptctl_hp_targetinfo()
	scsi: qla2xxx: Avoid triggering undefined behavior in qla2x00_mbx_completion()
	ARC: Fix malformed ARC_EMUL_UNALIGNED default
	usb: gadget: f_uac2: fix bFirstInterface in composite gadget
	usb: gadget: fsl_udc_core: fix ep valid checks
	usb: dwc2: Fix dwc2_hsotg_core_init_disconnected()
	selftests: memfd: add config fragment for fuse
	scsi: storvsc: Increase cmd_per_lun for higher speed devices
	scsi: aacraid: fix shutdown crash when init fails
	scsi: qla4xxx: skip error recovery in case of register disconnect.
	ARM: OMAP2+: timer: fix a kmemleak caused in omap_get_timer_dt
	ARM: OMAP3: Fix prm wake interrupt for resume
	ARM: OMAP1: clock: Fix debugfs_create_*() usage
	NFC: llcp: Limit size of SDP URI
	mac80211: round IEEE80211_TX_STATUS_HEADROOM up to multiple of 4
	md raid10: fix NULL deference in handle_write_completed()
	drm/exynos: fix comparison to bitshift when dealing with a mask
	usb: musb: fix enumeration after resume
	locking/xchg/alpha: Add unconditional memory barrier to cmpxchg()
	md: raid5: avoid string overflow warning
	kernel/relay.c: limit kmalloc size to KMALLOC_MAX_SIZE
	powerpc/bpf/jit: Fix 32-bit JIT for seccomp_data access
	s390/cio: fix return code after missing interrupt
	s390/cio: clear timer when terminating driver I/O
	ARM: OMAP: Fix dmtimer init for omap1
	smsc75xx: fix smsc75xx_set_features()
	regulatory: add NUL to request alpha2
	locking/xchg/alpha: Fix xchg() and cmpxchg() memory ordering bugs
	x86/topology: Update the 'cpu cores' field in /proc/cpuinfo correctly across CPU hotplug operations
	media: dmxdev: fix error code for invalid ioctls
	md/raid1: fix NULL pointer dereference
	batman-adv: fix packet checksum in receive path
	batman-adv: invalidate checksum on fragment reassembly
	netfilter: ebtables: convert BUG_ONs to WARN_ONs
	nvme-pci: Fix nvme queue cleanup if IRQ setup fails
	clocksource/drivers/fsl_ftm_timer: Fix error return checking
	r8152: fix tx packets accounting
	virtio-gpu: fix ioctl and expose the fixed status to userspace.
	dmaengine: rcar-dmac: fix max_chunk_size for R-Car Gen3
	bcache: fix kcrashes with fio in RAID5 backend dev
	sit: fix IFLA_MTU ignored on NEWLINK
	gianfar: Fix Rx byte accounting for ndev stats
	net/tcp/illinois: replace broken algorithm reference link
	xen/pirq: fix error path cleanup when binding MSIs
	Btrfs: send, fix issuing write op when processing hole in no data mode
	selftests/powerpc: Skip the subpage_prot tests if the syscall is unavailable
	KVM: PPC: Book3S HV: Fix VRMA initialization with 2MB or 1GB memory backing
	watchdog: f71808e_wdt: Fix magic close handling
	e1000e: Fix check_for_link return value with autoneg off
	e1000e: allocate ring descriptors with dma_zalloc_coherent
	usb: musb: call pm_runtime_{get,put}_sync before reading vbus registers
	scsi: mpt3sas: Do not mark fw_event workqueue as WQ_MEM_RECLAIM
	scsi: sd: Keep disk read-only when re-reading partition
	fbdev: Fixing arbitrary kernel leak in case FBIOGETCMAP_SPARC in sbusfb_ioctl_helper().
	xen: xenbus: use put_device() instead of kfree()
	USB: OHCI: Fix NULL dereference in HCDs using HCD_LOCAL_MEM
	netfilter: ebtables: fix erroneous reject of last rule
	bnxt_en: Check valid VNIC ID in bnxt_hwrm_vnic_set_tpa().
	workqueue: use put_device() instead of kfree()
	ipv4: lock mtu in fnhe when received PMTU < net.ipv4.route.min_pmtu
	sunvnet: does not support GSO for sctp
	net: Fix vlan untag for bridge and vlan_dev with reorder_hdr off
	batman-adv: fix header size check in batadv_dbg_arp()
	vti4: Don't count header length twice on tunnel setup
	vti4: Don't override MTU passed on link creation via IFLA_MTU
	perf/cgroup: Fix child event counting bug
	RDMA/ucma: Correct option size check using optlen
	mm/mempolicy.c: avoid use uninitialized preferred_node
	selftests: ftrace: Add probe event argument syntax testcase
	selftests: ftrace: Add a testcase for string type with kprobe_event
	selftests: ftrace: Add a testcase for probepoint
	batman-adv: fix multicast-via-unicast transmission with AP isolation
	batman-adv: fix packet loss for broadcasted DHCP packets to a server
	ARM: 8748/1: mm: Define vdso_start, vdso_end as array
	net: qmi_wwan: add BroadMobi BM806U 2020:2033
	net/usb/qmi_wwan.c: Add USB id for lt4120 modem
	net-usb: add qmi_wwan if on lte modem wistron neweb d18q1
	llc: properly handle dev_queue_xmit() return value
	mm/kmemleak.c: wait for scan completion before disabling free
	net: Fix untag for vlan packets without ethernet header
	net: mvneta: fix enable of all initialized RXQs
	sh: fix debug trap failure to process signals before return to user
	x86/pgtable: Don't set huge PUD/PMD on non-leaf entries
	fs/proc/proc_sysctl.c: fix potential page fault while unregistering sysctl table
	swap: divide-by-zero when zero length swap file on ssd
	sr: get/drop reference to device in revalidate and check_events
	Force log to disk before reading the AGF during a fstrim
	cpufreq: CPPC: Initialize shared perf capabilities of CPUs
	scsi: aacraid: Insure command thread is not recursively stopped
	dp83640: Ensure against premature access to PHY registers after reset
	mm/ksm: fix interaction with THP
	mm: fix races between address_space dereference and free in page_evicatable
	Btrfs: bail out on error during replay_dir_deletes
	Btrfs: fix NULL pointer dereference in log_dir_items
	btrfs: Fix possible softlock on single core machines
	ocfs2/dlm: don't handle migrate lockres if already in shutdown
	sched/rt: Fix rq->clock_update_flags < RQCF_ACT_SKIP warning
	KVM: VMX: raise internal error for exception during invalid protected mode state
	fscache: Fix hanging wait on page discarded by writeback
	sparc64: Make atomic_xchg() an inline function rather than a macro.
	rtc: snvs: Fix usage of snvs_rtc_enable
	net: bgmac: Fix endian access in bgmac_dma_tx_ring_free()
	Bluetooth: btusb: Add USB ID 7392:a611 for Edimax EW-7611ULB
	btrfs: tests/qgroup: Fix wrong tree backref level
	Btrfs: fix copy_items() return value when logging an inode
	btrfs: fix lockdep splat in btrfs_alloc_subvolume_writers
	xen/acpi: off by one in read_acpi_id()
	ACPI: acpi_pad: Fix memory leak in power saving threads
	powerpc/mpic: Check if cpu_possible() in mpic_physmask()
	m68k: set dma and coherent masks for platform FEC ethernets
	parisc/pci: Switch LBA PCI bus from Hard Fail to Soft Fail mode
	hwmon: (nct6775) Fix writing pwmX_mode
	rtc: hctosys: Ensure system time doesn't overflow time_t
	powerpc/perf: Prevent kernel address leak to userspace via BHRB buffer
	powerpc/perf: Fix kernel address leak via sampling registers
	tools/thermal: tmon: fix for segfault
	selftests: Print the test we're running to /dev/kmsg
	net/mlx5: Protect from command bit overflow
	ath10k: Fix kernel panic while using worker (ath10k_sta_rc_update_wk)
	ima: Fix Kconfig to select TPM 2.0 CRB interface
	ima: Fallback to the builtin hash algorithm
	virtio-net: Fix operstate for virtio when no VIRTIO_NET_F_STATUS
	arm: dts: socfpga: fix GIC PPI warning
	usb: dwc3: Update DWC_usb31 GTXFIFOSIZ reg fields
	cpufreq: cppc_cpufreq: Fix cppc_cpufreq_init() failure path
	clk: Don't show the incorrect clock phase
	zorro: Set up z->dev.dma_mask for the DMA API
	bcache: quit dc->writeback_thread when BCACHE_DEV_DETACHING is set
	ACPICA: Events: add a return on failure from acpi_hw_register_read
	ACPICA: acpi: acpica: fix acpi operand cache leak in nseval.c
	i2c: mv64xxx: Apply errata delay only in standard mode
	KVM: lapic: stop advertising DIRECTED_EOI when in-kernel IOAPIC is in use
	xhci: zero usb device slot_id member when disabling and freeing a xhci slot
	MIPS: ath79: Fix AR724X_PLL_REG_PCIE_CONFIG offset
	PCI: Restore config space on runtime resume despite being unbound
	ipmi_ssif: Fix kernel panic at msg_done_handler
	usb: dwc2: Fix interval type issue
	usb: gadget: ffs: Let setup() return USB_GADGET_DELAYED_STATUS
	usb: gadget: ffs: Execute copy_to_user() with USER_DS set
	powerpc: Add missing prototype for arch_irq_work_raise()
	ASoC: topology: create TLV data for dapm widgets
	perf/core: Fix perf_output_read_group()
	hwmon: (pmbus/max8688) Accept negative page register values
	hwmon: (pmbus/adm1275) Accept negative page register values
	cdrom: do not call check_disk_change() inside cdrom_open()
	gfs2: Fix fallocate chunk size
	usb: gadget: udc: change comparison to bitshift when dealing with a mask
	usb: gadget: composite: fix incorrect handling of OS desc requests
	x86/devicetree: Initialize device tree before using it
	x86/devicetree: Fix device IRQ settings in DT
	ALSA: vmaster: Propagate slave error
	media: cx23885: Override 888 ImpactVCBe crystal frequency
	media: cx23885: Set subdev host data to clk_freq pointer
	media: s3c-camif: fix out-of-bounds array access
	dmaengine: pl330: fix a race condition in case of threaded irqs
	media: em28xx: USB bulk packet size fix
	clk: rockchip: Prevent calculating mmc phase if clock rate is zero
	enic: enable rq before updating rq descriptors
	hwrng: stm32 - add reset during probe
	staging: rtl8192u: return -ENOMEM on failed allocation of priv->oldaddr
	rtc: tx4939: avoid unintended sign extension on a 24 bit shift
	serial: xuartps: Fix out-of-bounds access through DT alias
	serial: samsung: Fix out-of-bounds access through serial port index
	serial: mxs-auart: Fix out-of-bounds access through serial port index
	serial: imx: Fix out-of-bounds access through serial port index
	serial: fsl_lpuart: Fix out-of-bounds access through DT alias
	serial: arc_uart: Fix out-of-bounds access through DT alias
	PCI: Add function 1 DMA alias quirk for Marvell 88SE9220
	udf: Provide saner default for invalid uid / gid
	media: cx25821: prevent out-of-bounds read on array card
	clk: samsung: s3c2410: Fix PLL rates
	clk: samsung: exynos5260: Fix PLL rates
	clk: samsung: exynos5433: Fix PLL rates
	clk: samsung: exynos5250: Fix PLL rates
	clk: samsung: exynos3250: Fix PLL rates
	crypto: sunxi-ss - Add MODULE_ALIAS to sun4i-ss
	audit: return on memory error to avoid null pointer dereference
	MIPS: Octeon: Fix logging messages with spurious periods after newlines
	drm/rockchip: Respect page offset for PRIME mmap calls
	x86/apic: Set up through-local-APIC mode on the boot CPU if 'noapic' specified
	perf tests: Use arch__compare_symbol_names to compare symbols
	perf report: Fix memory corruption in --branch-history mode --branch-history
	selftests/net: fixes psock_fanout eBPF test case
	netlabel: If PF_INET6, check sk_buff ip header version
	scsi: lpfc: Fix issue_lip if link is disabled
	scsi: lpfc: Fix soft lockup in lpfc worker thread during LIP testing
	scsi: lpfc: Fix frequency of Release WQE CQEs
	regulator: of: Add a missing 'of_node_put()' in an error handling path of 'of_regulator_match()'
	ASoC: samsung: i2s: Ensure the RCLK rate is properly determined
	Bluetooth: btusb: Add device ID for RTL8822BE
	kdb: make "mdr" command repeat
	s390/ftrace: use expoline for indirect branches
	Linux 4.4.134

Change-Id: Iababaf9b89bc8d0437b95e1368d8b0a9126a178c
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2018-05-30 13:25:24 +02:00
Jan Kara
25754be3ca udf: Provide saner default for invalid uid / gid
[ Upstream commit 116e5258e4115aca0c64ac0bf40ded3b353ed626 ]

Currently when UDF filesystem is recorded without uid / gid (ids are set
to -1), we will assign INVALID_[UG]ID to vfs inode unless user uses uid=
and gid= mount options. In such case filesystem could not be modified in
any way as VFS refuses to modify files with invalid ids (even by root).
This is confusing to users and not very useful default since such media
mode is generally used for removable media. Use overflow[ug]id instead
so that at least root can modify the filesystem.

Reported-by: Steve Kenton <skenton@ou.edu>
Reviewed-by: Pali Rohár <pali.rohar@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30 07:49:15 +02:00
Andreas Gruenbacher
ad29babb08 gfs2: Fix fallocate chunk size
[ Upstream commit 174d1232ebc84fcde8f5889d1171c9c7e74a10a7 ]

The chunk size of allocations in __gfs2_fallocate is calculated
incorrectly.  The size can collapse, causing __gfs2_fallocate to
allocate one block at a time, which is very inefficient.  This needs
fixing in two places:

In gfs2_quota_lock_check, always set ap->allowed to UINT_MAX to indicate
that there is no quota limit.  This fixes callers that rely on
ap->allowed to be set even when quotas are off.

In __gfs2_fallocate, reset max_blks to UINT_MAX in each iteration of the
loop to make sure that allocation limits from one resource group won't
spill over into another resource group.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30 07:49:13 +02:00
Jeff Mahoney
93c13652e0 btrfs: fix lockdep splat in btrfs_alloc_subvolume_writers
[ Upstream commit 8a5a916d9a35e13576d79cc16e24611821b13e34 ]

While running btrfs/011, I hit the following lockdep splat.

This is the important bit:
   pcpu_alloc+0x1ac/0x5e0
   __percpu_counter_init+0x4e/0xb0
   btrfs_init_fs_root+0x99/0x1c0 [btrfs]
   btrfs_get_fs_root.part.54+0x5b/0x150 [btrfs]
   resolve_indirect_refs+0x130/0x830 [btrfs]
   find_parent_nodes+0x69e/0xff0 [btrfs]
   btrfs_find_all_roots_safe+0xa0/0x110 [btrfs]
   btrfs_find_all_roots+0x50/0x70 [btrfs]
   btrfs_qgroup_prepare_account_extents+0x53/0x90 [btrfs]
   btrfs_commit_transaction+0x3ce/0x9b0 [btrfs]

The percpu_counter_init call in btrfs_alloc_subvolume_writers
uses GFP_KERNEL, which we can't do during transaction commit.

This switches it to GFP_NOFS.

========================================================
WARNING: possible irq lock inversion dependency detected
4.12.14-kvmsmall #8 Tainted: G        W
--------------------------------------------------------
kswapd0/50 just changed the state of lock:
 (&delayed_node->mutex){+.+.-.}, at: [<ffffffffc06994fa>] __btrfs_release_delayed_node+0x3a/0x1f0 [btrfs]
but this lock took another, RECLAIM_FS-unsafe lock in the past:
 (pcpu_alloc_mutex){+.+.+.}

and interrupts could create inverse lock ordering between them.

other info that might help us debug this:
Chain exists of:
  &delayed_node->mutex --> &found->groups_sem --> pcpu_alloc_mutex

 Possible interrupt unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(pcpu_alloc_mutex);
                               local_irq_disable();
                               lock(&delayed_node->mutex);
                               lock(&found->groups_sem);
  <Interrupt>
    lock(&delayed_node->mutex);

 *** DEADLOCK ***

2 locks held by kswapd0/50:
 #0:  (shrinker_rwsem){++++..}, at: [<ffffffff811dc11f>] shrink_slab+0x7f/0x5b0
 #1:  (&type->s_umount_key#30){+++++.}, at: [<ffffffff8126dec6>] trylock_super+0x16/0x50

the shortest dependencies between 2nd lock and 1st lock:
   -> (pcpu_alloc_mutex){+.+.+.} ops: 4904 {
      HARDIRQ-ON-W at:
                          __mutex_lock+0x4e/0x8c0
                          pcpu_alloc+0x1ac/0x5e0
                          alloc_kmem_cache_cpus.isra.70+0x25/0xa0
                          __do_tune_cpucache+0x2c/0x220
                          do_tune_cpucache+0x26/0xc0
                          enable_cpucache+0x6d/0xf0
                          kmem_cache_init_late+0x42/0x75
                          start_kernel+0x343/0x4cb
                          x86_64_start_kernel+0x127/0x134
                          secondary_startup_64+0xa5/0xb0
      SOFTIRQ-ON-W at:
                          __mutex_lock+0x4e/0x8c0
                          pcpu_alloc+0x1ac/0x5e0
                          alloc_kmem_cache_cpus.isra.70+0x25/0xa0
                          __do_tune_cpucache+0x2c/0x220
                          do_tune_cpucache+0x26/0xc0
                          enable_cpucache+0x6d/0xf0
                          kmem_cache_init_late+0x42/0x75
                          start_kernel+0x343/0x4cb
                          x86_64_start_kernel+0x127/0x134
                          secondary_startup_64+0xa5/0xb0
      RECLAIM_FS-ON-W at:
                             __kmalloc+0x47/0x310
                             pcpu_extend_area_map+0x2b/0xc0
                             pcpu_alloc+0x3ec/0x5e0
                             alloc_kmem_cache_cpus.isra.70+0x25/0xa0
                             __do_tune_cpucache+0x2c/0x220
                             do_tune_cpucache+0x26/0xc0
                             enable_cpucache+0x6d/0xf0
                             __kmem_cache_create+0x1bf/0x390
                             create_cache+0xba/0x1b0
                             kmem_cache_create+0x1f8/0x2b0
                             ksm_init+0x6f/0x19d
                             do_one_initcall+0x50/0x1b0
                             kernel_init_freeable+0x201/0x289
                             kernel_init+0xa/0x100
                             ret_from_fork+0x3a/0x50
      INITIAL USE at:
                         __mutex_lock+0x4e/0x8c0
                         pcpu_alloc+0x1ac/0x5e0
                         alloc_kmem_cache_cpus.isra.70+0x25/0xa0
                         setup_cpu_cache+0x2f/0x1f0
                         __kmem_cache_create+0x1bf/0x390
                         create_boot_cache+0x8b/0xb1
                         kmem_cache_init+0xa1/0x19e
                         start_kernel+0x270/0x4cb
                         x86_64_start_kernel+0x127/0x134
                         secondary_startup_64+0xa5/0xb0
    }
    ... key      at: [<ffffffff821d8e70>] pcpu_alloc_mutex+0x70/0xa0
    ... acquired at:
   pcpu_alloc+0x1ac/0x5e0
   __percpu_counter_init+0x4e/0xb0
   btrfs_init_fs_root+0x99/0x1c0 [btrfs]
   btrfs_get_fs_root.part.54+0x5b/0x150 [btrfs]
   resolve_indirect_refs+0x130/0x830 [btrfs]
   find_parent_nodes+0x69e/0xff0 [btrfs]
   btrfs_find_all_roots_safe+0xa0/0x110 [btrfs]
   btrfs_find_all_roots+0x50/0x70 [btrfs]
   btrfs_qgroup_prepare_account_extents+0x53/0x90 [btrfs]
   btrfs_commit_transaction+0x3ce/0x9b0 [btrfs]
   transaction_kthread+0x176/0x1b0 [btrfs]
   kthread+0x102/0x140
   ret_from_fork+0x3a/0x50

  -> (&fs_info->commit_root_sem){++++..} ops: 1566382 {
     HARDIRQ-ON-W at:
                        down_write+0x3e/0xa0
                        cache_block_group+0x287/0x420 [btrfs]
                        find_free_extent+0x106c/0x12d0 [btrfs]
                        btrfs_reserve_extent+0xd8/0x170 [btrfs]
                        cow_file_range.isra.66+0x133/0x470 [btrfs]
                        run_delalloc_range+0x121/0x410 [btrfs]
                        writepage_delalloc.isra.50+0xfe/0x180 [btrfs]
                        __extent_writepage+0x19a/0x360 [btrfs]
                        extent_write_cache_pages.constprop.56+0x249/0x3e0 [btrfs]
                        extent_writepages+0x4d/0x60 [btrfs]
                        do_writepages+0x1a/0x70
                        __filemap_fdatawrite_range+0xa7/0xe0
                        btrfs_rename+0x5ee/0xdb0 [btrfs]
                        vfs_rename+0x52a/0x7e0
                        SyS_rename+0x351/0x3b0
                        do_syscall_64+0x79/0x1e0
                        entry_SYSCALL_64_after_hwframe+0x42/0xb7
     HARDIRQ-ON-R at:
                        down_read+0x35/0x90
                        caching_thread+0x57/0x560 [btrfs]
                        normal_work_helper+0x1c0/0x5e0 [btrfs]
                        process_one_work+0x1e0/0x5c0
                        worker_thread+0x44/0x390
                        kthread+0x102/0x140
                        ret_from_fork+0x3a/0x50
     SOFTIRQ-ON-W at:
                        down_write+0x3e/0xa0
                        cache_block_group+0x287/0x420 [btrfs]
                        find_free_extent+0x106c/0x12d0 [btrfs]
                        btrfs_reserve_extent+0xd8/0x170 [btrfs]
                        cow_file_range.isra.66+0x133/0x470 [btrfs]
                        run_delalloc_range+0x121/0x410 [btrfs]
                        writepage_delalloc.isra.50+0xfe/0x180 [btrfs]
                        __extent_writepage+0x19a/0x360 [btrfs]
                        extent_write_cache_pages.constprop.56+0x249/0x3e0 [btrfs]
                        extent_writepages+0x4d/0x60 [btrfs]
                        do_writepages+0x1a/0x70
                        __filemap_fdatawrite_range+0xa7/0xe0
                        btrfs_rename+0x5ee/0xdb0 [btrfs]
                        vfs_rename+0x52a/0x7e0
                        SyS_rename+0x351/0x3b0
                        do_syscall_64+0x79/0x1e0
                        entry_SYSCALL_64_after_hwframe+0x42/0xb7
     SOFTIRQ-ON-R at:
                        down_read+0x35/0x90
                        caching_thread+0x57/0x560 [btrfs]
                        normal_work_helper+0x1c0/0x5e0 [btrfs]
                        process_one_work+0x1e0/0x5c0
                        worker_thread+0x44/0x390
                        kthread+0x102/0x140
                        ret_from_fork+0x3a/0x50
     INITIAL USE at:
                       down_write+0x3e/0xa0
                       cache_block_group+0x287/0x420 [btrfs]
                       find_free_extent+0x106c/0x12d0 [btrfs]
                       btrfs_reserve_extent+0xd8/0x170 [btrfs]
                       cow_file_range.isra.66+0x133/0x470 [btrfs]
                       run_delalloc_range+0x121/0x410 [btrfs]
                       writepage_delalloc.isra.50+0xfe/0x180 [btrfs]
                       __extent_writepage+0x19a/0x360 [btrfs]
                       extent_write_cache_pages.constprop.56+0x249/0x3e0 [btrfs]
                       extent_writepages+0x4d/0x60 [btrfs]
                       do_writepages+0x1a/0x70
                       __filemap_fdatawrite_range+0xa7/0xe0
                       btrfs_rename+0x5ee/0xdb0 [btrfs]
                       vfs_rename+0x52a/0x7e0
                       SyS_rename+0x351/0x3b0
                       do_syscall_64+0x79/0x1e0
                       entry_SYSCALL_64_after_hwframe+0x42/0xb7
   }
   ... key      at: [<ffffffffc0729578>] __key.61970+0x0/0xfffffffffff9aa88 [btrfs]
   ... acquired at:
   cache_block_group+0x287/0x420 [btrfs]
   find_free_extent+0x106c/0x12d0 [btrfs]
   btrfs_reserve_extent+0xd8/0x170 [btrfs]
   btrfs_alloc_tree_block+0x12f/0x4c0 [btrfs]
   btrfs_create_tree+0xbb/0x2a0 [btrfs]
   btrfs_create_uuid_tree+0x37/0x140 [btrfs]
   open_ctree+0x23c0/0x2660 [btrfs]
   btrfs_mount+0xd36/0xf90 [btrfs]
   mount_fs+0x3a/0x160
   vfs_kern_mount+0x66/0x150
   btrfs_mount+0x18c/0xf90 [btrfs]
   mount_fs+0x3a/0x160
   vfs_kern_mount+0x66/0x150
   do_mount+0x1c1/0xcc0
   SyS_mount+0x7e/0xd0
   do_syscall_64+0x79/0x1e0
   entry_SYSCALL_64_after_hwframe+0x42/0xb7

 -> (&found->groups_sem){++++..} ops: 2134587 {
    HARDIRQ-ON-W at:
                      down_write+0x3e/0xa0
                      __link_block_group+0x34/0x130 [btrfs]
                      btrfs_read_block_groups+0x33d/0x7b0 [btrfs]
                      open_ctree+0x2054/0x2660 [btrfs]
                      btrfs_mount+0xd36/0xf90 [btrfs]
                      mount_fs+0x3a/0x160
                      vfs_kern_mount+0x66/0x150
                      btrfs_mount+0x18c/0xf90 [btrfs]
                      mount_fs+0x3a/0x160
                      vfs_kern_mount+0x66/0x150
                      do_mount+0x1c1/0xcc0
                      SyS_mount+0x7e/0xd0
                      do_syscall_64+0x79/0x1e0
                      entry_SYSCALL_64_after_hwframe+0x42/0xb7
    HARDIRQ-ON-R at:
                      down_read+0x35/0x90
                      btrfs_calc_num_tolerated_disk_barrier_failures+0x113/0x1f0 [btrfs]
                      open_ctree+0x207b/0x2660 [btrfs]
                      btrfs_mount+0xd36/0xf90 [btrfs]
                      mount_fs+0x3a/0x160
                      vfs_kern_mount+0x66/0x150
                      btrfs_mount+0x18c/0xf90 [btrfs]
                      mount_fs+0x3a/0x160
                      vfs_kern_mount+0x66/0x150
                      do_mount+0x1c1/0xcc0
                      SyS_mount+0x7e/0xd0
                      do_syscall_64+0x79/0x1e0
                      entry_SYSCALL_64_after_hwframe+0x42/0xb7
    SOFTIRQ-ON-W at:
                      down_write+0x3e/0xa0
                      __link_block_group+0x34/0x130 [btrfs]
                      btrfs_read_block_groups+0x33d/0x7b0 [btrfs]
                      open_ctree+0x2054/0x2660 [btrfs]
                      btrfs_mount+0xd36/0xf90 [btrfs]
                      mount_fs+0x3a/0x160
                      vfs_kern_mount+0x66/0x150
                      btrfs_mount+0x18c/0xf90 [btrfs]
                      mount_fs+0x3a/0x160
                      vfs_kern_mount+0x66/0x150
                      do_mount+0x1c1/0xcc0
                      SyS_mount+0x7e/0xd0
                      do_syscall_64+0x79/0x1e0
                      entry_SYSCALL_64_after_hwframe+0x42/0xb7
    SOFTIRQ-ON-R at:
                      down_read+0x35/0x90
                      btrfs_calc_num_tolerated_disk_barrier_failures+0x113/0x1f0 [btrfs]
                      open_ctree+0x207b/0x2660 [btrfs]
                      btrfs_mount+0xd36/0xf90 [btrfs]
                      mount_fs+0x3a/0x160
                      vfs_kern_mount+0x66/0x150
                      btrfs_mount+0x18c/0xf90 [btrfs]
                      mount_fs+0x3a/0x160
                      vfs_kern_mount+0x66/0x150
                      do_mount+0x1c1/0xcc0
                      SyS_mount+0x7e/0xd0
                      do_syscall_64+0x79/0x1e0
                      entry_SYSCALL_64_after_hwframe+0x42/0xb7
    INITIAL USE at:
                     down_write+0x3e/0xa0
                     __link_block_group+0x34/0x130 [btrfs]
                     btrfs_read_block_groups+0x33d/0x7b0 [btrfs]
                     open_ctree+0x2054/0x2660 [btrfs]
                     btrfs_mount+0xd36/0xf90 [btrfs]
                     mount_fs+0x3a/0x160
                     vfs_kern_mount+0x66/0x150
                     btrfs_mount+0x18c/0xf90 [btrfs]
                     mount_fs+0x3a/0x160
                     vfs_kern_mount+0x66/0x150
                     do_mount+0x1c1/0xcc0
                     SyS_mount+0x7e/0xd0
                     do_syscall_64+0x79/0x1e0
                     entry_SYSCALL_64_after_hwframe+0x42/0xb7
  }
  ... key      at: [<ffffffffc0729488>] __key.59101+0x0/0xfffffffffff9ab78 [btrfs]
  ... acquired at:
   find_free_extent+0xcb4/0x12d0 [btrfs]
   btrfs_reserve_extent+0xd8/0x170 [btrfs]
   btrfs_alloc_tree_block+0x12f/0x4c0 [btrfs]
   __btrfs_cow_block+0x110/0x5b0 [btrfs]
   btrfs_cow_block+0xd7/0x290 [btrfs]
   btrfs_search_slot+0x1f6/0x960 [btrfs]
   btrfs_lookup_inode+0x2a/0x90 [btrfs]
   __btrfs_update_delayed_inode+0x65/0x210 [btrfs]
   btrfs_commit_inode_delayed_inode+0x121/0x130 [btrfs]
   btrfs_evict_inode+0x3fe/0x6a0 [btrfs]
   evict+0xc4/0x190
   __dentry_kill+0xbf/0x170
   dput+0x2ae/0x2f0
   SyS_rename+0x2a6/0x3b0
   do_syscall_64+0x79/0x1e0
   entry_SYSCALL_64_after_hwframe+0x42/0xb7

-> (&delayed_node->mutex){+.+.-.} ops: 5580204 {
   HARDIRQ-ON-W at:
                    __mutex_lock+0x4e/0x8c0
                    btrfs_delayed_update_inode+0x46/0x6e0 [btrfs]
                    btrfs_update_inode+0x83/0x110 [btrfs]
                    btrfs_dirty_inode+0x62/0xe0 [btrfs]
                    touch_atime+0x8c/0xb0
                    do_generic_file_read+0x818/0xb10
                    __vfs_read+0xdc/0x150
                    vfs_read+0x8a/0x130
                    SyS_read+0x45/0xa0
                    do_syscall_64+0x79/0x1e0
                    entry_SYSCALL_64_after_hwframe+0x42/0xb7
   SOFTIRQ-ON-W at:
                    __mutex_lock+0x4e/0x8c0
                    btrfs_delayed_update_inode+0x46/0x6e0 [btrfs]
                    btrfs_update_inode+0x83/0x110 [btrfs]
                    btrfs_dirty_inode+0x62/0xe0 [btrfs]
                    touch_atime+0x8c/0xb0
                    do_generic_file_read+0x818/0xb10
                    __vfs_read+0xdc/0x150
                    vfs_read+0x8a/0x130
                    SyS_read+0x45/0xa0
                    do_syscall_64+0x79/0x1e0
                    entry_SYSCALL_64_after_hwframe+0x42/0xb7
   IN-RECLAIM_FS-W at:
                       __mutex_lock+0x4e/0x8c0
                       __btrfs_release_delayed_node+0x3a/0x1f0 [btrfs]
                       btrfs_evict_inode+0x22c/0x6a0 [btrfs]
                       evict+0xc4/0x190
                       dispose_list+0x35/0x50
                       prune_icache_sb+0x42/0x50
                       super_cache_scan+0x139/0x190
                       shrink_slab+0x262/0x5b0
                       shrink_node+0x2eb/0x2f0
                       kswapd+0x2eb/0x890
                       kthread+0x102/0x140
                       ret_from_fork+0x3a/0x50
   INITIAL USE at:
                   __mutex_lock+0x4e/0x8c0
                   btrfs_delayed_update_inode+0x46/0x6e0 [btrfs]
                   btrfs_update_inode+0x83/0x110 [btrfs]
                   btrfs_dirty_inode+0x62/0xe0 [btrfs]
                   touch_atime+0x8c/0xb0
                   do_generic_file_read+0x818/0xb10
                   __vfs_read+0xdc/0x150
                   vfs_read+0x8a/0x130
                   SyS_read+0x45/0xa0
                   do_syscall_64+0x79/0x1e0
                   entry_SYSCALL_64_after_hwframe+0x42/0xb7
 }
 ... key      at: [<ffffffffc072d488>] __key.56935+0x0/0xfffffffffff96b78 [btrfs]
 ... acquired at:
   __lock_acquire+0x264/0x11c0
   lock_acquire+0xbd/0x1e0
   __mutex_lock+0x4e/0x8c0
   __btrfs_release_delayed_node+0x3a/0x1f0 [btrfs]
   btrfs_evict_inode+0x22c/0x6a0 [btrfs]
   evict+0xc4/0x190
   dispose_list+0x35/0x50
   prune_icache_sb+0x42/0x50
   super_cache_scan+0x139/0x190
   shrink_slab+0x262/0x5b0
   shrink_node+0x2eb/0x2f0
   kswapd+0x2eb/0x890
   kthread+0x102/0x140
   ret_from_fork+0x3a/0x50

stack backtrace:
CPU: 1 PID: 50 Comm: kswapd0 Tainted: G        W        4.12.14-kvmsmall #8 SLE15 (unreleased)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
Call Trace:
 dump_stack+0x78/0xb7
 print_irq_inversion_bug.part.38+0x19f/0x1aa
 check_usage_forwards+0x102/0x120
 ? ret_from_fork+0x3a/0x50
 ? check_usage_backwards+0x110/0x110
 mark_lock+0x16c/0x270
 __lock_acquire+0x264/0x11c0
 ? pagevec_lookup_entries+0x1a/0x30
 ? truncate_inode_pages_range+0x2b3/0x7f0
 lock_acquire+0xbd/0x1e0
 ? __btrfs_release_delayed_node+0x3a/0x1f0 [btrfs]
 __mutex_lock+0x4e/0x8c0
 ? __btrfs_release_delayed_node+0x3a/0x1f0 [btrfs]
 ? __btrfs_release_delayed_node+0x3a/0x1f0 [btrfs]
 ? btrfs_evict_inode+0x1f6/0x6a0 [btrfs]
 __btrfs_release_delayed_node+0x3a/0x1f0 [btrfs]
 btrfs_evict_inode+0x22c/0x6a0 [btrfs]
 evict+0xc4/0x190
 dispose_list+0x35/0x50
 prune_icache_sb+0x42/0x50
 super_cache_scan+0x139/0x190
 shrink_slab+0x262/0x5b0
 shrink_node+0x2eb/0x2f0
 kswapd+0x2eb/0x890
 kthread+0x102/0x140
 ? mem_cgroup_shrink_node+0x2c0/0x2c0
 ? kthread_create_on_node+0x40/0x40
 ret_from_fork+0x3a/0x50

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30 07:49:09 +02:00
Filipe Manana
b037a568e3 Btrfs: fix copy_items() return value when logging an inode
[ Upstream commit 8434ec46c6e3232cebc25a910363b29f5c617820 ]

When logging an inode, at tree-log.c:copy_items(), if we call
btrfs_next_leaf() at the loop which checks for the need to log holes, we
need to make sure copy_items() returns the value 1 to its caller and
not 0 (on success). This is because the path the caller passed was
released and is now different from what is was before, and the caller
expects a return value of 0 to mean both success and that the path
has not changed, while a return value of 1 means both success and
signals the caller that it can not reuse the path, it has to perform
another tree search.

Even though this is a case that should not be triggered on normal
circumstances or very rare at least, its consequences can be very
unpredictable (especially when replaying a log tree).

Fixes: 16e7549f04 ("Btrfs: incompatible format change to remove hole extents")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30 07:49:09 +02:00
Qu Wenruo
a77b0cfe80 btrfs: tests/qgroup: Fix wrong tree backref level
[ Upstream commit 3c0efdf03b2d127f0e40e30db4e7aa0429b1b79a ]

The extent tree of the test fs is like the following:

 BTRFS info (device (null)): leaf 16327509003777336587 total ptrs 1 free space 3919
  item 0 key (4096 168 4096) itemoff 3944 itemsize 51
          extent refs 1 gen 1 flags 2
          tree block key (68719476736 0 0) level 1
                                           ^^^^^^^
          ref#0: tree block backref root 5

And it's using an empty tree for fs tree, so there is no way that its
level can be 1.

For REAL (created by mkfs) fs tree backref with no skinny metadata, the
result should look like:

 item 3 key (30408704 EXTENT_ITEM 4096) itemoff 3845 itemsize 51
         refs 1 gen 4 flags TREE_BLOCK
         tree block key (256 INODE_ITEM 0) level 0
                                           ^^^^^^^
         tree block backref root 5

Fix the level to 0, so it won't break later tree level checker.

Fixes: faa2dbf004 ("Btrfs: add sanity tests for new qgroup accounting code")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30 07:49:09 +02:00
David Howells
61b4c050ad fscache: Fix hanging wait on page discarded by writeback
[ Upstream commit 2c98425720233ae3e135add0c7e869b32913502f ]

If the fscache asynchronous write operation elects to discard a page that's
pending storage to the cache because the page would be over the store limit
then it needs to wake the page as someone may be waiting on completion of
the write.

The problem is that the store limit may be updated by a different
asynchronous operation - and so may miss the write - and that the store
limit may not even get updated until later by the netfs.

Fix the kernel hang by making fscache_write_op() mark as written any pages
that are over the limit.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30 07:49:08 +02:00
Jun Piao
6b6933fdb1 ocfs2/dlm: don't handle migrate lockres if already in shutdown
[ Upstream commit bb34f24c7d2c98d0c81838a7700e6068325b17a0 ]

We should not handle migrate lockres if we are already in
'DLM_CTXT_IN_SHUTDOWN', as that will cause lockres remains after leaving
dlm domain.  At last other nodes will get stuck into infinite loop when
requsting lock from us.

The problem is caused by concurrency umount between nodes.  Before
receiveing N1's DLM_BEGIN_EXIT_DOMAIN_MSG, N2 has picked up N1 as the
migrate target.  So N2 will continue sending lockres to N1 even though
N1 has left domain.

        N1                             N2 (owner)
                                       touch file

    access the file,
    and get pr lock

                                       begin leave domain and
                                       pick up N1 as new owner

    begin leave domain and
    migrate all lockres done

                                       begin migrate lockres to N1

    end leave domain, but
    the lockres left
    unexpectedly, because
    migrate task has passed

[piaojun@huawei.com: v3]
  Link: http://lkml.kernel.org/r/5A9CBD19.5020107@huawei.com
Link: http://lkml.kernel.org/r/5A99F028.2090902@huawei.com
Signed-off-by: Jun Piao <piaojun@huawei.com>
Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com>
Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
Reviewed-by: Changwei Ge <ge.changwei@h3c.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30 07:49:08 +02:00
Nikolay Borisov
e1ccd597ff btrfs: Fix possible softlock on single core machines
[ Upstream commit 1e1c50a929bc9e49bc3f9935b92450d9e69f8158 ]

do_chunk_alloc implements a loop checking whether there is a pending
chunk allocation and if so causes the caller do loop. Generally this
loop is executed only once, however testing with btrfs/072 on a single
core vm machines uncovered an extreme case where the system could loop
indefinitely. This is due to a missing cond_resched when loop which
doesn't give a chance to the previous chunk allocator finish its job.

The fix is to simply add the missing cond_resched.

Fixes: 6d74119f1a ("Btrfs: avoid taking the chunk_mutex in do_chunk_alloc")
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30 07:49:08 +02:00
Liu Bo
b487d1cd2c Btrfs: fix NULL pointer dereference in log_dir_items
[ Upstream commit 80c0b4210a963e31529e15bf90519708ec947596 ]

0, 1 and <0 can be returned by btrfs_next_leaf(), and when <0 is
returned, path->nodes[0] could be NULL, log_dir_items lacks such a
check for <0 and we may run into a null pointer dereference panic.

Fixes: e02119d5a7 ("Btrfs: Add a write ahead tree log to optimize synchronous operations")
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30 07:49:08 +02:00
Liu Bo
0c3968cc74 Btrfs: bail out on error during replay_dir_deletes
[ Upstream commit b98def7ca6e152ee55e36863dddf6f41f12d1dc6 ]

If errors were returned by btrfs_next_leaf(), replay_dir_deletes needs
to bail out, otherwise @ret would be forced to be 0 after 'break;' and
the caller won't be aware of it.

Fixes: e02119d5a7 ("Btrfs: Add a write ahead tree log to optimize synchronous operations")
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30 07:49:08 +02:00
Carlos Maiolino
72422ef801 Force log to disk before reading the AGF during a fstrim
[ Upstream commit 8c81dd46ef3c416b3b95e3020fb90dbd44e6140b ]

Forcing the log to disk after reading the agf is wrong, we might be
calling xfs_log_force with XFS_LOG_SYNC with a metadata lock held.

This can cause a deadlock when racing a fstrim with a filesystem
shutdown.

The deadlock has been identified due a miscalculation bug in device-mapper
dm-thin, which returns lack of space to its users earlier than the device itself
really runs out of space, changing the device-mapper volume into an error state.

The problem happened while filling the filesystem with a single file,
triggering the bug in device-mapper, consequently causing an IO error
and shutting down the filesystem.

If such file is removed, and fstrim executed before the XFS finishes the
shut down process, the fstrim process will end up holding the buffer
lock, and going to sleep on the cil wait queue.

At this point, the shut down process will try to wake up all the threads
waiting on the cil wait queue, but for this, it will try to hold the
same buffer log already held my the fstrim, locking up the filesystem.

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30 07:49:07 +02:00
Danilo Krummrich
ea8a1f4a9a fs/proc/proc_sysctl.c: fix potential page fault while unregistering sysctl table
[ Upstream commit a0b0d1c345d0317efe594df268feb5ccc99f651e ]

proc_sys_link_fill_cache() does not take currently unregistering sysctl
tables into account, which might result into a page fault in
sysctl_follow_link() - add a check to fix it.

This bug has been present since v3.4.

Link: http://lkml.kernel.org/r/20180228013506.4915-1-danilokrummrich@dk-develop.de
Fixes: 0e47c99d7f ("sysctl: Replace root_list with links between sysctl_table_sets")
Signed-off-by: Danilo Krummrich <danilokrummrich@dk-develop.de>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: "Luis R . Rodriguez" <mcgrof@kernel.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30 07:49:07 +02:00
Filipe Manana
8ee1d4d281 Btrfs: send, fix issuing write op when processing hole in no data mode
[ Upstream commit d4dfc0f4d39475ccbbac947880b5464a74c30b99 ]

When doing an incremental send of a filesystem with the no-holes feature
enabled, we end up issuing a write operation when using the no data mode
send flag, instead of issuing an update extent operation. Fix this by
issuing the update extent operation instead.

Trivial reproducer:

  $ mkfs.btrfs -f -O no-holes /dev/sdc
  $ mkfs.btrfs -f /dev/sdd
  $ mount /dev/sdc /mnt/sdc
  $ mount /dev/sdd /mnt/sdd

  $ xfs_io -f -c "pwrite -S 0xab 0 32K" /mnt/sdc/foobar
  $ btrfs subvolume snapshot -r /mnt/sdc /mnt/sdc/snap1

  $ xfs_io -c "fpunch 8K 8K" /mnt/sdc/foobar
  $ btrfs subvolume snapshot -r /mnt/sdc /mnt/sdc/snap2

  $ btrfs send /mnt/sdc/snap1 | btrfs receive /mnt/sdd
  $ btrfs send --no-data -p /mnt/sdc/snap1 /mnt/sdc/snap2 \
       | btrfs receive -vv /mnt/sdd

Before this change the output of the second receive command is:

  receiving snapshot snap2 uuid=f6922049-8c22-e544-9ff9-fc6755918447...
  utimes
  write foobar, offset 8192, len 8192
  utimes foobar
  BTRFS_IOC_SET_RECEIVED_SUBVOL uuid=f6922049-8c22-e544-9ff9-...

After this change it is:

  receiving snapshot snap2 uuid=564d36a3-ebc8-7343-aec9-bf6fda278e64...
  utimes
  update_extent foobar: offset=8192, len=8192
  utimes foobar
  BTRFS_IOC_SET_RECEIVED_SUBVOL uuid=564d36a3-ebc8-7343-aec9-bf6fda278e64...

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30 07:49:03 +02:00
Arnd Bergmann
9480368828 cifs: silence compiler warnings showing up with gcc-8.0.0
[ Upstream commit ade7db991b47ab3016a414468164f4966bd08202 ]

This bug was fixed before, but came up again with the latest
compiler in another function:

fs/cifs/cifssmb.c: In function 'CIFSSMBSetEA':
fs/cifs/cifssmb.c:6362:3: error: 'strncpy' offset 8 is out of the bounds [0, 4] [-Werror=array-bounds]
   strncpy(parm_data->list[0].name, ea_name, name_len);

Let's apply the same fix that was used for the other instances.

Fixes: b2a3ad9ca5 ("cifs: silence compiler warnings showing up with gcc-4.7.0")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Steve French <smfrench@gmail.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30 07:48:56 +02:00
Alexey Dobriyan
bccf7f7c30 proc: fix /proc/*/map_files lookup
[ Upstream commit ac7f1061c2c11bb8936b1b6a94cdb48de732f7a4 ]

Current code does:

	if (sscanf(dentry->d_name.name, "%lx-%lx", start, end) != 2)

However sscanf() is broken garbage.

It silently accepts whitespace between format specifiers
(did you know that?).

It silently accepts valid strings which result in integer overflow.

Do not use sscanf() for any even remotely reliable parsing code.

	OK
	# readlink '/proc/1/map_files/55a23af39000-55a23b05b000'
	/lib/systemd/systemd

	broken
	# readlink '/proc/1/map_files/               55a23af39000-55a23b05b000'
	/lib/systemd/systemd

	broken
	# readlink '/proc/1/map_files/55a23af39000-55a23b05b000    '
	/lib/systemd/systemd

	very broken
	# readlink '/proc/1/map_files/1000000000000000055a23af39000-55a23b05b000'
	/lib/systemd/systemd

Andrei said:

: This patch breaks criu.  It was a bug in criu.  And this bug is on a minor
: path, which works when memfd_create() isn't available.  It is a reason why
: I ask to not backport this patch to stable kernels.
:
: In CRIU this bug can be triggered, only if this patch will be backported
: to a kernel which version is lower than v3.16.

Link: http://lkml.kernel.org/r/20171120212706.GA14325@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30 07:48:56 +02:00
piaojun
d278dc22cc ocfs2: return error when we attempt to access a dirty bh in jbd2
[ Upstream commit d984187e3a1ad7d12447a7ab2c43ce3717a2b5b3 ]

We should not reuse the dirty bh in jbd2 directly due to the following
situation:

1. When removing extent rec, we will dirty the bhs of extent rec and
   truncate log at the same time, and hand them over to jbd2.

2. The bhs are submitted to jbd2 area successfully.

3. The write-back thread of device help flush the bhs to disk but
   encounter write error due to abnormal storage link.

4. After a while the storage link become normal. Truncate log flush
   worker triggered by the next space reclaiming found the dirty bh of
   truncate log and clear its 'BH_Write_EIO' and then set it uptodate in
   __ocfs2_journal_access():

   ocfs2_truncate_log_worker
     ocfs2_flush_truncate_log
       __ocfs2_flush_truncate_log
         ocfs2_replay_truncate_records
           ocfs2_journal_access_di
             __ocfs2_journal_access // here we clear io_error and set 'tl_bh' uptodata.

5. Then jbd2 will flush the bh of truncate log to disk, but the bh of
   extent rec is still in error state, and unfortunately nobody will
   take care of it.

6. At last the space of extent rec was not reduced, but truncate log
   flush worker have given it back to globalalloc. That will cause
   duplicate cluster problem which could be identified by fsck.ocfs2.

Sadly we can hardly revert this but set fs read-only in case of ruining
atomicity and consistency of space reclaim.

Link: http://lkml.kernel.org/r/5A6E8092.8090701@huawei.com
Fixes: acf8fdbe6a ("ocfs2: do not BUG if buffer not uptodate in __ocfs2_journal_access")
Signed-off-by: Jun Piao <piaojun@huawei.com>
Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com>
Reviewed-by: Changwei Ge <ge.changwei@h3c.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30 07:48:55 +02:00
piaojun
b35bb8a417 ocfs2/acl: use 'ip_xattr_sem' to protect getting extended attribute
[ Upstream commit 16c8d569f5704a84164f30ff01b29879f3438065 ]

The race between *set_acl and *get_acl will cause getting incomplete
xattr data as below:

  processA                                    processB

  ocfs2_set_acl
    ocfs2_xattr_set
      __ocfs2_xattr_set_handle

                                              ocfs2_get_acl_nolock
                                                ocfs2_xattr_get_nolock:

processB may get incomplete xattr data if processA hasn't set_acl done.

So we should use 'ip_xattr_sem' to protect getting extended attribute in
ocfs2_get_acl_nolock(), as other processes could be changing it
concurrently.

Link: http://lkml.kernel.org/r/5A5DDCFF.7030001@huawei.com
Signed-off-by: Jun Piao <piaojun@huawei.com>
Reviewed-by: Alex Chen <alex.chen@huawei.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <ge.changwei@h3c.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30 07:48:55 +02:00
piaojun
d55cd57428 ocfs2: return -EROFS to mount.ocfs2 if inode block is invalid
[ Upstream commit 025bcbde3634b2c9b316f227fed13ad6ad6817fb ]

If metadata is corrupted such as 'invalid inode block', we will get
failed by calling 'mount()' and then set filesystem readonly as below:

  ocfs2_mount
    ocfs2_initialize_super
      ocfs2_init_global_system_inodes
        ocfs2_iget
          ocfs2_read_locked_inode
            ocfs2_validate_inode_block
	      ocfs2_error
	        ocfs2_handle_error
	          ocfs2_set_ro_flag(osb, 0);  // set readonly

In this situation we need return -EROFS to 'mount.ocfs2', so that user
can fix it by fsck.  And then mount again.  In addition, 'mount.ocfs2'
should be updated correspondingly as it only return 1 for all errno.
And I will post a patch for 'mount.ocfs2' too.

Link: http://lkml.kernel.org/r/5A4302FA.2010606@huawei.com
Signed-off-by: Jun Piao <piaojun@huawei.com>
Reviewed-by: Alex Chen <alex.chen@huawei.com>
Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
Reviewed-by: Changwei Ge <ge.changwei@h3c.com>
Reviewed-by: Gang He <ghe@suse.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30 07:48:55 +02:00
Jake Daryll Obina
b222aa1cbd jffs2: Fix use-after-free bug in jffs2_iget()'s error handling path
[ Upstream commit 5bdd0c6f89fba430e18d636493398389dadc3b17 ]

If jffs2_iget() fails for a newly-allocated inode, jffs2_do_clear_inode()
can get called twice in the error handling path, the first call in
jffs2_iget() itself and the second through iget_failed(). This can result
to a use-after-free error in the second jffs2_do_clear_inode() call, such
as shown by the oops below wherein the second jffs2_do_clear_inode() call
was trying to free node fragments that were already freed in the first
jffs2_do_clear_inode() call.

[   78.178860] jffs2: error: (1904) jffs2_do_read_inode_internal: CRC failed for read_inode of inode 24 at physical location 0x1fc00c
[   78.178914] Unable to handle kernel paging request at virtual address 6b6b6b6b6b6b6b7b
[   78.185871] pgd = ffffffc03a567000
[   78.188794] [6b6b6b6b6b6b6b7b] *pgd=0000000000000000, *pud=0000000000000000
[   78.194968] Internal error: Oops: 96000004 [#1] PREEMPT SMP
...
[   78.513147] PC is at rb_first_postorder+0xc/0x28
[   78.516503] LR is at jffs2_kill_fragtree+0x28/0x90 [jffs2]
[   78.520672] pc : [<ffffff8008323d28>] lr : [<ffffff8000eb1cc8>] pstate: 60000105
[   78.526757] sp : ffffff800cea38f0
[   78.528753] x29: ffffff800cea38f0 x28: ffffffc01f3f8e80
[   78.532754] x27: 0000000000000000 x26: ffffff800cea3c70
[   78.536756] x25: 00000000dc67c8ae x24: ffffffc033d6945d
[   78.540759] x23: ffffffc036811740 x22: ffffff800891a5b8
[   78.544760] x21: 0000000000000000 x20: 0000000000000000
[   78.548762] x19: ffffffc037d48910 x18: ffffff800891a588
[   78.552764] x17: 0000000000000800 x16: 0000000000000c00
[   78.556766] x15: 0000000000000010 x14: 6f2065646f6e695f
[   78.560767] x13: 6461657220726f66 x12: 2064656c69616620
[   78.564769] x11: 435243203a6c616e x10: 7265746e695f6564
[   78.568771] x9 : 6f6e695f64616572 x8 : ffffffc037974038
[   78.572774] x7 : bbbbbbbbbbbbbbbb x6 : 0000000000000008
[   78.576775] x5 : 002f91d85bd44a2f x4 : 0000000000000000
[   78.580777] x3 : 0000000000000000 x2 : 000000403755e000
[   78.584779] x1 : 6b6b6b6b6b6b6b6b x0 : 6b6b6b6b6b6b6b6b
...
[   79.038551] [<ffffff8008323d28>] rb_first_postorder+0xc/0x28
[   79.042962] [<ffffff8000eb5578>] jffs2_do_clear_inode+0x88/0x100 [jffs2]
[   79.048395] [<ffffff8000eb9ddc>] jffs2_evict_inode+0x3c/0x48 [jffs2]
[   79.053443] [<ffffff8008201ca8>] evict+0xb0/0x168
[   79.056835] [<ffffff8008202650>] iput+0x1c0/0x200
[   79.060228] [<ffffff800820408c>] iget_failed+0x30/0x3c
[   79.064097] [<ffffff8000eba0c0>] jffs2_iget+0x2d8/0x360 [jffs2]
[   79.068740] [<ffffff8000eb0a60>] jffs2_lookup+0xe8/0x130 [jffs2]
[   79.073434] [<ffffff80081f1a28>] lookup_slow+0x118/0x190
[   79.077435] [<ffffff80081f4708>] walk_component+0xfc/0x28c
[   79.081610] [<ffffff80081f4dd0>] path_lookupat+0x84/0x108
[   79.085699] [<ffffff80081f5578>] filename_lookup+0x88/0x100
[   79.089960] [<ffffff80081f572c>] user_path_at_empty+0x58/0x6c
[   79.094396] [<ffffff80081ebe14>] vfs_statx+0xa4/0x114
[   79.098138] [<ffffff80081ec44c>] SyS_newfstatat+0x58/0x98
[   79.102227] [<ffffff800808354c>] __sys_trace_return+0x0/0x4
[   79.106489] Code: d65f03c0 f9400001 b40000e1 aa0103e0 (f9400821)

The jffs2_do_clear_inode() call in jffs2_iget() is unnecessary since
iget_failed() will eventually call jffs2_do_clear_inode() if needed, so
just remove it.

Fixes: 5451f79f5f ("iget: stop JFFS2 from using iget() and read_inode()")
Reviewed-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Jake Daryll Obina <jake.obina@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30 07:48:54 +02:00
Liu Bo
95b286daf7 Btrfs: fix scrub to repair raid6 corruption
[ Upstream commit 762221f095e3932669093466aaf4b85ed9ad2ac1 ]

The raid6 corruption is that,
suppose that all disks can be read without problems and if the content
that was read out doesn't match its checksum, currently for raid6
btrfs at most retries twice,

- the 1st retry is to rebuild with all other stripes, it'll eventually
  be a raid5 xor rebuild,
- if the 1st fails, the 2nd retry will deliberately fail parity p so
  that it will do raid6 style rebuild,

however, the chances are that another non-parity stripe content also
has something corrupted, so that the above retries are not able to
return correct content.

We've fixed normal reads to rebuild raid6 correctly with more retries
in Patch "Btrfs: make raid6 rebuild retry more"[1], this is to fix
scrub to do the exactly same rebuild process.

[1]: https://patchwork.kernel.org/patch/10091755/

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30 07:48:54 +02:00
Nikolay Borisov
72af2c8c7c btrfs: Fix out of bounds access in btrfs_search_slot
[ Upstream commit 9ea2c7c9da13c9073e371c046cbbc45481ecb459 ]

When modifying a tree where the root is at BTRFS_MAX_LEVEL - 1 then
the level variable is going to be 7 (this is the max height of the
tree). On the other hand btrfs_cow_block is always called with
"level + 1" as an index into the nodes and slots arrays. This leads to
an out of bounds access. Admittdely this will be benign since an OOB
access of the nodes array will likely read the 0th element from the
slots array, which in this case is going to be 0 (since we start CoW at
the top of the tree). The OOB access into the slots array in turn will
read the 0th and 1st values of the locks array, which would both be 0
at the time. However, this benign behavior relies on the fact that the
path being passed hasn't been initialised, if it has already been used to
query a btree then it could potentially have populated the nodes/slots arrays.

Fix it by explicitly checking if we are at level 7 (the maximum allowed
index in nodes/slots arrays) and explicitly call the CoW routine with
NULL for parent's node/slot.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Fixes-coverity-id: 711515
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30 07:48:54 +02:00
Liu Bo
51f12a06dd Btrfs: set plug for fsync
[ Upstream commit 343e4fc1c60971b0734de26dbbd475d433950982 ]

Setting plug can merge adjacent IOs before dispatching IOs to the disk
driver.

Without plug, it'd not be a problem for single disk usecases, but for
multiple disks using raid profile, a large IO can be split to several
IOs of stripe length, and plug can be helpful to bring them together
for each disk so that we can save several disk access.

Moreover, fsync issues synchronous writes, so plug can really take
effect.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30 07:48:54 +02:00
Jan Chochol
929d451024 nfs: Do not convert nfs_idmap_cache_timeout to jiffies
[ Upstream commit cbebc6ef4fc830f4040d4140bf53484812d5d5d9 ]

Since commit 57e62324e4 ("NFS: Store the legacy idmapper result in the
keyring") nfs_idmap_cache_timeout changed units from jiffies to seconds.
Unfortunately sysctl interface was not updated accordingly.

As a effect updating /proc/sys/fs/nfs/idmap_cache_timeout with some
value will incorrectly multiply this value by HZ.
Also reading /proc/sys/fs/nfs/idmap_cache_timeout will show real value
divided by HZ.

Fixes: 57e62324e4 ("NFS: Store the legacy idmapper result in the keyring")
Signed-off-by: Jan Chochol <jan@chochol.info>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30 07:48:53 +02:00
NeilBrown
608848f0a6 NFSv4: always set NFS_LOCK_LOST when a lock is lost.
[ Upstream commit dce2630c7da73b0634686bca557cc8945cc450c8 ]

There are 2 comments in the NFSv4 code which suggest that
SIGLOST should possibly be sent to a process.  In these
cases a lock has been lost.
The current practice is to set NFS_LOCK_LOST so that
read/write returns EIO when a lock is lost.
So change these comments to code when sets NFS_LOCK_LOST.

One case is when lock recovery after apparent server restart
fails with NFS4ERR_DENIED, NFS4ERR_RECLAIM_BAD, or
NFS4ERRO_RECLAIM_CONFLICT.  The other case is when a lock
attempt as part of lease recovery fails with NFS4ERR_DENIED.

In an ideal world, these should not happen.  However I have
a packet trace showing an NFSv4.1 session getting
NFS4ERR_BADSESSION after an extended network parition.  The
NFSv4.1 client treats this like server reboot until/unless
it get NFS4ERR_NO_GRACE, in which case it switches over to
"nograce" recovery mode.  In this network trace, the client
attempts to recover a lock and the server (incorrectly)
reports NFS4ERR_DENIED rather than NFS4ERR_NO_GRACE.  This
leads to the ineffective comment and the client then
continues to write using the OPEN stateid.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30 07:48:52 +02:00
Al Viro
03bb758894 do d_instantiate/unlock_new_inode combinations safely
commit 1e2e547a93a00ebc21582c06ca3c6cfea2a309ee upstream.

For anything NFS-exported we do _not_ want to unlock new inode
before it has grown an alias; original set of fixes got the
ordering right, but missed the nasty complication in case of
lockdep being enabled - unlock_new_inode() does
	lockdep_annotate_inode_mutex_key(inode)
which can only be done before anyone gets a chance to touch
->i_mutex.  Unfortunately, flipping the order and doing
unlock_new_inode() before d_instantiate() opens a window when
mkdir can race with open-by-fhandle on a guessed fhandle, leading
to multiple aliases for a directory inode and all the breakage
that follows from that.

	Correct solution: a new primitive (d_instantiate_new())
combining these two in the right order - lockdep annotate, then
d_instantiate(), then the rest of unlock_new_inode().  All
combinations of d_instantiate() with unlock_new_inode() should
be converted to that.

Cc: stable@kernel.org	# 2.6.29 and later
Tested-by: Mike Marshall <hubcap@omnibond.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30 07:48:52 +02:00
Brian Foster
1a54258e10 xfs: remove racy hasattr check from attr ops
commit 5a93790d4e2df73e30c965ec6e49be82fc3ccfce upstream.

xfs_attr_[get|remove]() have unlocked attribute fork checks to optimize
away a lock cycle in cases where the fork does not exist or is otherwise
empty. This check is not safe, however, because an attribute fork short
form to extent format conversion includes a transient state that causes
the xfs_inode_hasattr() check to fail. Specifically,
xfs_attr_shortform_to_leaf() creates an empty extent format attribute
fork and then adds the existing shortform attributes to it.

This means that lookup of an existing xattr can spuriously return
-ENOATTR when racing against a setxattr that causes the associated
format conversion. This was originally reproduced by an untar on a
particularly configured glusterfs volume, but can also be reproduced on
demand with properly crafted xattr requests.

The format conversion occurs under the exclusive ilock. xfs_attr_get()
and xfs_attr_remove() already have the proper locking and checks further
down in the functions to handle this situation correctly. Drop the
unlocked checks to avoid the spurious failure and rely on the existing
logic.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Daniel Sangorrin <daniel.sangorrin@toshiba.co.jp>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30 07:48:52 +02:00
Al Viro
1ac501e05f aio: fix io_destroy(2) vs. lookup_ioctx() race
commit baf10564fbb66ea222cae66fbff11c444590ffd9 upstream.

kill_ioctx() used to have an explicit RCU delay between removing the
reference from ->ioctx_table and percpu_ref_kill() dropping the refcount.
At some point that delay had been removed, on the theory that
percpu_ref_kill() itself contained an RCU delay.  Unfortunately, that was
the wrong kind of RCU delay and it didn't care about rcu_read_lock() used
by lookup_ioctx().  As the result, we could get ctx freed right under
lookup_ioctx().  Tejun has fixed that in a6d7cff472e ("fs/aio: Add explicit
RCU grace period when freeing kioctx"); however, that fix is not enough.

Suppose io_destroy() from one thread races with e.g. io_setup() from another;
CPU1 removes the reference from current->mm->ioctx_table[...] just as CPU2
has picked it (under rcu_read_lock()).  Then CPU1 proceeds to drop the
refcount, getting it to 0 and triggering a call of free_ioctx_users(),
which proceeds to drop the secondary refcount and once that reaches zero
calls free_ioctx_reqs().  That does
        INIT_RCU_WORK(&ctx->free_rwork, free_ioctx);
        queue_rcu_work(system_wq, &ctx->free_rwork);
and schedules freeing the whole thing after RCU delay.

In the meanwhile CPU2 has gotten around to percpu_ref_get(), bumping the
refcount from 0 to 1 and returned the reference to io_setup().

Tejun's fix (that queue_rcu_work() in there) guarantees that ctx won't get
freed until after percpu_ref_get().  Sure, we'd increment the counter before
ctx can be freed.  Now we are out of rcu_read_lock() and there's nothing to
stop freeing of the whole thing.  Unfortunately, CPU2 assumes that since it
has grabbed the reference, ctx is *NOT* going away until it gets around to
dropping that reference.

The fix is obvious - use percpu_ref_tryget_live() and treat failure as miss.
It's not costlier than what we currently do in normal case, it's safe to
call since freeing *is* delayed and it closes the race window - either
lookup_ioctx() comes before percpu_ref_kill() (in which case ctx->users
won't reach 0 until the caller of lookup_ioctx() drops it) or lookup_ioctx()
fails, ctx->users is unaffected and caller of lookup_ioctx() doesn't see
the object in question at all.

Cc: stable@kernel.org
Fixes: a6d7cff472e "fs/aio: Add explicit RCU grace period when freeing kioctx"
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30 07:48:51 +02:00
Al Viro
4d21e6e82c affs_lookup(): close a race with affs_remove_link()
commit 30da870ce4a4e007c901858a96e9e394a1daa74a upstream.

we unlock the directory hash too early - if we are looking at secondary
link and primary (in another directory) gets removed just as we unlock,
we could have the old primary moved in place of the secondary, leaving
us to look into freed entry (and leaving our dentry with ->d_fsdata
pointing to a freed entry).

Cc: stable@vger.kernel.org # 2.4.4+
Acked-by: David Sterba <dsterba@suse.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30 07:48:51 +02:00