While the inode cache caching kthread is calling btrfs_unpin_free_ino(),
we could have a concurrent call to btrfs_return_ino() that adds a new
entry to the root's free space cache of pinned inodes. This concurrent
call does not acquire the fs_info->commit_root_sem before adding a new
entry if the caching state is BTRFS_CACHE_FINISHED, which is a problem
because the caching kthread calls btrfs_unpin_free_ino() after setting
the caching state to BTRFS_CACHE_FINISHED and therefore races with
the task calling btrfs_return_ino(), which is adding a new entry, while
the former (caching kthread) is navigating the cache's rbtree, removing
and freeing nodes from the cache's rbtree without acquiring the spinlock
that protects the rbtree.
This race resulted in memory corruption due to double free of struct
btrfs_free_space objects because both tasks can end up doing freeing the
same objects. Note that adding a new entry can result in merging it with
other entries in the cache, in which case those entries are freed.
This is particularly important as btrfs_free_space structures are also
used for the block group free space caches.
This memory corruption can be detected by a debugging kernel, which
reports it with the following trace:
[132408.501148] slab error in verify_redzone_free(): cache `btrfs_free_space': double free detected
[132408.505075] CPU: 15 PID: 12248 Comm: btrfs-ino-cache Tainted: G W 4.1.0-rc5-btrfs-next-10+ #1
[132408.505075] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
[132408.505075] ffff880023e7d320 ffff880163d73cd8 ffffffff8145eec7 ffffffff81095dce
[132408.505075] ffff880009735d40 ffff880163d73ce8 ffffffff81154e1e ffff880163d73d68
[132408.505075] ffffffff81155733 ffffffffa054a95a ffff8801b6099f00 ffffffffa0505b5f
[132408.505075] Call Trace:
[132408.505075] [<ffffffff8145eec7>] dump_stack+0x4f/0x7b
[132408.505075] [<ffffffff81095dce>] ? console_unlock+0x356/0x3a2
[132408.505075] [<ffffffff81154e1e>] __slab_error.isra.28+0x25/0x36
[132408.505075] [<ffffffff81155733>] __cache_free+0xe2/0x4b6
[132408.505075] [<ffffffffa054a95a>] ? __btrfs_add_free_space+0x2f0/0x343 [btrfs]
[132408.505075] [<ffffffffa0505b5f>] ? btrfs_unpin_free_ino+0x8e/0x99 [btrfs]
[132408.505075] [<ffffffff810f3b30>] ? time_hardirqs_off+0x15/0x28
[132408.505075] [<ffffffff81084d42>] ? trace_hardirqs_off+0xd/0xf
[132408.505075] [<ffffffff811563a1>] ? kfree+0xb6/0x14e
[132408.505075] [<ffffffff811563d0>] kfree+0xe5/0x14e
[132408.505075] [<ffffffffa0505b5f>] btrfs_unpin_free_ino+0x8e/0x99 [btrfs]
[132408.505075] [<ffffffffa0505e08>] caching_kthread+0x29e/0x2d9 [btrfs]
[132408.505075] [<ffffffffa0505b6a>] ? btrfs_unpin_free_ino+0x99/0x99 [btrfs]
[132408.505075] [<ffffffff8106698f>] kthread+0xef/0xf7
[132408.505075] [<ffffffff810f3b08>] ? time_hardirqs_on+0x15/0x28
[132408.505075] [<ffffffff810668a0>] ? __kthread_parkme+0xad/0xad
[132408.505075] [<ffffffff814653d2>] ret_from_fork+0x42/0x70
[132408.505075] [<ffffffff810668a0>] ? __kthread_parkme+0xad/0xad
[132408.505075] ffff880023e7d320: redzone 1:0x9f911029d74e35b, redzone 2:0x9f911029d74e35b.
[132409.501654] slab: double free detected in cache 'btrfs_free_space', objp ffff880023e7d320
[132409.503355] ------------[ cut here ]------------
[132409.504241] kernel BUG at mm/slab.c:2571!
Therefore fix this by having btrfs_unpin_free_ino() acquire the lock
that protects the rbtree while doing the searches and removing entries.
Fixes: 1c70d8fb4d ("Btrfs: fix inode caching vs tree log")
Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
The free space entries are allocated using kmem_cache_zalloc(),
through __btrfs_add_free_space(), therefore we should use
kmem_cache_free() and not kfree() to avoid any confusion and
any potential problem. Looking at the kfree() definition at
mm/slab.c it has the following comment:
/*
* (...)
*
* Don't free memory not originally allocated by kmalloc()
* or you will run into trouble.
*/
So better be safe and use kmem_cache_free().
Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
Although it is a rare case, we'd better free previous allocated
memory on error.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
It is introduced by:
c404e0dc2c
Btrfs: fix use-after-free in the finishing procedure of the device replace
But seems no relationship with that bug, this patch revirt these
code block for cleanup.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Currently, we can only set a limitation on a qgroup, but we
can not clear it.
This patch provide a choice to user to clear a limitation on
qgroup by passing a value of CLEAR_VALUE(-1) to kernel.
Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
In a dfs setup where the client transitions from a server which supports
posix paths to a server which doesn't support posix paths, the flag
CIFS_MOUNT_POSIX_PATHS is not reset. This leads to the wrong directory
separator being used causing smb commands to fail.
Consider the following case where a dfs share on a samba server points
to a share on windows smb server.
# mount -t cifs -o .. //vm140-31/dfsroot/testwin/
# ls -l /mnt; touch /mnt/a
total 0
touch: cannot touch ‘/mnt/a’: No such file or directory
Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Acked-by: Jeff Layton <jlayton@samba.org>
Signed-off-by: Steve French <steve.french@primarydata.com>
4 drivers / enabling modules:
NFIT:
Instantiates an "nvdimm bus" with the core and registers memory devices
(NVDIMMs) enumerated by the ACPI 6.0 NFIT (NVDIMM Firmware Interface
table). After registering NVDIMMs the NFIT driver then registers
"region" devices. A libnvdimm-region defines an access mode and the
boundaries of persistent memory media. A region may span multiple
NVDIMMs that are interleaved by the hardware memory controller. In
turn, a libnvdimm-region can be carved into a "namespace" device and
bound to the PMEM or BLK driver which will attach a Linux block device
(disk) interface to the memory.
PMEM:
Initially merged in v4.1 this driver for contiguous spans of persistent
memory address ranges is re-worked to drive PMEM-namespaces emitted by
the libnvdimm-core. In this update the PMEM driver, on x86, gains the
ability to assert that writes to persistent memory have been flushed all
the way through the caches and buffers in the platform to persistent
media. See memcpy_to_pmem() and wmb_pmem().
BLK:
This new driver enables access to persistent memory media through "Block
Data Windows" as defined by the NFIT. The primary difference of this
driver to PMEM is that only a small window of persistent memory is
mapped into system address space at any given point in time. Per-NVDIMM
windows are reprogrammed at run time, per-I/O, to access different
portions of the media. BLK-mode, by definition, does not support DAX.
BTT:
This is a library, optionally consumed by either PMEM or BLK, that
converts a byte-accessible namespace into a disk with atomic sector
update semantics (prevents sector tearing on crash or power loss). The
sinister aspect of sector tearing is that most applications do not know
they have a atomic sector dependency. At least today's disk's rarely
ever tear sectors and if they do one almost certainly gets a CRC error
on access. NVDIMMs will always tear and always silently. Until an
application is audited to be robust in the presence of sector-tearing
the usage of BTT is recommended.
Thanks to: Ross Zwisler, Jeff Moyer, Vishal Verma, Christoph Hellwig,
Ingo Molnar, Neil Brown, Boaz Harrosh, Robert Elliott, Matthew Wilcox,
Andy Rudoff, Linda Knippers, Toshi Kani, Nicholas Moulin, Rafael
Wysocki, and Bob Moore.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJVjZGBAAoJEB7SkWpmfYgC4fkP/j+k6HmSRNU/yRYPyo7CAWvj
3P5P1i6R6nMZZbjQrQArAXaIyLlFk4sEQDYsciR6dmslhhFZAkR2eFwVO5rBOyx3
QN0yxEpyjJbroRFUrV/BLaFK4cq2oyJAFFHs0u7/pLHBJ4MDMqfRKAMtlnBxEkTE
LFcqXapSlvWitSbjMdIBWKFEvncaiJ2mdsFqT4aZqclBBTj00eWQvEG9WxleJLdv
+tj7qR/vGcwOb12X5UrbQXgwtMYos7A6IzhHbqwQL8IrOcJ6YB8NopJUpLDd7ZVq
KAzX6ZYMzNueN4uvv6aDfqDRLyVL7qoxM9XIjGF5R8SV9sF2LMspm1FBpfowo1GT
h2QMr0ky1nHVT32yspBCpE9zW/mubRIDtXxEmZZ53DIc4N6Dy9jFaNVmhoWtTAqG
b9pndFnjUzzieCjX5pCvo2M5U6N0AQwsnq76/CasiWyhSa9DNKOg8MVDRg0rbxb0
UvK0v8JwOCIRcfO3qiKcx+02nKPtjCtHSPqGkFKPySRvAdb+3g6YR26CxTb3VmnF
etowLiKU7HHalLvqGFOlDoQG6viWes9Zl+ZeANBOCVa6rL2O7ZnXJtYgXf1wDQee
fzgKB78BcDjXH4jHobbp/WBANQGN/GF34lse8yHa7Ym+28uEihDvSD1wyNLnefmo
7PJBbN5M5qP5tD0aO7SZ
=VtWG
-----END PGP SIGNATURE-----
Merge tag 'libnvdimm-for-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm
Pull libnvdimm subsystem from Dan Williams:
"The libnvdimm sub-system introduces, in addition to the
libnvdimm-core, 4 drivers / enabling modules:
NFIT:
Instantiates an "nvdimm bus" with the core and registers memory
devices (NVDIMMs) enumerated by the ACPI 6.0 NFIT (NVDIMM Firmware
Interface table).
After registering NVDIMMs the NFIT driver then registers "region"
devices. A libnvdimm-region defines an access mode and the
boundaries of persistent memory media. A region may span multiple
NVDIMMs that are interleaved by the hardware memory controller. In
turn, a libnvdimm-region can be carved into a "namespace" device and
bound to the PMEM or BLK driver which will attach a Linux block
device (disk) interface to the memory.
PMEM:
Initially merged in v4.1 this driver for contiguous spans of
persistent memory address ranges is re-worked to drive
PMEM-namespaces emitted by the libnvdimm-core.
In this update the PMEM driver, on x86, gains the ability to assert
that writes to persistent memory have been flushed all the way
through the caches and buffers in the platform to persistent media.
See memcpy_to_pmem() and wmb_pmem().
BLK:
This new driver enables access to persistent memory media through
"Block Data Windows" as defined by the NFIT. The primary difference
of this driver to PMEM is that only a small window of persistent
memory is mapped into system address space at any given point in
time.
Per-NVDIMM windows are reprogrammed at run time, per-I/O, to access
different portions of the media. BLK-mode, by definition, does not
support DAX.
BTT:
This is a library, optionally consumed by either PMEM or BLK, that
converts a byte-accessible namespace into a disk with atomic sector
update semantics (prevents sector tearing on crash or power loss).
The sinister aspect of sector tearing is that most applications do
not know they have a atomic sector dependency. At least today's
disk's rarely ever tear sectors and if they do one almost certainly
gets a CRC error on access. NVDIMMs will always tear and always
silently. Until an application is audited to be robust in the
presence of sector-tearing the usage of BTT is recommended.
Thanks to: Ross Zwisler, Jeff Moyer, Vishal Verma, Christoph Hellwig,
Ingo Molnar, Neil Brown, Boaz Harrosh, Robert Elliott, Matthew Wilcox,
Andy Rudoff, Linda Knippers, Toshi Kani, Nicholas Moulin, Rafael
Wysocki, and Bob Moore"
* tag 'libnvdimm-for-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm: (33 commits)
arch, x86: pmem api for ensuring durability of persistent memory updates
libnvdimm: Add sysfs numa_node to NVDIMM devices
libnvdimm: Set numa_node to NVDIMM devices
acpi: Add acpi_map_pxm_to_online_node()
libnvdimm, nfit: handle unarmed dimms, mark namespaces read-only
pmem: flag pmem block devices as non-rotational
libnvdimm: enable iostat
pmem: make_request cleanups
libnvdimm, pmem: fix up max_hw_sectors
libnvdimm, blk: add support for blk integrity
libnvdimm, btt: add support for blk integrity
fs/block_dev.c: skip rw_page if bdev has integrity
libnvdimm: Non-Volatile Devices
tools/testing/nvdimm: libnvdimm unit test infrastructure
libnvdimm, nfit, nd_blk: driver for BLK-mode access persistent memory
nd_btt: atomic sector updates
libnvdimm: infrastructure for btt devices
libnvdimm: write blk label set
libnvdimm: write pmem label set
libnvdimm: blk labels and namespace instantiation
...
The only caller that cares about its return value can just
as easily pick it from nd->root_seq itself. We used to just
calculate it and return to caller, but these days we are
storing it in nd->root_seq in all cases.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
It is possible to have an active open with one mode, and a delegation
for the same file with a different mode.
In particular, a WR_ONLY open and an RD_ONLY delegation.
This happens if a WR_ONLY open is followed by a RD_ONLY open which
provides a delegation, but is then close.
When returning the delegation, we currently try to claim opens for
every open type (n_rdwr, n_rdonly, n_wronly). As there is no harm
in claiming an open for a mode that we already have, this is often
simplest.
However if the delegation only provides a subset of the modes that we
currently have open, this will produce an error from the server.
So when claiming open modes prior to returning a delegation, skip the
open request if the mode is not covered by the delegation - the open_stateid
must already cover that mode, so there is nothing to do.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Send negotiate contexts when SMB3.11 dialect is negotiated
(ie the preauth and the encryption contexts) and
Initialize SMB3.11 preauth negotiate context salt to random bytes
Followon patch will update session setup and tree connect
Signed-off-by: Steve French <steve.french@primarydata.com>
set integrity increases reliability of files stored on SMB3 servers.
Add ioctl to allow setting this on files on SMB3 and later mounts.
Signed-off-by: Steve French <steve.french@primarydata.com>
Getting fantastic copy performance with cp --reflink over SMB3.11
using the new FSCTL_DUPLICATE_EXTENTS.
This FSCTL was added in the SMB3.11 dialect (testing was
against REFS file system) so have put it as a 3.11 protocol
specific operation ("vers=3.1.1" on the mount). Tested at
the SMB3 plugfest in Redmond.
It depends on the new FS Attribute (BLOCK_REFCOUNTING) which
is used to advertise support for the ability to do this ioctl
(if you can support multiple files pointing to the same block
than this refcounting ability or equivalent is needed to
support the new reflink-like duplicate extent SMB3 ioctl.
Signed-off-by: Steve French <steve.french@primarydata.com>
Pull UML updates from Richard Weinberger:
- remove hppfs ("HonePot ProcFS")
- initial support for musl libc
- uaccess cleanup
- random cleanups and bug fixes all over the place
* 'for-linus-4.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml: (21 commits)
um: Don't pollute kernel namespace with uapi
um: Include sys/types.h for makedev(), major(), minor()
um: Do not use stdin and stdout identifiers for struct members
um: Do not use __ptr_t type for stack_t's .ss pointer
um: Fix mconsole dependency
um: Handle tracehook_report_syscall_entry() result
um: Remove copy&paste code from init.h
um: Stop abusing __KERNEL__
um: Catch unprotected user memory access
um: Fix warning in setup_signal_stack_si()
um: Rework uaccess code
um: Add uaccess.h to ldt.c
um: Add uaccess.h to syscalls_64.c
um: Add asm/elf.h to vma.c
um: Cleanup mem_32/64.c headers
um: Remove hppfs
um: Move syscall() declaration into os.h
um: kernel: ksyms: Export symbol syscall() for fixing modpost issue
um/os-Linux: Use char[] for syscall_stub declarations
um: Use char[] for linker script address declarations
...
Most people think of SMB 3.1.1 as SMB version 3.11 so add synonym
for "vers=3.1.1" of "vers=3.11" on mount.
Also make sure that unlike SMB3.0 and 3.02 we don't send
validate negotiate on mount (it is handled by negotiate contexts) -
add list of SMB3.11 specific functions (distinct from 3.0 dialect).
Signed-off-by: Steve French <steve.french@primarydata.com>w
Add new structures and defines for SMB3.11 negotiate, session setup and tcon
See MS-SMB2-diff.pdf section 2.2.3 for additional protocol documentation.
Reviewed-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Steve French <steve.french@primarydata.com>
Parses and recognizes "vers=3.1.1" on cifs mount and allows sending
0x0311 as a new CIFS/SMB3 dialect. Subsequent patches will add
the new negotiate contexts and updated session setup
Reviewed-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Steve French <steve.french@primarydata.com>
[MS-SMB] 2.2.4.5.2.1 states:
"ChallengeLength (1 byte): When the CAP_EXTENDED_SECURITY bit is set,
the server MUST set this value to zero and clients MUST ignore this
value."
Signed-off-by: Noel Power <noel.power@suse.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Pull security subsystem updates from James Morris:
"The main change in this kernel is Casey's generalized LSM stacking
work, which removes the hard-coding of Capabilities and Yama stacking,
allowing multiple arbitrary "small" LSMs to be stacked with a default
monolithic module (e.g. SELinux, Smack, AppArmor).
See
https://lwn.net/Articles/636056/
This will allow smaller, simpler LSMs to be incorporated into the
mainline kernel and arbitrarily stacked by users. Also, this is a
useful cleanup of the LSM code in its own right"
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (38 commits)
tpm, tpm_crb: fix le64_to_cpu conversions in crb_acpi_add()
vTPM: set virtual device before passing to ibmvtpm_reset_crq
tpm_ibmvtpm: remove unneccessary message level.
ima: update builtin policies
ima: extend "mask" policy matching support
ima: add support for new "euid" policy condition
ima: fix ima_show_template_data_ascii()
Smack: freeing an error pointer in smk_write_revoke_subj()
selinux: fix setting of security labels on NFS
selinux: Remove unused permission definitions
selinux: enable genfscon labeling for sysfs and pstore files
selinux: enable per-file labeling for debugfs files.
selinux: update netlink socket classes
signals: don't abuse __flush_signals() in selinux_bprm_committed_creds()
selinux: Print 'sclass' as string when unrecognized netlink message occurs
Smack: allow multiple labels in onlycap
Smack: fix seq operations in smackfs
ima: pass iint to ima_add_violation()
ima: wrap event related data to the new ima_event_data structure
integrity: add validity checks for 'path' parameter
...
With gcc 3.4.6/4.1.2/4.2.4 (not with 4.4.7/4.6.4/4.8.4):
CC fs/block_dev.o
include/linux/fs.h:804: warning: ‘I_BDEV’ declared inline after being called
include/linux/fs.h:804: warning: previous declaration of ‘I_BDEV’ was here
Commit a212b105b0 ("bdi: make inode_to_bdi() inline") added a
caller of I_BDEV() in a header file, exposing the bogus "inline" on the
exported implementation.
Drop the "inline" keyword to fix this.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
Pull nfsd updates from Bruce Fields:
"A relatively quiet cycle, with a mix of cleanup and smaller bugfixes"
* 'for-4.2' of git://linux-nfs.org/~bfields/linux: (24 commits)
sunrpc: use sg_init_one() in krb5_rc4_setup_enc/seq_key()
nfsd: wrap too long lines in nfsd4_encode_read
nfsd: fput rd_file from XDR encode context
nfsd: take struct file setup fully into nfs4_preprocess_stateid_op
nfsd: refactor nfs4_preprocess_stateid_op
nfsd: clean up raparams handling
nfsd: use swap() in sort_pacl_range()
rpcrdma: Merge svcrdma and xprtrdma modules into one
svcrdma: Add a separate "max data segs macro for svcrdma
svcrdma: Replace GFP_KERNEL in a loop with GFP_NOFAIL
svcrdma: Keep rpcrdma_msg fields in network byte-order
svcrdma: Fix byte-swapping in svc_rdma_sendto.c
nfsd: Update callback sequnce id only CB_SEQUENCE success
nfsd: Reset cb_status in nfsd4_cb_prepare() at retrying
svcrdma: Remove svc_rdma_xdr_decode_deferred_req()
SUNRPC: Move EXPORT_SYMBOL for svc_process
uapi/nfs: Add NFSv4.1 ACL definitions
nfsd: Remove dead declarations
nfsd: work around a gcc-5.1 warning
nfsd: Checking for acl support does not require fetching any acls
...
Here is a list of patches we've accumulated for GFS2 for the current upstream
merge window. We have a good mixture this time. Here are some of the features:
1. Fix a problem with RO mounts writing to the journal.
2. Further improvements to quotas on GFS2.
3. Added support for rename2 and RENAME_EXCHANGE on GFS2.
4. Increase performance by making glock lru_list less of a bottleneck.
5. Increase performance by avoiding unnecessary buffer_head releases.
6. Increase performance by using average glock round trip time from all CPUs.
7. Fixes for some compiler warnings and minor white space issues.
8. Other misc. bug fixes
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEbBAABAgAGBQJVjFEXAAoJENeLYdPf93o7zqoH926XV0oddQCsGCYg6gq7OL+c
/4q2y9x31hkv5XbPTNcahqR6UsK8JcbcdZD+XAqTftL4Q789FDAdWbYS++45qz8D
YuUFgZd2bc75Ge2qqacgEdv85YRLtws1fRnI6DUpjfN1qJ9kJXX+gk+BSve6rh6V
Qyv8a4DRIw1En5fwt0R6yIg0LI/ywPhYeVlxo6WUoK8fiL/i3eNd57Jgv5YQ6ly+
ZyqH8w1m0kE4IkIiTFgmIpvepiWLBCA3mPOfHfE3QxbDKpXe4uFsMdnxghmP5bFN
3H9syJQNFqjs+ooKma/fE2VmpxQ5/5lAs0/ms+ECW3GvBseTll8Iln7y4NwgAQ==
=ITwD
-----END PGP SIGNATURE-----
Merge tag 'gfs2-merge-window' of git://git.kernel.org:/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull GFS2 updates from Bob Peterson:
"Here are the patches we've accumulated for GFS2 for the current
upstream merge window. We have a good mixture this time. Here are
some of the features:
- Fix a problem with RO mounts writing to the journal.
- Further improvements to quotas on GFS2.
- Added support for rename2 and RENAME_EXCHANGE on GFS2.
- Increase performance by making glock lru_list less of a bottleneck.
- Increase performance by avoiding unnecessary buffer_head releases.
- Increase performance by using average glock round trip time from all CPUs.
- Fixes for some compiler warnings and minor white space issues.
- Other misc bug fixes"
* tag 'gfs2-merge-window' of git://git.kernel.org:/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
GFS2: Don't brelse rgrp buffer_heads every allocation
GFS2: Don't add all glocks to the lru
gfs2: Don't support fallocate on jdata files
gfs2: s64 cast for negative quota value
gfs2: limit quota log messages
gfs2: fix quota updates on block boundaries
gfs2: fix shadow warning in gfs2_rbm_find()
gfs2: kerneldoc warning fixes
gfs2: convert simple_str to kstr
GFS2: make sure S_NOSEC flag isn't overwritten
GFS2: add support for rename2 and RENAME_EXCHANGE
gfs2: handle NULL rgd in set_rgrp_preferences
GFS2: inode.c: indent with TABs, not spaces
GFS2: mark the journal idle to fix ro mounts
GFS2: Average in only non-zero round-trip times for congestion stats
GFS2: Use average srttb value in congestion calculations
This reverts commit 2143c1965a.
This commit seems to be the cause of the following jbd2 assertion
failure:
------------[ cut here ]------------
kernel BUG at fs/jbd2/transaction.c:1325!
invalid opcode: 0000 [#1] SMP
Modules linked in: bnep bluetooth fuse ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 ...
CPU: 7 PID: 5509 Comm: gcc Not tainted 4.1.0-10944-g2a298679b411 #1
Hardware name: /DH87RL, BIOS RLH8710H.86A.0327.2014.0924.1645 09/24/2014
task: ffff8803bf866040 ti: ffff880308528000 task.ti: ffff880308528000
RIP: jbd2_journal_dirty_metadata+0x237/0x290
Call Trace:
__ext4_handle_dirty_metadata+0x43/0x1f0
ext4_handle_dirty_dirent_node+0xde/0x160
? jbd2_journal_get_write_access+0x36/0x50
ext4_delete_entry+0x112/0x160
? __ext4_journal_start_sb+0x52/0xb0
ext4_unlink+0xfa/0x260
vfs_unlink+0xec/0x190
do_unlinkat+0x24a/0x270
SyS_unlink+0x11/0x20
entry_SYSCALL_64_fastpath+0x12/0x6a
---[ end trace ae033ebde8d080b4 ]---
which is not easily reproducible (I've seen it just once, and then Ted
was able to reproduce it once). Revert it while Ted and Jan try to
figure out what is wrong.
Cc: Jan Kara <jack@suse.cz>
Acked-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make it so, by checking the return value for NFS4ERR_MOTSUPP and
caching the information as a server capability.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
According to the spec, the server is only returning the status,
which we decode in the op header.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Pull cgroup updates from Tejun Heo:
- threadgroup_lock got reorganized so that its users can pick the
actual locking mechanism to use. Its only user - cgroups - is
updated to use a percpu_rwsem instead of per-process rwsem.
This makes things a bit lighter on hot paths and allows cgroups to
perform and fail multi-task (a process) migrations atomically.
Multi-task migrations are used in several places including the
unified hierarchy.
- Delegation rule and documentation added to unified hierarchy. This
will likely be the last interface update from the cgroup core side
for unified hierarchy before lifting the devel mask.
- Some groundwork for the pids controller which is scheduled to be
merged in the coming devel cycle.
* 'for-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: add delegation section to unified hierarchy documentation
cgroup: require write perm on common ancestor when moving processes on the default hierarchy
cgroup: separate out cgroup_procs_write_permission() from __cgroup_procs_write()
kernfs: make kernfs_get_inode() public
MAINTAINERS: add a cgroup core co-maintainer
cgroup: fix uninitialised iterator in for_each_subsys_which
cgroup: replace explicit ss_mask checking with for_each_subsys_which
cgroup: use bitmask to filter for_each_subsys
cgroup: add seq_file forward declaration for struct cftype
cgroup: simplify threadgroup locking
sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem
sched, cgroup: reorganize threadgroup locking
cgroup: switch to unsigned long for bitmasks
cgroup: reorganize include/linux/cgroup.h
cgroup: separate out include/linux/cgroup-defs.h
cgroup: fix some comment typos
Here is the driver core / firmware changes for 4.2-rc1.
A number of small changes all over the place in the driver core, and in
the firmware subsystem. Nothing really major, full details in the
shortlog. Some of it is a bit of churn, given that the platform driver
probing changes was found to not work well, so they were reverted.
All of these have been in linux-next for a while with no reported
issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iEYEABECAAYFAlWNoCQACgkQMUfUDdst+ym4JACdFrrXoMt2pb8nl5gMidGyM9/D
jg8AnRgdW8ArDA/xOarULd/X43eA3J3C
=Al2B
-----END PGP SIGNATURE-----
Merge tag 'driver-core-4.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg KH:
"Here is the driver core / firmware changes for 4.2-rc1.
A number of small changes all over the place in the driver core, and
in the firmware subsystem. Nothing really major, full details in the
shortlog. Some of it is a bit of churn, given that the platform
driver probing changes was found to not work well, so they were
reverted.
All of these have been in linux-next for a while with no reported
issues"
* tag 'driver-core-4.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (31 commits)
Revert "base/platform: Only insert MEM and IO resources"
Revert "base/platform: Continue on insert_resource() error"
Revert "of/platform: Use platform_device interface"
Revert "base/platform: Remove code duplication"
firmware: add missing kfree for work on async call
fs: sysfs: don't pass count == 0 to bin file readers
base:dd - Fix for typo in comment to function driver_deferred_probe_trigger().
base/platform: Remove code duplication
of/platform: Use platform_device interface
base/platform: Continue on insert_resource() error
base/platform: Only insert MEM and IO resources
firmware: use const for remaining firmware names
firmware: fix possible use after free on name on asynchronous request
firmware: check for file truncation on direct firmware loading
firmware: fix __getname() missing failure check
drivers: of/base: move of_init to driver_init
drivers/base: cacheinfo: fix annoying typo when DT nodes are absent
sysfs: disambiguate between "error code" and "failure" in comments
driver-core: fix build for !CONFIG_MODULES
driver-core: make __device_attach() static
...
hdr->good_bytes needs to be set to the length of the request, not
zero.
Cc: stable@vger.kernel.org # 4.0+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
This patch ensures that we record the value of 'ffl_flags' from
the layout, and then checks for the presence of the
FF_FLAGS_NO_LAYOUTCOMMIT flag before deciding whether or not to
call pnfs_set_layoutcommit().
The effect is that servers now can decide whether or not they want
the client to call layoutcommit before returning a writeable layout.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
* layoutstats:
pnfs/flexfiles: protect ktime manipulation with mirror lock
nfs: provide pnfs_report_layoutstat when NFS42 is disabled
pnfs/flexfiles: report layoutstat regularly
nfs42: serialize LAYOUTSTATS calls of the same file
pnfs/flexfiles: encode LAYOUTSTATS flexfiles specific data
pnfs/flexfiles: add ff_layout_prepare_layoutstats
pNFS/flexfiles: track when layout is first used
pNFS/flexfiles: add layoutstats tracking
pNFS/flexfiles: Remove unused struct members user_name, group_name
pnfs: add pnfs_report_layoutstat helper function
pNFS: fill in nfs42_layoutstat_ops
NFSv.2/pnfs Add a LAYOUTSTATS rpc function
It looks as if xchg() and cmpxchg() are not available for 64-bit integers on sparc32:
> New breakage seen in linux-next today:
>
> ERROR: "__xchg_called_with_bad_pointer" [fs/nfs/flexfilelayout/nfs_layout_flexfiles.ko] undefined!
> ERROR: "__cmpxchg_called_with_bad_pointer" [fs/nfs/flexfilelayout/nfs_layout_flexfiles.ko] undefined!
> make[2]: *** [__modpost] Error 1
> make[1]: *** [modules] Error 2
Given that mirror ktime manipulation is already under mirror->lock, let's make use of the fact.
Reported-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
kbuild test robot reported:
fs/built-in.o: In function `pnfs_report_layoutstat':
>> (.text+0x151a1c): undefined reference to `nfs42_proc_layoutstats_generic'
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Merge second patchbomb from Andrew Morton:
- most of the rest of MM
- lots of misc things
- procfs updates
- printk feature work
- updates to get_maintainer, MAINTAINERS, checkpatch
- lib/ updates
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (96 commits)
exit,stats: /* obey this comment */
coredump: add __printf attribute to cn_*printf functions
coredump: use from_kuid/kgid when formatting corename
fs/reiserfs: remove unneeded cast
NILFS2: support NFSv2 export
fs/befs/btree.c: remove unneeded initializations
fs/minix: remove unneeded cast
init/do_mounts.c: add create_dev() failure log
kasan: remove duplicate definition of the macro KASAN_FREE_PAGE
fs/efs: femove unneeded cast
checkpatch: emit "NOTE: <types>" message only once after multiple files
checkpatch: emit an error when there's a diff in a changelog
checkpatch: validate MODULE_LICENSE content
checkpatch: add multi-line handling for PREFER_ETHER_ADDR_COPY
checkpatch: suggest using eth_zero_addr() and eth_broadcast_addr()
checkpatch: fix processing of MEMSET issues
checkpatch: suggest using ether_addr_equal*()
checkpatch: avoid NOT_UNIFIED_DIFF errors on cover-letter.patch files
checkpatch: remove local from codespell path
checkpatch: add --showfile to allow input via pipe to show filenames
...
If a block device has bio integrity enabled, rw_page will bypass the
integrity payload, which is undesirable. Skip rw_page if this is the
case.
Currently brd and zram provide rw_page, and the proposed 'nd' drivers
will too.
Cc: Jens Axboe <axboe@fb.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Suggested-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Vishal Verma <vishal.l.verma@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
This allows detecting improper format string at build time, like:
fs/coredump.c:225:5: warning: format '%ld' expects argument of type 'long int', but argument 3 has type 'int' [-Wformat=]
err = cn_printf(cn, "%ld", cprm->siginfo->si_signo);
^
As si_signo is always an int, the format should be %d here.
Signed-off-by: Nicolas Iooss <nicolas.iooss_linux@m4x.org>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When adding __printf attribute to cn_printf, gcc reports some issues:
fs/coredump.c:213:5: warning: format '%d' expects argument of type
'int', but argument 3 has type 'kuid_t' [-Wformat=]
err = cn_printf(cn, "%d", cred->uid);
^
fs/coredump.c:217:5: warning: format '%d' expects argument of type
'int', but argument 3 has type 'kgid_t' [-Wformat=]
err = cn_printf(cn, "%d", cred->gid);
^
These warnings come from the fact that the value of uid/gid needs to be
extracted from the kuid_t/kgid_t structure before being used as an
integer. More precisely, cred->uid and cred->gid need to be converted to
either user-namespace uid/gid or to init_user_ns uid/gid.
Use init_user_ns in order not to break existing ABI, and document this in
Documentation/sysctl/kernel.txt.
While at it, format uid and gid values with %u instead of %d because
uid_t/__kernel_uid32_t and gid_t/__kernel_gid32_t are unsigned int.
Signed-off-by: Nicolas Iooss <nicolas.iooss_linux@m4x.org>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The "fh_len" passed to ->fh_to_* is not guaranteed to be that same as that
returned by encode_fh - it may be larger.
With NFSv2, the filehandle is fixed length, so it may appear longer than
expected and be zero-padded.
So we must test that fh_len is at least some value, not exactly equal to
it.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
bh, od_sup and this_node are unconditionally initialized in
befs_bt_read_super() and befs_btree_find()
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This makes a very large function a little smaller.
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In one case, we eliminate a local variable; in the other a strlen()
call and some .text.
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 818411616b ("fs, proc: introduce /proc/<pid>/task/<tid>/children
entry") introduced the children entry for checkpoint restore and the
file is only available on kernels configured with CONFIG_EXPERT and
CONFIG_CHECKPOINT_RESTORE.
This is available in most distributions (Fedora, Debian, Ubuntu, CoreOS)
because they usually enable CONFIG_EXPERT and CONFIG_CHECKPOINT_RESTORE.
But Arch does not enable CONFIG_EXPERT or CONFIG_CHECKPOINT_RESTORE.
However, the children proc file is useful outside of checkpoint restore.
I would like to use it in rkt. The rkt process exec() another program
it does not control, and that other program will fork()+exec() a child
process. I would like to find the pid of the child process from an
external tool without iterating in /proc over all processes to find
which one has a parent pid equal to rkt.
This commit introduces CONFIG_PROC_CHILDREN and makes
CONFIG_CHECKPOINT_RESTORE select it. This allows enabling
/proc/<pid>/task/<tid>/children without needing to enable
CONFIG_CHECKPOINT_RESTORE and CONFIG_EXPERT.
Alban tested that /proc/<pid>/task/<tid>/children is present when the
kernel is configured with CONFIG_PROC_CHILDREN=y but without
CONFIG_CHECKPOINT_RESTORE
Signed-off-by: Iago López Galeiras <iago@endocode.com>
Tested-by: Alban Crequy <alban@endocode.com>
Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Djalal Harouni <djalal@endocode.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/proc/$PID/cmdline truncates output at PAGE_SIZE. It is easy to see with
$ cat /proc/self/cmdline $(seq 1037) 2>/dev/null
However, command line size was never limited to PAGE_SIZE but to 128 KB
and relatively recently limitation was removed altogether.
People noticed and ask questions:
http://stackoverflow.com/questions/199130/how-do-i-increase-the-proc-pid-cmdline-4096-byte-limit
seq file interface is not OK, because it kmalloc's for whole output and
open + read(, 1) + sleep will pin arbitrary amounts of kernel memory. To
not do that, limit must be imposed which is incompatible with arbitrary
sized command lines.
I apologize for hairy code, but this it direct consequence of command line
layout in memory and hacks to support things like "init [3]".
The loops are "unrolled" otherwise it is either macros which hide control
flow or functions with 7-8 arguments with equal line count.
There should be real setproctitle(2) or something.
[akpm@linux-foundation.org: fix a billion min() warnings]
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Tested-by: Jarod Wilson <jarod@redhat.com>
Acked-by: Jarod Wilson <jarod@redhat.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 9597c13b forbade opens with O_APPEND|O_DIRECT for NFSv4:
nfs: verify open flags before allowing an atomic open
Currently, you can open a NFSv4 file with O_APPEND|O_DIRECT, but cannot
fcntl(F_SETFL,...) with those flags. This flag combination is explicitly
forbidden on NFSv3 opens, and it seems like it should also be on NFSv4.
However, you can still open a file with O_DIRECT|O_APPEND if there exists a
cached dentry for the file because nfs4_file_open() is used instead of
nfs_atomic_open() and the check is bypassed. Add the check in
nfs4_file_open() as well.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
A ds can be associated with more than one mirror, but we currently skip
setting a mirror's credentials if we find that it's already set up with
a connected client.
The upshot is that we can end up sending DS writes with MDS credentials
instead of properly setting them up. Fix nfs4_ff_layout_prepare_ds to
always verify that the mirror's credentials are set up, even when we
have a DS that's already connected.
Reported-by: Tom Haynes <thomas.haynes@primarydata.com>
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Cc: stable@vger.kernel.org # 4.0+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>