page allocated in fuse_dentry_canonical_path to be handled in
fuse_dev_do_write is allocated using __get_free_pages(GFP_KERNEL).
This may not return a page with data filled with 0. Now this
page may not have a null terminator at all.
If this happens and userspace fuse daemon screws up by passing a string
to kernel which is not NULL terminated (or did not fill anything),
then inside fuse driver in kernel when we try to do
strlen(fuse_dev_write->kern_path->getname_kernel)
on that page data -> it may give us issue with kernel paging request.
Unable to handle kernel paging request at virtual address
------------[ cut here ]------------
<..>
PC is at strlen+0x10/0x90
LR is at getname_kernel+0x2c/0xf4
<..>
strlen+0x10/0x90
kern_path+0x28/0x4c
fuse_dev_do_write+0x5b8/0x694
fuse_dev_write+0x74/0x94
do_iter_readv_writev+0x80/0xb8
do_readv_writev+0xec/0x1cc
vfs_writev+0x54/0x64
SyS_writev+0x64/0xe4
el0_svc_naked+0x24/0x28
To avoid this we should ensure in case of FUSE_CANONICAL_PATH,
the page is null terminated.
Change-Id: I33ca7cc76b4472eaa982c67bb20685df451121f5
Signed-off-by: Ritesh Harjani <riteshh@codeaurora.org>
Bug: 75984715
[Daniel - small edit, using args size ]
Signed-off-by: Daniel Rosenberg <drosen@google.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAln62hMACgkQONu9yGCS
aT4KoRAAg9FasYL4oGtTiNglajWWP33eVv5NiwwdFdCfQGOEIzZPQy7ST/I/b1CB
Kql3D9oA7d8ZFtg05KyQb4csH+9WNjeLBr4G2NGZurEC0c6vbU/64ceADgzA4YKJ
RJ2iDWEQKq45RVH6BlrDyITu9H20TRgzcsQZ7fiswB3ZJsPTfdyvDlUlN3A4JhXY
2eTF3CNVPEGLnaT4PY5tuLZpLIZkQkZzw1xr9YCq9YA5aNYi2OywWbyQq27ruX2d
K238FiaYSN5LeUxU6JE2tk4CxrHJ0pOiw6kBiSgIv3MwDQa5iQypKVQA2tnAXHqL
rPb4cGAcDSQYzpCu4XimDlLEQhoAX2BceSakdYXoMu66AKewizSnopAljhPHp9uk
0GO6lSJv0f+NGoCpxOE2FDfMIwiPbLC9LfMDWqpFvPanMfMe156p6D+LL4GfTaus
x4oZZa61aPwjomobEM4hzZk5bp1AjkiDxKHCBvwpuVTOIFlxlVcuB4RyuY2VsuHN
4a/tw9iEHkyJYCt3tsePTltgrAws2j7KCWLx+F3LTXWzmZ9//9bFq63V6kIh0a2b
nPozkt0Xj7iygJwU1G2i5XAMTF5tPH8ELioGiakv0Rkj1ncMSXx1s2dO1uxR06a5
bx/MFLbo1AyZhE8Tk4LcT/rEHtjhj/24FX6sEq4xNjw/GvAzlp0=
=ScnL
-----END PGP SIGNATURE-----
Merge 4.4.96 into android-4.4
Changes in 4.4.96
workqueue: replace pool->manager_arb mutex with a flag
ALSA: hda/realtek - Add support for ALC236/ALC3204
ALSA: hda - fix headset mic problem for Dell machines with alc236
ceph: unlock dangling spinlock in try_flush_caps()
usb: xhci: Handle error condition in xhci_stop_device()
spi: uapi: spidev: add missing ioctl header
fuse: fix READDIRPLUS skipping an entry
xen/gntdev: avoid out of bounds access in case of partial gntdev_mmap()
Input: elan_i2c - add ELAN0611 to the ACPI table
Input: gtco - fix potential out-of-bound access
assoc_array: Fix a buggy node-splitting case
scsi: zfcp: fix erp_action use-before-initialize in REC action trace
scsi: sg: Re-fix off by one in sg_fill_request_table()
can: sun4i: fix loopback mode
can: kvaser_usb: Correct return value in printout
can: kvaser_usb: Ignore CMD_FLUSH_QUEUE_REPLY messages
regulator: fan53555: fix I2C device ids
x86/microcode/intel: Disable late loading on model 79
ecryptfs: fix dereference of NULL user_key_payload
Revert "drm: bridge: add DT bindings for TI ths8135"
Linux 4.4.96
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit c6cdd51404b7ac12dd95173ddfc548c59ecf037f upstream.
Marios Titas running a Haskell program noticed a problem with fuse's
readdirplus: when it is interrupted by a signal, it skips one directory
entry.
The reason is that fuse erronously updates ctx->pos after a failed
dir_emit().
The issue originates from the patch adding readdirplus support.
Reported-by: Jakob Unterwurzacher <jakobunt@gmail.com>
Tested-by: Marios Titas <redneb@gmx.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 0b05b18381 ("fuse: implement NFS-like readdirplus support")
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlmUrdAACgkQONu9yGCS
aT49MBAAqKJXpRzIBnzLR45QLRU5jfkUhaCtCckBAbyLE+rUH0lE4L37JHYcZ9jr
79gG06QAuWJaTd4Nug7DocmPiqpPWi+PY46yjUQ1j3tllKWdp7b/PJXvYX3zbK+d
vDgn6T1AyAoCBKa2aLU26SAYmfLCT+jhHzbMaRQ4eAcYE8u8w8jrfngVmnunXVme
u6CkAZpPMXXm5jUpxgPguEOm2WMubPYEF2BMJIVuvYypeJGM0EYbOHNEUMK5jkPv
T17+4EzSCeqGaDZElxPW4NuaHMStZW36g9gQOti3o/8/5shNLyJK3vYzWG+06zfH
6CNElSk7Y3Fl6qALLWfd1dkjImtJvWKDVWTC43woFT/96DtXueGxrYJYRF+px9bq
dBWAW86g5Tp2JTM+6VhN0N/Z5ANK48Oi2NrzqJXK7DrmZbS5mxMIZw239QJnEOBh
hSxDbe9pkNJvSmR+yF+qxkz78XOOvBz4zIkGl6M70cRQWnJ0g4tCSyy2hrEooDzZ
sfaokSdClzt3qRoFwSZIGZLpvRp9vSepXNN/nvUTX3dOLcjproVYMZJWiAUqTUyD
/0gwrJTpDP3nZGrHdmeWL/erQDWP1aFiXlsJ0E87ymSt7KYNYFGH2ePv7Ujov/AH
dlmvQFhSW1v7xiuiiQo9gxIo8djHqZ8FLbTCznQcQ8Scm4cMNAM=
=riD5
-----END PGP SIGNATURE-----
Merge 4.4.83 into android-4.4
Changes in 4.4.83
cpuset: fix a deadlock due to incomplete patching of cpusets_enabled()
mm: ratelimit PFNs busy info message
iscsi-target: fix memory leak in iscsit_setup_text_cmd()
iscsi-target: Fix iscsi_np reset hung task during parallel delete
fuse: initialize the flock flag in fuse_file on allocation
nfs/flexfiles: fix leak of nfs4_ff_ds_version arrays
USB: serial: option: add D-Link DWM-222 device ID
USB: serial: cp210x: add support for Qivicon USB ZigBee dongle
USB: serial: pl2303: add new ATEN device id
usb: musb: fix tx fifo flush handling again
USB: hcd: Mark secondary HCD as dead if the primary one died
staging:iio:resolver:ad2s1210 fix negative IIO_ANGL_VEL read
iio: accel: bmc150: Always restore device to normal mode after suspend-resume
iio: light: tsl2563: use correct event code
uas: Add US_FL_IGNORE_RESIDUE for Initio Corporation INIC-3069
USB: Check for dropped connection before switching to full speed
usb: core: unlink urbs from the tail of the endpoint's urb_list
usb: quirks: Add no-lpm quirk for Moshi USB to Ethernet Adapter
usb:xhci:Add quirk for Certain failing HP keyboard on reset after resume
iio: adc: vf610_adc: Fix VALT selection value for REFSEL bits
pnfs/blocklayout: require 64-bit sector_t
pinctrl: sunxi: add a missing function of A10/A20 pinctrl driver
pinctrl: samsung: Remove bogus irq_[un]mask from resource management
Linux 4.4.83
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit 68227c03cba84a24faf8a7277d2b1a03c8959c2c upstream.
Before the patch, the flock flag could remain uninitialized for the
lifespan of the fuse_file allocation. Unless set to true in
fuse_file_flock(), it would remain in an indeterminate state until read in
an if statement in fuse_release_common(). This could consequently lead to
taking an unexpected branch in the code.
The bug was discovered by a runtime instrumentation designed to detect use
of uninitialized memory in the kernel.
Signed-off-by: Mateusz Jurczyk <mjurczyk@google.com>
Fixes: 37fb3a30b4 ("fuse: fix flock")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 2e38bea99a80eab408adee27f873a188d57b76cb upstream.
fuse_file_put() was missing the "force" flag for the RELEASE request when
sending synchronously (fuseblk).
If this flag is not set, then a sync request may be interrupted before it
is dequeued by the userspace filesystem. In this case the OPEN won't be
balanced with a RELEASE.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 5a18ec176c ("fuse: fix hang of single threaded fuseblk filesystem")
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 6ba4d2722d06960102c981322035239cd66f7316 upstream.
There is a potential race between fuse_dev_do_write()
and request_wait_answer() contexts as shown below:
TASK 1:
__fuse_request_send():
|--spin_lock(&fiq->waitq.lock);
|--queue_request();
|--spin_unlock(&fiq->waitq.lock);
|--request_wait_answer():
|--if (test_bit(FR_SENT, &req->flags))
<gets pre-empted after it is validated true>
TASK 2:
fuse_dev_do_write():
|--clears bit FR_SENT,
|--request_end():
|--sets bit FR_FINISHED
|--spin_lock(&fiq->waitq.lock);
|--list_del_init(&req->intr_entry);
|--spin_unlock(&fiq->waitq.lock);
|--fuse_put_request();
|--queue_interrupt();
<request gets queued to interrupts list>
|--wake_up_locked(&fiq->waitq);
|--wait_event_freezable();
<as FR_FINISHED is set, it returns and then
the caller frees this request>
Now, the next fuse_dev_do_read(), see interrupts list is not empty
and then calls fuse_read_interrupt() which tries to access the request
which is already free'd and gets the below crash:
[11432.401266] Unable to handle kernel paging request at virtual address
6b6b6b6b6b6b6b6b
...
[11432.418518] Kernel BUG at ffffff80083720e0
[11432.456168] PC is at __list_del_entry+0x6c/0xc4
[11432.463573] LR is at fuse_dev_do_read+0x1ac/0x474
...
[11432.679999] [<ffffff80083720e0>] __list_del_entry+0x6c/0xc4
[11432.687794] [<ffffff80082c65e0>] fuse_dev_do_read+0x1ac/0x474
[11432.693180] [<ffffff80082c6b14>] fuse_dev_read+0x6c/0x78
[11432.699082] [<ffffff80081d5638>] __vfs_read+0xc0/0xe8
[11432.704459] [<ffffff80081d5efc>] vfs_read+0x90/0x108
[11432.709406] [<ffffff80081d67f0>] SyS_read+0x58/0x94
As FR_FINISHED bit is set before deleting the intr_entry with input
queue lock in request completion path, do the testing of this flag and
queueing atomically with the same lock in queue_interrupt().
Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: fd22d62ed0 ("fuse: no fc->lock for iqueue parts")
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit a8a86d78d673b1c99fe9b0064739fde9e9774184 upstream.
fuse_abort_conn() moves requests from pending list to a temporary list
before canceling them. This operation races with request_wait_answer()
which also tries to remove the request after it gets a fatal signal. It
checks FR_PENDING flag to determine whether the request is still in the
pending list.
Make fuse_abort_conn() clear FR_PENDING flag so that request_wait_answer()
does not remove the request from temporary list.
This bug causes an Oops when trying to delete an already deleted list entry
in end_requests().
Fixes: ee314a870e ("fuse: abort: no fc->lock needed for request ending")
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 59c3b76cc61d1d676f965c192cc7969aa5cb2744 upstream.
If pos is at the beginning of a page and copied is zero then page is not
zeroed but is marked uptodate.
Fix by skipping everything except unlock/put of page if zero bytes were
copied.
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Fixes: 6b12c1b37e ("fuse: Implement write_begin/write_end callbacks")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit a09f99eddef44035ec764075a37bace8181bec38 upstream.
Fuse allowed VFS to set mode in setattr in order to clear suid/sgid on
chown and truncate, and (since writeback_cache) write. The problem with
this is that it'll potentially restore a stale mode.
The poper fix would be to let the filesystems do the suid/sgid clearing on
the relevant operations. Possibly some are already doing it but there's no
way we can detect this.
So fix this by refreshing and recalculating the mode. Do this only if
ATTR_KILL_S[UG]ID is set to not destroy performance for writes. This is
still racy but the size of the window is reduced.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 5e2b8828ff3d79aca8c3a1730652758753205b61 upstream.
Without "default_permissions" the userspace filesystem's lookup operation
needs to perform the check for search permission on the directory.
If directory does not allow search for everyone (this is quite rare) then
userspace filesystem has to set entry timeout to zero to make sure
permissions are always performed.
Changing the mode bits of the directory should also invalidate the
(previously cached) dentry to make sure the next lookup will have a chance
of updating the timeout, if needed.
Reported-by: Jean-Pierre André <jean-pierre.andre@wanadoo.fr>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit cb3ae6d25a5471be62bfe6ac1fccc0e91edeaba0 upstream.
Make sure userspace filesystem is returning a well formed list of xattr
names (zero or more nonzero length, null terminated strings).
[Michael Theall: only verify in the nonzero size case]
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 8fba54aebbdf1f999738121922e74bf796ad60ee upstream.
When reading from a loop device backed by a fuse file it deadlocks on
lock_page().
This is because the page is already locked by the read() operation done on
the loop device. In this case we don't want to either lock the page or
dirty it.
So do what fs/direct-io.c does: only dirty the page for ITER_IOVEC vectors.
Reported-by: Sheng Yang <sheng@yasker.org>
Fixes: aa4d86163e ("block: loop: switch to VFS ITER_BVEC")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Sheng Yang <sheng@yasker.org>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
Tested-by: Sheng Yang <sheng@yasker.org>
Tested-by: Ashish Samant <ashish.samant@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 9446385f05c9af25fed53dbed3cc75763730be52 upstream.
FUSE_HAS_IOCTL_DIR should be assigned to ->flags, it may be a typo.
Signed-off-by: Wei Fang <fangwei1@huawei.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 69fe05c90e ("fuse: add missing INIT flags")
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 9ebce595f63a407c5cec98f98f9da8459b73740a upstream.
fuse_flush() calls write_inode_now() that triggers writeback, but actual
writeback will happen later, on fuse_sync_writes(). If an error happens,
fuse_writepage_end() will set error bit in mapping->flags. So, we have to
check mapping->flags after fuse_sync_writes().
Signed-off-by: Maxim Patlasov <mpatlasov@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 4d99ff8f12 ("fuse: Turn writeback cache on")
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit ac7f052b9e1534c8248f814b6f0068ad8d4a06d2 upstream.
Due to implementation of fuse writeback filemap_write_and_wait_range() does
not catch errors. We have to do this directly after fuse_sync_writes()
Signed-off-by: Alexey Kuznetsov <kuznet@virtuozzo.com>
Signed-off-by: Maxim Patlasov <mpatlasov@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 4d99ff8f12 ("fuse: Turn writeback cache on")
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Allows FUSE to report to inotify that it is acting
as a layered filesystem. The userspace component
returns a string representing the location of the
underlying file. If the string cannot be resolved
into a path, the top level path is returned instead.
bug: 23904372
Change-Id: Iabdca0bbedfbff59e9c820c58636a68ef9683d9f
Signed-off-by: Daniel Rosenberg <drosen@google.com>
commit 744742d692e37ad5c20630e57d526c8f2e2fe3c9 upstream.
The 'reqs' member of fuse_io_priv serves two purposes. First is to track
the number of oustanding async requests to the server and to signal that
the io request is completed. The second is to be a reference count on the
structure to know when it can be freed.
For sync io requests these purposes can be at odds. fuse_direct_IO() wants
to block until the request is done, and since the signal is sent when
'reqs' reaches 0 it cannot keep a reference to the object. Yet it needs to
use the object after the userspace server has completed processing
requests. This leads to some handshaking and special casing that it
needlessly complicated and responsible for at least one race condition.
It's much cleaner and safer to maintain a separate reference count for the
object lifecycle and to let 'reqs' just be a count of outstanding requests
to the userspace server. Then we can know for sure when it is safe to free
the object without any handshaking or special cases.
The catch here is that most of the time these objects are stack allocated
and should not be freed. Initializing these objects with a single reference
that is never released prevents accidental attempts to free the objects.
Fixes: 9d5722b777 ("fuse: handle synchronous iocbs internally")
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 7cabc61e01a0a8b663bd2b4c982aa53048218734 upstream.
There's a race in fuse_direct_IO(), whereby is_sync_kiocb() is called on an
iocb that could have been freed if async io has already completed. The fix
in this case is simple and obvious: cache the result before starting io.
It was discovered by KASan:
kernel: ==================================================================
kernel: BUG: KASan: use after free in fuse_direct_IO+0xb1a/0xcc0 at addr ffff88036c414390
Signed-off-by: Robert Doebbelin <robert@quobyte.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: bcba24ccdc ("fuse: enable asynchronous processing direct IO")
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Suspend attempts can abort when the FUSE daemon is already frozen
and a client is waiting uninterruptibly for a response, causing
freezing of tasks to fail.
Use the freeze-friendly wait API, but disregard other signals.
Change-Id: Icefb7e4bbc718ccb76bf3c04daaa5eeea7e0e63c
Signed-off-by: Todd Poynor <toddpoynor@google.com>
I got a report about unkillable task eating CPU. Further
investigation shows, that the problem is in the fuse_fill_write_pages()
function. If iov's first segment has zero length, we get an infinite
loop, because we never reach iov_iter_advance() call.
Fix this by calling iov_iter_advance() before repeating an attempt to
copy data from userspace.
A similar problem is described in 124d3b7041 ("fix writev regression:
pan hanging unkillable and un-straceable"). If zero-length segmend
is followed by segment with invalid address,
iov_iter_fault_in_readable() checks only first segment (zero-length),
iov_iter_copy_from_user_atomic() skips it, fails at second and
returns zero -> goto again without skipping zero-length segment.
Patch calls iov_iter_advance() before goto again: we'll skip zero-length
segment at second iteraction and iov_iter_fault_in_readable() will detect
invalid address.
Special thanks to Konstantin Khlebnikov, who helped a lot with the commit
description.
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Maxim Patlasov <mpatlasov@parallels.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Roman Gushchin <klamm@yandex-team.ru>
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Fixes: ea9b9907b8 ("fuse: implement perform_write")
Cc: <stable@vger.kernel.org>
The problem is that fuse_dev_alloc() acquires an extra reference to cc.fc,
and the original ref count is never dropped.
Reported-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Fixes: cc080e9e9b ("fuse: introduce per-instance fuse_dev structure")
Cc: <stable@vger.kernel.org> # v4.2+
Instead of having users check for FL_POSIX or FL_FLOCK to call the correct
locks API function, use the check within locks_lock_inode_wait(). This
allows for some later cleanup.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
fuse_dev_ioctl() performed fuse_get_dev() on a user-supplied fd,
leading to a type confusion issue. Fix it by checking file->f_op.
Signed-off-by: Jann Horn <jann@thejh.net>
Acked-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull more vfs updates from Al Viro:
"Assorted VFS fixes and related cleanups (IMO the most interesting in
that part are f_path-related things and Eric's descriptor-related
stuff). UFS regression fixes (it got broken last cycle). 9P fixes.
fs-cache series, DAX patches, Jan's file_remove_suid() work"
[ I'd say this is much more than "fixes and related cleanups". The
file_table locking rule change by Eric Dumazet is a rather big and
fundamental update even if the patch isn't huge. - Linus ]
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (49 commits)
9p: cope with bogus responses from server in p9_client_{read,write}
p9_client_write(): avoid double p9_free_req()
9p: forgetting to cancel request on interrupted zero-copy RPC
dax: bdev_direct_access() may sleep
block: Add support for DAX reads/writes to block devices
dax: Use copy_from_iter_nocache
dax: Add block size note to documentation
fs/file.c: __fget() and dup2() atomicity rules
fs/file.c: don't acquire files->file_lock in fd_install()
fs:super:get_anon_bdev: fix race condition could cause dev exceed its upper limitation
vfs: avoid creation of inode number 0 in get_next_ino
namei: make set_root_rcu() return void
make simple_positive() public
ufs: use dir_pages instead of ufs_dir_pages()
pagemap.h: move dir_pages() over there
remove the pointless include of lglock.h
fs: cleanup slight list_entry abuse
xfs: Correctly lock inode when removing suid and file capabilities
fs: Call security_ops->inode_killpriv on truncate
fs: Provide function telling whether file_remove_privs() will do anything
...
Pull user namespace updates from Eric Biederman:
"Long ago and far away when user namespaces where young it was realized
that allowing fresh mounts of proc and sysfs with only user namespace
permissions could violate the basic rule that only root gets to decide
if proc or sysfs should be mounted at all.
Some hacks were put in place to reduce the worst of the damage could
be done, and the common sense rule was adopted that fresh mounts of
proc and sysfs should allow no more than bind mounts of proc and
sysfs. Unfortunately that rule has not been fully enforced.
There are two kinds of gaps in that enforcement. Only filesystems
mounted on empty directories of proc and sysfs should be ignored but
the test for empty directories was insufficient. So in my tree
directories on proc, sysctl and sysfs that will always be empty are
created specially. Every other technique is imperfect as an ordinary
directory can have entries added even after a readdir returns and
shows that the directory is empty. Special creation of directories
for mount points makes the code in the kernel a smidge clearer about
it's purpose. I asked container developers from the various container
projects to help test this and no holes were found in the set of mount
points on proc and sysfs that are created specially.
This set of changes also starts enforcing the mount flags of fresh
mounts of proc and sysfs are consistent with the existing mount of
proc and sysfs. I expected this to be the boring part of the work but
unfortunately unprivileged userspace winds up mounting fresh copies of
proc and sysfs with noexec and nosuid clear when root set those flags
on the previous mount of proc and sysfs. So for now only the atime,
read-only and nodev attributes which userspace happens to keep
consistent are enforced. Dealing with the noexec and nosuid
attributes remains for another time.
This set of changes also addresses an issue with how open file
descriptors from /proc/<pid>/ns/* are displayed. Recently readlink of
/proc/<pid>/fd has been triggering a WARN_ON that has not been
meaningful since it was added (as all of the code in the kernel was
converted) and is not now actively wrong.
There is also a short list of issues that have not been fixed yet that
I will mention briefly.
It is possible to rename a directory from below to above a bind mount.
At which point any directory pointers below the renamed directory can
be walked up to the root directory of the filesystem. With user
namespaces enabled a bind mount of the bind mount can be created
allowing the user to pick a directory whose children they can rename
to outside of the bind mount. This is challenging to fix and doubly
so because all obvious solutions must touch code that is in the
performance part of pathname resolution.
As mentioned above there is also a question of how to ensure that
developers by accident or with purpose do not introduce exectuable
files on sysfs and proc and in doing so introduce security regressions
in the current userspace that will not be immediately obvious and as
such are likely to require breaking userspace in painful ways once
they are recognized"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
vfs: Remove incorrect debugging WARN in prepend_path
mnt: Update fs_fully_visible to test for permanently empty directories
sysfs: Create mountpoints with sysfs_create_mount_point
sysfs: Add support for permanently empty directories to serve as mount points.
kernfs: Add support for always empty directories.
proc: Allow creating permanently empty directories that serve as mount points
sysctl: Allow creating permanently empty directories that serve as mountpoints.
fs: Add helper functions for permanently empty directories.
vfs: Ignore unlocked mounts in fs_fully_visible
mnt: Modify fs_fully_visible to deal with locked ro nodev and atime
mnt: Refactor the logic for mounting sysfs and proc in a user namespace
Pull fuse updates from Miklos Szeredi:
"This is the start of improving fuse scalability.
An input queue and a processing queue is split out from the monolithic
fuse connection, each of those having their own spinlock. The end of
the patchset adds the ability to clone a fuse connection. This means,
that instead of having to read/write requests/answers on a single fuse
device fd, the fuse daemon can have multiple distinct file descriptors
open. Each of those can be used to receive requests and send answers,
currently the only constraint is that a request must be answered on
the same fd as it was read from.
This can be extended further to allow binding a device clone to a
specific CPU or NUMA node.
Based on a patchset by Srinivas Eeda and Ashish Samant. Thanks to
Ashish for the review of this series"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: (40 commits)
fuse: update MAINTAINERS entry
fuse: separate pqueue for clones
fuse: introduce per-instance fuse_dev structure
fuse: device fd clone
fuse: abort: no fc->lock needed for request ending
fuse: no fc->lock for pqueue parts
fuse: no fc->lock in request_end()
fuse: cleanup request_end()
fuse: request_end(): do once
fuse: add req flag for private list
fuse: pqueue locking
fuse: abort: group pqueue accesses
fuse: cleanup fuse_dev_do_read()
fuse: move list_del_init() from request_end() into callers
fuse: duplicate ->connected in pqueue
fuse: separate out processing queue
fuse: simplify request_wait()
fuse: no fc->lock for iqueue parts
fuse: allow interrupt queuing without fc->lock
fuse: iqueue locking
...
This allows for better documentation in the code and
it allows for a simpler and fully correct version of
fs_fully_visible to be written.
The mount points converted and their filesystems are:
/sys/hypervisor/s390/ s390_hypfs
/sys/kernel/config/ configfs
/sys/kernel/debug/ debugfs
/sys/firmware/efi/efivars/ efivarfs
/sys/fs/fuse/connections/ fusectl
/sys/fs/pstore/ pstore
/sys/kernel/tracing/ tracefs
/sys/fs/cgroup/ cgroup
/sys/kernel/security/ securityfs
/sys/fs/selinux/ selinuxfs
/sys/fs/smackfs/ smackfs
Cc: stable@vger.kernel.org
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Make each fuse device clone refer to a separate processing queue. The only
constraint on userspace code is that the request answer must be written to
the same device clone as it was read off.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Allow fuse device clones to refer to be distinguished. This patch just
adds the infrastructure by associating a separate "struct fuse_dev" with
each clone.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
Allow an open fuse device to be "cloned". Userspace can create a clone by:
newfd = open("/dev/fuse", O_RDWR)
ioctl(newfd, FUSE_DEV_IOC_CLONE, &oldfd);
At this point newfd will refer to the same fuse connection as oldfd.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
In fuse_abort_conn() when all requests are on private lists we no longer
need fc->lock protection.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
No longer need to call request_end() with the connection lock held. We
still protect the background counters and queue with fc->lock, so acquire
it if necessary.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
Now that we atomically test having already done everything we no longer
need other protection.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
When the connection is aborted it is possible that request_end() will be
called twice. Use atomic test and set to do the actual ending only once.
test_and_set_bit() also provides the necessary barrier semantics so no
explicit smp_wmb() is necessary.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
When an unlocked request is aborted, it is moved from fpq->io to a private
list. Then, after unlocking fpq->lock, the private list is processed and
the requests are finished off.
To protect the private list, we need to mark the request with a flag, so if
in the meantime the request is unlocked the list is not corrupted.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
Add a fpq->lock for protecting members of struct fuse_pqueue and FR_LOCKED
request flag.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
- locked list_add() + list_del_init() cancel out
- common handling of case when request is ended here in the read phase
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>