Commit graph

18 commits

Author SHA1 Message Date
Greg Kroah-Hartman
79f138ac8c This is the 4.4.107 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlo6J9YACgkQONu9yGCS
 aT7Cdg/8D7+btjAJPjk/suKUBSOpfSkYIoakaVSGf7r7gFv7SkMF023SikLUK+vN
 xa0FL1bYASzXuKgcxY9vB7ZkCDShrglqTCIpbWwHJhwS0fRGTGrMN2MM+opVgeoG
 4ngnEPLue5TqZs3LVrpTySQFODlxnY3C4lpKopN7QNrcr1M5iiMELXCJu/qy6JhC
 ZBsRGUY8GHbouqC0YSqNlrv+C7zbfAlaawIBDSmYm0R4F+TuqoKZlBGAJ9lbALcZ
 pM8OaOXx9v471RhE7Tcsl3Eiz3vKHFKWxG/ZujkSqB21wPq6gd4VuP/wMuelX0GC
 rDTb/nn9Zhmv7UCOn62htlRLrAnSaJ9FlEK+u3TJ+XBGE9gmanH9IjIljCehEZeI
 55Vm7q6IwQT2WvgTzqUoco4AYI37T9pqJ++I1E3jY/zk+bfCIQ1ZMpMXmwAUx738
 m7boO38eRnyXMxqf4hfVQ4BFPkwaxdW/I3LDanE6U85Hw2nI2uZIPHRbVrC6gAPS
 aY9EUFEancxu4mW92mWWKrEnEWs5Jsb313ISAKSU75WwWmZ/tgsUtSzvY7XNPwYr
 G//HbPA5zNFdM1zSO4HQiLYjOqAiwNNAbDEWKoYr8MQpgYbTB4/SgT15mJzw6uTo
 WtKZsIWMYiXVW8cNhQiWAVUJVf66GMlc6kHcyHU7YHH3obZAYXU=
 =DNsf
 -----END PGP SIGNATURE-----

Merge 4.4.107 into android-4.4

Changes in 4.4.107
	crypto: hmac - require that the underlying hash algorithm is unkeyed
	crypto: salsa20 - fix blkcipher_walk API usage
	autofs: fix careless error in recent commit
	tracing: Allocate mask_str buffer dynamically
	USB: uas and storage: Add US_FL_BROKEN_FUA for another JMicron JMS567 ID
	USB: core: prevent malicious bNumInterfaces overflow
	usbip: fix stub_send_ret_submit() vulnerability to null transfer_buffer
	ceph: drop negative child dentries before try pruning inode's alias
	Bluetooth: btusb: driver to enable the usb-wakeup feature
	xhci: Don't add a virt_dev to the devs array before it's fully allocated
	sched/rt: Do not pull from current CPU if only one CPU to pull
	dmaengine: dmatest: move callback wait queue to thread context
	ext4: fix fdatasync(2) after fallocate(2) operation
	ext4: fix crash when a directory's i_size is too small
	KEYS: add missing permission check for request_key() destination
	mac80211: Fix addition of mesh configuration element
	usb: phy: isp1301: Add OF device ID table
	md-cluster: free md_cluster_info if node leave cluster
	userfaultfd: shmem: __do_fault requires VM_FAULT_NOPAGE
	userfaultfd: selftest: vm: allow to build in vm/ directory
	net: initialize msg.msg_flags in recvfrom
	net: bcmgenet: correct the RBUF_OVFL_CNT and RBUF_ERR_CNT MIB values
	net: bcmgenet: correct MIB access of UniMAC RUNT counters
	net: bcmgenet: reserved phy revisions must be checked first
	net: bcmgenet: power down internal phy if open or resume fails
	net: bcmgenet: Power up the internal PHY before probing the MII
	NFSD: fix nfsd_minorversion(.., NFSD_AVAIL)
	NFSD: fix nfsd_reset_versions for NFSv4.
	Input: i8042 - add TUXEDO BU1406 (N24_25BU) to the nomux list
	drm/omap: fix dmabuf mmap for dma_alloc'ed buffers
	netfilter: bridge: honor frag_max_size when refragmenting
	writeback: fix memory leak in wb_queue_work()
	net: wimax/i2400m: fix NULL-deref at probe
	dmaengine: Fix array index out of bounds warning in __get_unmap_pool()
	net: Resend IGMP memberships upon peer notification.
	mlxsw: reg: Fix SPVM max record count
	mlxsw: reg: Fix SPVMLR max record count
	intel_th: pci: Add Gemini Lake support
	openrisc: fix issue handling 8 byte get_user calls
	scsi: hpsa: update check for logical volume status
	scsi: hpsa: limit outstanding rescans
	fjes: Fix wrong netdevice feature flags
	drm/radeon/si: add dpm quirk for Oland
	sched/deadline: Make sure the replenishment timer fires in the next period
	sched/deadline: Throttle a constrained deadline task activated after the deadline
	sched/deadline: Use deadline instead of period when calculating overflow
	mmc: mediatek: Fixed bug where clock frequency could be set wrong
	drm/radeon: reinstate oland workaround for sclk
	afs: Fix missing put_page()
	afs: Populate group ID from vnode status
	afs: Adjust mode bits processing
	afs: Flush outstanding writes when an fd is closed
	afs: Migrate vlocation fields to 64-bit
	afs: Prevent callback expiry timer overflow
	afs: Fix the maths in afs_fs_store_data()
	afs: Populate and use client modification time
	afs: Fix page leak in afs_write_begin()
	afs: Fix afs_kill_pages()
	net/mlx4_core: Avoid delays during VF driver device shutdown
	perf symbols: Fix symbols__fixup_end heuristic for corner cases
	efi/esrt: Cleanup bad memory map log messages
	NFSv4.1 respect server's max size in CREATE_SESSION
	btrfs: add missing memset while reading compressed inline extents
	target: Use system workqueue for ALUA transitions
	target: fix ALUA transition timeout handling
	target: fix race during implicit transition work flushes
	sfc: don't warn on successful change of MAC
	fbdev: controlfb: Add missing modes to fix out of bounds access
	video: udlfb: Fix read EDID timeout
	video: fbdev: au1200fb: Release some resources if a memory allocation fails
	video: fbdev: au1200fb: Return an error code if a memory allocation fails
	rtc: pcf8563: fix output clock rate
	dmaengine: ti-dma-crossbar: Correct am335x/am43xx mux value type
	PCI/PME: Handle invalid data when reading Root Status
	powerpc/powernv/cpufreq: Fix the frequency read by /proc/cpuinfo
	netfilter: ipvs: Fix inappropriate output of procfs
	powerpc/opal: Fix EBUSY bug in acquiring tokens
	powerpc/ipic: Fix status get and status clear
	target/iscsi: Fix a race condition in iscsit_add_reject_from_cmd()
	iscsi-target: fix memory leak in lio_target_tiqn_addtpg()
	target:fix condition return in core_pr_dump_initiator_port()
	target/file: Do not return error for UNMAP if length is zero
	arm-ccn: perf: Prevent module unload while PMU is in use
	crypto: tcrypt - fix buffer lengths in test_aead_speed()
	mm: Handle 0 flags in _calc_vm_trans() macro
	clk: mediatek: add the option for determining PLL source clock
	clk: imx6: refine hdmi_isfr's parent to make HDMI work on i.MX6 SoCs w/o VPU
	clk: tegra: Fix cclk_lp divisor register
	ppp: Destroy the mutex when cleanup
	thermal/drivers/step_wise: Fix temperature regulation misbehavior
	GFS2: Take inode off order_write list when setting jdata flag
	bcache: explicitly destroy mutex while exiting
	bcache: fix wrong cache_misses statistics
	l2tp: cleanup l2tp_tunnel_delete calls
	xfs: fix log block underflow during recovery cycle verification
	xfs: fix incorrect extent state in xfs_bmap_add_extent_unwritten_real
	PCI: Detach driver before procfs & sysfs teardown on device remove
	scsi: hpsa: cleanup sas_phy structures in sysfs when unloading
	scsi: hpsa: destroy sas transport properties before scsi_host
	powerpc/perf/hv-24x7: Fix incorrect comparison in memord
	tty fix oops when rmmod 8250
	usb: musb: da8xx: fix babble condition handling
	pinctrl: adi2: Fix Kconfig build problem
	raid5: Set R5_Expanded on parity devices as well as data.
	scsi: scsi_devinfo: Add REPORTLUN2 to EMC SYMMETRIX blacklist entry
	vt6655: Fix a possible sleep-in-atomic bug in vt6655_suspend
	scsi: sd: change manage_start_stop to bool in sysfs interface
	scsi: sd: change allow_restart to bool in sysfs interface
	scsi: bfa: integer overflow in debugfs
	udf: Avoid overflow when session starts at large offset
	macvlan: Only deliver one copy of the frame to the macvlan interface
	RDMA/cma: Avoid triggering undefined behavior
	IB/ipoib: Grab rtnl lock on heavy flush when calling ndo_open/stop
	ath9k: fix tx99 potential info leak
	Linux 4.4.107

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2017-12-20 10:49:07 +01:00
Andrea Arcangeli
ee9be99630 userfaultfd: shmem: __do_fault requires VM_FAULT_NOPAGE
[ Upstream commit 6bbc4a4144b1a69743022ac68dfaf6e7d993abb9 ]

__do_fault assumes vmf->page has been initialized and is valid if
VM_FAULT_NOPAGE is not returned by vma->vm_ops->fault(vma, vmf).

handle_userfault() in turn should return VM_FAULT_NOPAGE if it doesn't
return VM_FAULT_SIGBUS or VM_FAULT_RETRY (the other two possibilities).

This VM_FAULT_NOPAGE case is only invoked when signal are pending and it
didn't matter for anonymous memory before.  It only started to matter
since shmem was introduced.  hugetlbfs also takes a different path and
doesn't exercise __do_fault.

Link: http://lkml.kernel.org/r/20170228154201.GH5816@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-20 10:04:53 +01:00
Dmitry Shmidt
b558f17a13 This is the 4.4.16 stable release
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJXmOXmAAoJEDjbvchgkmk+QYIP/1S8oBZsvjfDzvH8t63HyLeH
 i43MFlYoFAqUIZc002XpluSvZ8uHoG+r7R8Hq3wmv48wxe3M6OBnMdBVTht6mPw+
 t5OLTZr40lWaJm2EIi4aekueMIrCgmL+Et+IFYv7ZVBuYLteVcfny+zdq4EqGmgj
 /a19+L/sTTr4SHtJIhHxWhiVJ9fVMgQk/N3VgQmIiNF2+lVbiFI7QQiDPLbFl0KK
 CM4ETO22HxHCYilGpzhpSMsHCxv12VqNaXNLAsPAepGGW7PqvUmrEWAqgwsbOfRc
 GxTLNk0dUgJqMrfEpQ8ZOMlgzvCAYG2jZuNSuT+nuzrWSUP+WOGRi9TTTxp1CYuZ
 PHlhNTH7ZnqosxJUUZS2d9N5ygpqD48Rhlfl824YzOWCy94VeUnedkVLb20uJwPF
 Y5aQ5WjktBC9why5e4OgGQERvx/U9KTk8E1zRfZZPc2oft9My0YxuemjjKAKZiYN
 ne4WhXbgOJTQkAoZwh2xqny3bWyEaoSrWpQ3R7bBJ9SIRLEOdCKzKpduDbAnbMP7
 QWgQOQC/6qA1mKqjrqF4KPA1Quo9PcUK2Ajh523ewMGCowgY90vyejAgh4Q8g0GC
 fKlx+jJDoKVDbQ8v4hc9PPHMsNNIKT9a1ptwVS3lE+bq1D5Ffm57A4/uOTMYHVab
 gKqu8h1CA0MCVBsH3nNA
 =nY8S
 -----END PGP SIGNATURE-----

Merge tag 'v4.4.16' into android-4.4.y

This is the 4.4.16 stable release

Change-Id: Ibaf7b7e03695e1acebc654a2ca1a4bfcc48fcea4
2016-08-01 15:57:55 -07:00
Linus Torvalds
0b7f12be0d userfaultfd: don't block on the last VM updates at exit time
commit 39680f50ae54cbbb6e72ac38b8329dd3eb9105f4 upstream.

The exit path will do some final updates to the VM of an exiting process
to inform others of the fact that the process is going away.

That happens, for example, for robust futex state cleanup, but also if
the parent has asked for a TID update when the process exits (we clear
the child tid field in user space).

However, at the time we do those final VM accesses, we've already
stopped accepting signals, so the usual "stop waiting for userfaults on
signal" code in fs/userfaultfd.c no longer works, and the process can
become an unkillable zombie waiting for something that will never
happen.

To solve this, just make handle_userfault() abort any user fault
handling if we're already in the exit path past the signal handling
state being dead (marked by PF_EXITING).

This VM special case is pretty ugly, and it is possible that we should
look at finalizing signals later (or move the VM final accesses
earlier).  But in the meantime this is a fairly minimally intrusive fix.

Reported-and-tested-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Sedat Dilek <sedat.dilek@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-16 08:43:01 -07:00
Dmitry Shmidt
c30578d48a userfaultfd: Add missing vma_merge parameter
Signed-off-by: Dmitry Shmidt <dimitrysh@google.com>
2016-02-16 13:54:16 -08:00
Andrea Arcangeli
ac5be6b47e userfaultfd: revert "userfaultfd: waitqueue: add nr wake parameter to __wake_up_locked_key"
This reverts commit 51360155ec and adapts
fs/userfaultfd.c to use the old version of that function.

It didn't look robust to call __wake_up_common with "nr == 1" when we
absolutely require wakeall semantics, but we've full control of what we
insert in the two waitqueue heads of the blocked userfaults.  No
exclusive waitqueue risks to be inserted into those two waitqueue heads
so we can as well stick to "nr == 1" of the old code and we can rely
purely on the fact no waitqueue inserted in one of the two waitqueue
heads we must enforce as wakeall, has wait->flags WQ_FLAG_EXCLUSIVE set.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Shuah Khan <shuahkh@osg.samsung.com>
Cc: Thierry Reding <treding@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-22 15:09:53 -07:00
Eric Biggers
c03e946fdd userfaultfd: add missing mmput() in error path
This fixes a memleak if anon_inode_getfile() fails in userfaultfd().

Signed-off-by: Eric Biggers <ebiggers3@gmail.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-17 21:16:07 -07:00
Andrea Arcangeli
2c5b7e1be7 userfaultfd: avoid missing wakeups during refile in userfaultfd_read
During the refile in userfaultfd_read both waitqueues could look empty to
the lockless wake_userfault().  Use a seqcount to prevent this false
negative that could leave an userfault blocked.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04 16:54:41 -07:00
Andrea Arcangeli
dfa37dc3fc userfaultfd: allow signals to interrupt a userfault
This is only simple to achieve if the userfault is going to return to
userland (not to the kernel) because we can avoid returning VM_FAULT_RETRY
despite we temporarily released the mmap_sem.  The fault would just be
retried by userland then.  This is safe at least on x86 and powerpc (the
two archs with the syscall implemented so far).

Hint to verify for which archs this is safe: after handle_mm_fault
returns, no access to data structures protected by the mmap_sem must be
done by the fault code in arch/*/mm/fault.c until up_read(&mm->mmap_sem)
is called.

This has two main benefits: signals can run with lower latency in
production (signals aren't blocked by userfaults and userfaults are
immediately repeated after signal processing) and gdb can then trivially
debug the threads blocked in this kind of userfaults coming directly from
userland.

On a side note: while gdb has a need to get signal processed, coredumps
always worked perfectly with userfaults, no matter if the userfault is
triggered by GUP a kernel copy_user or directly from userland.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04 16:54:41 -07:00
Andrea Arcangeli
e6485a47b7 userfaultfd: require UFFDIO_API before other ioctls
UFFDIO_API was already forced before read/poll could work.  This makes the
code more strict to force it also for all other ioctls.

All users would already have been required to call UFFDIO_API before
invoking other ioctls but this makes it more explicit.

This will ensure we can change all ioctls (all but UFFDIO_API/struct
uffdio_api) with a bump of uffdio_api.api.

There's no actual plan or need to change the API or the ioctl, the current
API already should cover fine even the non cooperative usage, but this is
just for the longer term future just in case.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04 16:54:41 -07:00
Andrea Arcangeli
ad465cae96 userfaultfd: UFFDIO_COPY and UFFDIO_ZEROPAGE
These two ioctl allows to either atomically copy or to map zeropages
into the virtual address space. This is used by the thread that opened
the userfaultfd to resolve the userfaults.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04 16:54:41 -07:00
Andrea Arcangeli
8d2afd96c2 userfaultfd: solve the race between UFFDIO_COPY|ZEROPAGE and read
Solve in-kernel the race between UFFDIO_COPY|ZEROPAGE and
userfaultfd_read if they are run on different threads simultaneously.

Until now qemu solved the race in userland: the race was explicitly
and intentionally left for userland to solve. However we can also
solve it in kernel.

Requiring all users to solve this race if they use two threads (one
for the background transfer and one for the userfault reads) isn't
very attractive from an API prospective, furthermore this allows to
remove a whole bunch of mutex and bitmap code from qemu, making it
faster. The cost of __get_user_pages_fast should be insignificant
considering it scales perfectly and the pagetables are already hot in
the CPU cache, compared to the overhead in userland to maintain those
structures.

Applying this patch is backwards compatible with respect to the
userfaultfd userland API, however reverting this change wouldn't be
backwards compatible anymore.

Without this patch qemu in the background transfer thread, has to read
the old state, and do UFFDIO_WAKE if old_state is missing but it
become REQUESTED by the time it tries to set it to RECEIVED (signaling
the other side received an userfault).

    vcpu                background_thr userfault_thr
    -----               -----          -----
    vcpu0 handle_mm_fault()

                        postcopy_place_page
                        read old_state -> MISSING
                        UFFDIO_COPY 0x7fb76a139000 (no wakeup, still pending)

    vcpu0 fault at 0x7fb76a139000 enters handle_userfault
    poll() is kicked

                                        poll() -> POLLIN
                                        read() -> 0x7fb76a139000
                                        postcopy_pmi_change_state(MISSING, REQUESTED) -> REQUESTED

                        tmp_state = postcopy_pmi_change_state(old_state, RECEIVED) -> REQUESTED
                        /* check that no userfault raced with UFFDIO_COPY */
                        if (old_state == MISSING && tmp_state == REQUESTED)
                                UFFDIO_WAKE from background thread

And a second case where a UFFDIO_WAKE would be needed is in the userfault thread:

    vcpu                background_thr userfault_thr
    -----               -----          -----
    vcpu0 handle_mm_fault()

                        postcopy_place_page
                        read old_state -> MISSING
                        UFFDIO_COPY 0x7fb76a139000 (no wakeup, still pending)
                        tmp_state = postcopy_pmi_change_state(old_state, RECEIVED) -> RECEIVED

    vcpu0 fault at 0x7fb76a139000 enters handle_userfault
    poll() is kicked

                                        poll() -> POLLIN
                                        read() -> 0x7fb76a139000

                                        if (postcopy_pmi_change_state(MISSING, REQUESTED) == RECEIVED)
                                                UFFDIO_WAKE from userfault thread

This patch removes the need of both UFFDIO_WAKE and of the associated
per-page tristate as well.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04 16:54:41 -07:00
Andrea Arcangeli
3004ec9cab userfaultfd: allocate the userfaultfd_ctx cacheline aligned
Use proper slab to guarantee alignment.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04 16:54:41 -07:00
Andrea Arcangeli
15b726ef04 userfaultfd: optimize read() and poll() to be O(1)
This makes read O(1) and poll that was already O(1) becomes lockless.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04 16:54:41 -07:00
Andrea Arcangeli
ba85c702e4 userfaultfd: wake pending userfaults
This is an optimization but it's a userland visible one and it affects
the API.

The downside of this optimization is that if you call poll() and you
get POLLIN, read(ufd) may still return -EAGAIN. The blocked userfault
may be waken by a different thread, before read(ufd) comes
around. This in short means that poll() isn't really usable if the
userfaultfd is opened in blocking mode.

userfaults won't wait in "pending" state to be read anymore and any
UFFDIO_WAKE or similar operations that has the objective of waking
userfaults after their resolution, will wake all blocked userfaults
for the resolved range, including those that haven't been read() by
userland yet.

The behavior of poll() becomes not standard, but this obviates the
need of "spurious" UFFDIO_WAKE and it lets the userland threads to
restart immediately without requiring an UFFDIO_WAKE. This is even
more significant in case of repeated faults on the same address from
multiple threads.

This optimization is justified by the measurement that the number of
spurious UFFDIO_WAKE accounts for 5% and 10% of the total
userfaults for heavy workloads, so it's worth optimizing those away.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04 16:54:41 -07:00
Andrea Arcangeli
a9b85f9415 userfaultfd: change the read API to return a uffd_msg
I had requests to return the full address (not the page aligned one) to
userland.

It's not entirely clear how the page offset could be relevant because
userfaults aren't like SIGBUS that can sigjump to a different place and it
actually skip resolving the fault depending on a page offset.  There's
currently no real way to skip the fault especially because after a
UFFDIO_COPY|ZEROPAGE, the fault is optimized to be retried within the
kernel without having to return to userland first (not even self modifying
code replacing the .text that touched the faulting address would prevent
the fault to be repeated).  Userland cannot skip repeating the fault even
more so if the fault was triggered by a KVM secondary page fault or any
get_user_pages or any copy-user inside some syscall which will return to
kernel code.  The second time FAULT_FLAG_RETRY_NOWAIT won't be set leading
to a SIGBUS being raised because the userfault can't wait if it cannot
release the mmap_map first (and FAULT_FLAG_RETRY_NOWAIT is required for
that).

Still returning userland a proper structure during the read() on the uffd,
can allow to use the current UFFD_API for the future non-cooperative
extensions too and it looks cleaner as well.  Once we get additional
fields there's no point to return the fault address page aligned anymore
to reuse the bits below PAGE_SHIFT.

The only downside is that the read() syscall will read 32bytes instead of
8bytes but that's not going to be measurable overhead.

The total number of new events that can be extended or of new future bits
for already shipped events, is limited to 64 by the features field of the
uffdio_api structure.  If more will be needed a bump of UFFD_API will be
required.

[akpm@linux-foundation.org: use __packed]
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04 16:54:41 -07:00
Pavel Emelyanov
3f602d2724 userfaultfd: Rename uffd_api.bits into .features
This is (seems to be) the minimal thing that is required to unblock
standard uffd usage from the non-cooperative one.  Now more bits can be
added to the features field indicating e.g.  UFFD_FEATURE_FORK and others
needed for the latter use-case.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04 16:54:41 -07:00
Andrea Arcangeli
86039bd3b4 userfaultfd: add new syscall to provide memory externalization
Once an userfaultfd has been created and certain region of the process
virtual address space have been registered into it, the thread responsible
for doing the memory externalization can manage the page faults in
userland by talking to the kernel using the userfaultfd protocol.

poll() can be used to know when there are new pending userfaults to be
read (POLLIN).

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04 16:54:41 -07:00