* refs/heads/tmp-7af10f2
Linux 4.4.178
stm class: Hide STM-specific options if STM is disabled
coresight: removing bind/unbind options from sysfs
arm64: support keyctl() system call in 32-bit mode
Revert "USB: core: only clean up what we allocated"
xhci: Fix port resume done detection for SS ports with LPM enabled
KVM: Reject device ioctls from processes other than the VM's creator
x86/smp: Enforce CONFIG_HOTPLUG_CPU when SMP=y
perf intel-pt: Fix TSC slip
gpio: adnp: Fix testing wrong value in adnp_gpio_direction_input
fs/proc/proc_sysctl.c: fix NULL pointer dereference in put_links
Disable kgdboc failed by echo space to /sys/module/kgdboc/parameters/kgdboc
USB: serial: option: add Olicard 600
USB: serial: option: set driver_info for SIM5218 and compatibles
USB: serial: mos7720: fix mos_parport refcount imbalance on error path
USB: serial: ftdi_sio: add additional NovaTech products
USB: serial: cp210x: add new device id
serial: sh-sci: Fix setting SCSCR_TIE while transferring data
serial: max310x: Fix to avoid potential NULL pointer dereference
staging: vt6655: Fix interrupt race condition on device start up.
staging: vt6655: Remove vif check from vnt_interrupt
tty: atmel_serial: fix a potential NULL pointer dereference
scsi: zfcp: fix scsi_eh host reset with port_forced ERP for non-NPIV FCP devices
scsi: zfcp: fix rport unblock if deleted SCSI devices on Scsi_Host
scsi: sd: Fix a race between closing an sd device and sd I/O
ALSA: pcm: Don't suspend stream in unrecoverable PCM state
ALSA: pcm: Fix possible OOB access in PCM oss plugins
ALSA: seq: oss: Fix Spectre v1 vulnerability
ALSA: rawmidi: Fix potential Spectre v1 vulnerability
ALSA: compress: add support for 32bit calls in a 64bit kernel
ARM: imx6q: cpuidle: fix bug that CPU might not wake up at expected time
btrfs: raid56: properly unmap parity page in finish_parity_scrub()
btrfs: remove WARN_ON in log_dir_items
mac8390: Fix mmio access size probe
sctp: get sctphdr by offset in sctp_compute_cksum
vxlan: Don't call gro_cells_destroy() before device is unregistered
tcp: do not use ipv6 header for ipv4 flow
packets: Always register packet sk in the same order
Add hlist_add_tail_rcu() (Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net)
net: rose: fix a possible stack overflow
net/packet: Set __GFP_NOWARN upon allocation in alloc_pg_vec
mISDN: hfcpci: Test both vendor & device ID for Digium HFC4S
dccp: do not use ipv6 header for ipv4 flow
stmmac: copy unicast mac address to MAC registers
cfg80211: size various nl80211 messages correctly
mmc: mmc: fix switch timeout issue caused by jiffies precision
arm64: kconfig: drop CONFIG_RTC_LIB dependency
video: fbdev: Set pixclock = 0 in goldfishfb
cpu/hotplug: Handle unbalanced hotplug enable/disable
usb: gadget: rndis: free response queue during REMOTE_NDIS_RESET_MSG
usb: gadget: configfs: add mutex lock before unregister gadget
ipv6: fix endianness error in icmpv6_err
stm class: Fix stm device initialization order
stm class: Do not leak the chrdev in error path
PM / Hibernate: Call flush_icache_range() on pages restored in-place
arm64: kernel: Include _AC definition in page.h
perf/ring_buffer: Refuse to begin AUX transaction after rb->aux_mmap_count drops
mac80211: fix "warning: ‘target_metric’ may be used uninitialized"
arm64/kernel: fix incorrect EL0 check in inv_entry macro
ARM: 8510/1: rework ARM_CPU_SUSPEND dependencies
staging: goldfish: audio: fix compiliation on arm
staging: ion: Set minimum carveout heap allocation order to PAGE_SHIFT
staging: ashmem: Add missing include
staging: ashmem: Avoid deadlock with mmap/shrink
asm-generic: Fix local variable shadow in __set_fixmap_offset
coresight: etm4x: Check every parameter used by dma_xx_coherent.
coresight: "DEVICE_ATTR_RO" should defined as static.
stm class: Fix a race in unlinking
stm class: Fix unbalanced module/device refcounting
stm class: Guard output assignment against concurrency
stm class: Fix unlocking braino in the error path
stm class: Support devices with multiple instances
stm class: Prevent user-controllable allocations
stm class: Fix link list locking
stm class: Fix locking in unbinding policy path
coresight: remove csdev's link from topology
coresight: release reference taken by 'bus_find_device()'
coresight: coresight_unregister() function cleanup
coresight: fixing lockdep error
writeback: initialize inode members that track writeback history
Revert "mmc: block: don't use parameter prefix if built as module"
net: diag: support v4mapped sockets in inet_diag_find_one_icsk()
perf: Synchronously free aux pages in case of allocation failure
arm64: hide __efistub_ aliases from kallsyms
hid-sensor-hub.c: fix wrong do_div() usage
vmstat: make vmstat_updater deferrable again and shut down on idle
android: unconditionally remove callbacks in sync_fence_free()
ARM: 8494/1: mm: Enable PXN when running non-LPAE kernel on LPAE processor
ARM: 8458/1: bL_switcher: add GIC dependency
efi: stub: define DISABLE_BRANCH_PROFILING for all architectures
arm64: fix COMPAT_SHMLBA definition for large pages
mmc: block: Allow more than 8 partitions per card
sched/fair: Fix new task's load avg removed from source CPU in wake_up_new_task()
Bluetooth: Verify that l2cap_get_conf_opt provides large enough buffer
Bluetooth: Check L2CAP option sizes returned from l2cap_get_conf_opt
ath10k: avoid possible string overflow
rtc: Fix overflow when converting time64_t to rtc_time
USB: core: only clean up what we allocated
lib/int_sqrt: optimize small argument
serial: sprd: clear timeout interrupt only rather than all interrupts
usb: renesas_usbhs: gadget: fix unused-but-set-variable warning
arm64: traps: disable irq in die()
Hang/soft lockup in d_invalidate with simultaneous calls
serial: sprd: adjust TIMEOUT to a big value
tcp/dccp: drop SYN packets if accept queue is full
usb: gadget: Add the gserial port checking in gs_start_tx()
usb: gadget: composite: fix dereference after null check coverify warning
kbuild: setlocalversion: print error to STDERR
extcon: usb-gpio: Don't miss event during suspend/resume
mm/rmap: replace BUG_ON(anon_vma->degree) with VM_WARN_ON
mmc: core: fix using wrong io voltage if mmc_select_hs200 fails
arm64: mm: Add trace_irqflags annotations to do_debug_exception()
usb: dwc3: gadget: Fix suspend/resume during device mode
mmc: core: shut up "voltage-ranges unspecified" pr_info()
mmc: sanitize 'bus width' in debug output
mmc: make MAN_BKOPS_EN message a debug
mmc: debugfs: Add a restriction to mmc debugfs clock setting
mmc: pwrseq_simple: Make reset-gpios optional to match doc
ALSA: hda - Enforces runtime_resume after S3 and S4 for each codec
ALSA: hda - Record the current power state before suspend/resume calls
locking/lockdep: Add debug_locks check in __lock_downgrade()
media: v4l2-ctrls.c/uvc: zero v4l2_event
mmc: tmio_mmc_core: don't claim spurious interrupts
ext4: brelse all indirect buffer in ext4_ind_remove_space()
ext4: fix data corruption caused by unaligned direct AIO
ext4: fix NULL pointer dereference while journal is aborted
futex: Ensure that futex address is aligned in handle_futex_death()
MIPS: Fix kernel crash for R6 in jump label branch function
mips: loongson64: lemote-2f: Add IRQF_NO_SUSPEND to "cascade" irqaction.
udf: Fix crash on IO error during truncate
drm/vmwgfx: Don't double-free the mode stored in par->set_mode
mmc: pxamci: fix enum type confusion
ANDROID: drop CONFIG_INPUT_KEYCHORD from cuttlefish and ranchu
UPSTREAM: virt_wifi: Remove REGULATORY_WIPHY_SELF_MANAGED
UPSTREAM: net: socket: set sock->sk to NULL after calling proto_ops::release()
f2fs: set pin_file under CAP_SYS_ADMIN
f2fs: fix to avoid deadlock in f2fs_read_inline_dir()
f2fs: fix to adapt small inline xattr space in __find_inline_xattr()
f2fs: fix to do sanity check with inode.i_inline_xattr_size
f2fs: give some messages for inline_xattr_size
f2fs: don't trigger read IO for beyond EOF page
f2fs: fix to add refcount once page is tagged PG_private
f2fs: remove wrong comment in f2fs_invalidate_page()
f2fs: fix to use kvfree instead of kzfree
f2fs: print more parameters in trace_f2fs_map_blocks
f2fs: trace f2fs_ioc_shutdown
f2fs: fix to avoid deadlock of atomic file operations
f2fs: fix to dirty inode for i_mode recovery
f2fs: give random value to i_generation
f2fs: no need to take page lock in readdir
f2fs: fix to update iostat correctly in IPU path
f2fs: fix encrypted page memory leak
f2fs: make fault injection covering __submit_flush_wait()
f2fs: fix to retry fill_super only if recovery failed
f2fs: silence VM_WARN_ON_ONCE in mempool_alloc
f2fs: correct spelling mistake
f2fs: fix wrong #endif
f2fs: don't clear CP_QUOTA_NEED_FSCK_FLAG
f2fs: don't allow negative ->write_io_size_bits
f2fs: fix to check inline_xattr_size boundary correctly
Revert "f2fs: fix to avoid deadlock of atomic file operations"
Revert "f2fs: fix to check inline_xattr_size boundary correctly"
f2fs: do not use mutex lock in atomic context
f2fs: fix potential data inconsistence of checkpoint
f2fs: fix to avoid deadlock of atomic file operations
f2fs: fix to check inline_xattr_size boundary correctly
f2fs: jump to label 'free_node_inode' when failing from d_make_root()
f2fs: fix to document inline_xattr_size option
f2fs: fix to data block override node segment by mistake
f2fs: fix typos in code comments
f2fs: sync filesystem after roll-forward recovery
fs: export evict_inodes
f2fs: flush quota blocks after turnning it off
f2fs: avoid null pointer exception in dcc_info
f2fs: don't wake up too frequently, if there is lots of IOs
f2fs: try to keep CP_TRIMMED_FLAG after successful umount
f2fs: add quick mode of checkpoint=disable for QA
f2fs: run discard jobs when put_super
f2fs: fix to set sbi dirty correctly
f2fs: UBSAN: set boolean value iostat_enable correctly
f2fs: add brackets for macros
f2fs: check if file namelen exceeds max value
f2fs: fix to trigger fsck if dirent.name_len is zero
f2fs: no need to check return value of debugfs_create functions
f2fs: export FS_NOCOW_FL flag to user
f2fs: check inject_rate validity during configuring
f2fs: remove set but not used variable 'err'
f2fs: fix compile warnings: 'struct *' declared inside parameter list
f2fs: change error code to -ENOMEM from -EINVAL
Conflicts:
arch/arm/Kconfig
arch/arm64/kernel/traps.c
drivers/hwtracing/coresight/coresight-etm4x.c
drivers/hwtracing/coresight/coresight-tmc.c
drivers/hwtracing/stm/Kconfig
drivers/hwtracing/stm/core.c
drivers/mmc/core/mmc.c
drivers/usb/gadget/function/u_serial.c
kernel/events/ring_buffer.c
net/wireless/nl80211.c
sound/core/compress_offload.c
Change-Id: I33783dbd0a25d678d6c61204f9e67690e57bed8f
Signed-off-by: Srinivasarao P <spathi@codeaurora.org>
* origin/upstream-f2fs-stable-linux-4.4.y:
f2fs: set pin_file under CAP_SYS_ADMIN
f2fs: fix to avoid deadlock in f2fs_read_inline_dir()
f2fs: fix to adapt small inline xattr space in __find_inline_xattr()
f2fs: fix to do sanity check with inode.i_inline_xattr_size
f2fs: give some messages for inline_xattr_size
f2fs: don't trigger read IO for beyond EOF page
f2fs: fix to add refcount once page is tagged PG_private
f2fs: remove wrong comment in f2fs_invalidate_page()
f2fs: fix to use kvfree instead of kzfree
f2fs: print more parameters in trace_f2fs_map_blocks
f2fs: trace f2fs_ioc_shutdown
f2fs: fix to avoid deadlock of atomic file operations
f2fs: fix to dirty inode for i_mode recovery
f2fs: give random value to i_generation
f2fs: no need to take page lock in readdir
f2fs: fix to update iostat correctly in IPU path
f2fs: fix encrypted page memory leak
f2fs: make fault injection covering __submit_flush_wait()
f2fs: fix to retry fill_super only if recovery failed
f2fs: silence VM_WARN_ON_ONCE in mempool_alloc
f2fs: correct spelling mistake
f2fs: fix wrong #endif
f2fs: don't clear CP_QUOTA_NEED_FSCK_FLAG
f2fs: don't allow negative ->write_io_size_bits
f2fs: fix to check inline_xattr_size boundary correctly
Revert "f2fs: fix to avoid deadlock of atomic file operations"
Revert "f2fs: fix to check inline_xattr_size boundary correctly"
f2fs: do not use mutex lock in atomic context
f2fs: fix potential data inconsistence of checkpoint
f2fs: fix to avoid deadlock of atomic file operations
f2fs: fix to check inline_xattr_size boundary correctly
f2fs: jump to label 'free_node_inode' when failing from d_make_root()
f2fs: fix to document inline_xattr_size option
f2fs: fix to data block override node segment by mistake
f2fs: fix typos in code comments
f2fs: sync filesystem after roll-forward recovery
fs: export evict_inodes
f2fs: flush quota blocks after turnning it off
f2fs: avoid null pointer exception in dcc_info
f2fs: don't wake up too frequently, if there is lots of IOs
f2fs: try to keep CP_TRIMMED_FLAG after successful umount
f2fs: add quick mode of checkpoint=disable for QA
f2fs: run discard jobs when put_super
f2fs: fix to set sbi dirty correctly
f2fs: UBSAN: set boolean value iostat_enable correctly
f2fs: add brackets for macros
f2fs: check if file namelen exceeds max value
f2fs: fix to trigger fsck if dirent.name_len is zero
f2fs: no need to check return value of debugfs_create functions
f2fs: export FS_NOCOW_FL flag to user
f2fs: check inject_rate validity during configuring
f2fs: remove set but not used variable 'err'
f2fs: fix compile warnings: 'struct *' declared inside parameter list
f2fs: change error code to -ENOMEM from -EINVAL
Change-Id: Ie1d5c18c7df01ffbe084ec5f4f63951ae8dc3aa0
Signed-off-by: Jaegeuk Kim <jaegeuk@google.com>
When umount of a partition fails with EBUSY there is
no indication as to what is keeping the mount point busy.
Add support to print a kernel log showing what files
are open on this mount point.
Also add a new new config option CONFIG_FILE_TABLE_DEBUG to
enable this feature.
Change-Id: Id7a3f5e7291b22ffd0f265848ec0a9757f713561
Signed-off-by: Nikhilesh Reddy <reddyn@codeaurora.org>
Signed-off-by: Ankit Jain <ankijain@codeaurora.org>
Now we pass the vfsmount when mounting and remounting.
This allows the filesystem to actually set up the mount
specific data, although we can't quite do anything with
it yet. show_options is expanded to include data that
lives with the mount.
To avoid changing existing filesystems, these have
been added as new vfs functions.
Change-Id: If80670bfad9f287abb8ac22457e1b034c9697097
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Now we pass the vfsmount when mounting and remounting.
This allows the filesystem to actually set up the mount
specific data, although we can't quite do anything with
it yet. show_options is expanded to include data that
lives with the mount.
To avoid changing existing filesystems, these have
been added as new vfs functions.
Change-Id: If80670bfad9f287abb8ac22457e1b034c9697097
Signed-off-by: Daniel Rosenberg <drosen@google.com>
There's a small consistency problem between the inode and writeback
naming. Writeback calls the "for IO" inode queues b_io and
b_more_io, but the inode calls these the "writeback list" or
i_wb_list. This makes it hard to an new "under writeback" list to
the inode, or call it an "under IO" list on the bdi because either
way we'll have writeback on IO and IO on writeback and it'll just be
confusing. I'm getting confused just writing this!
So, rename the inode "for IO" list variable to i_io_list so we can
add a new "writeback list" in a subsequent patch.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Dave Chinner <dchinner@redhat.com>
The process of reducing contention on per-superblock inode lists
starts with moving the locking to match the per-superblock inode
list. This takes the global lock out of the picture and reduces the
contention problems to within a single filesystem. This doesn't get
rid of contention as the locks still have global CPU scope, but it
does isolate operations on different superblocks form each other.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Dave Chinner <dchinner@redhat.com>
Make file->f_path always point to the overlay dentry so that the path in
/proc/pid/fd is correct and to ensure that label-based LSMs have access to the
overlay as well as the underlay (path-based LSMs probably don't need it).
Using my union testsuite to set things up, before the patch I see:
[root@andromeda union-testsuite]# bash 5</mnt/a/foo107
[root@andromeda union-testsuite]# ls -l /proc/$$/fd/
...
lr-x------. 1 root root 64 Jun 5 14:38 5 -> /a/foo107
[root@andromeda union-testsuite]# stat /mnt/a/foo107
...
Device: 23h/35d Inode: 13381 Links: 1
...
[root@andromeda union-testsuite]# stat -L /proc/$$/fd/5
...
Device: 23h/35d Inode: 13381 Links: 1
...
After the patch:
[root@andromeda union-testsuite]# bash 5</mnt/a/foo107
[root@andromeda union-testsuite]# ls -l /proc/$$/fd/
...
lr-x------. 1 root root 64 Jun 5 14:22 5 -> /mnt/a/foo107
[root@andromeda union-testsuite]# stat /mnt/a/foo107
...
Device: 23h/35d Inode: 40346 Links: 1
...
[root@andromeda union-testsuite]# stat -L /proc/$$/fd/5
...
Device: 23h/35d Inode: 40346 Links: 1
...
Note the change in where /proc/$$/fd/5 points to in the ls command. It was
pointing to /a/foo107 (which doesn't exist) and now points to /mnt/a/foo107
(which is correct).
The inode accessed, however, is the lower layer. The union layer is on device
25h/37d and the upper layer on 24h/36d.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
I've noticed significant locking contention in memory reclaimer around
sb_lock inside grab_super_passive(). Grab_super_passive() is called from
two places: in icache/dcache shrinkers (function super_cache_scan) and
from writeback (function __writeback_inodes_wb). Both are required for
progress in memory allocator.
Grab_super_passive() acquires sb_lock to increment sb->s_count and check
sb->s_instances. It seems sb->s_umount locked for read is enough here:
super-block deactivation always runs under sb->s_umount locked for write.
Protecting super-block itself isn't a problem: in super_cache_scan() sb
is protected by shrinker_rwsem: it cannot be freed if its slab shrinkers
are still active. Inside writeback super-block comes from inode from bdi
writeback list under wb->list_lock.
This patch removes locking sb_lock and checks s_instances under s_umount:
generic_shutdown_super() unlinks it under sb->s_umount locked for write.
New variant is called trylock_super() and since it only locks semaphore,
callers must call up_read(&sb->s_umount) instead of drop_super(sb) when
they're done.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull misc VFS updates from Al Viro:
"This cycle a lot of stuff sits on topical branches, so I'll be sending
more or less one pull request per branch.
This is the first pile; more to follow in a few. In this one are
several misc commits from early in the cycle (before I went for
separate branches), plus the rework of mntput/dput ordering on umount,
switching to use of fs_pin instead of convoluted games in
namespace_unlock()"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
switch the IO-triggering parts of umount to fs_pin
new fs_pin killing logics
allow attaching fs_pin to a group not associated with some superblock
get rid of the second argument of acct_kill()
take count and rcu_head out of fs_pin
dcache: let the dentry count go down to zero without taking d_lock
pull bumping refcount into ->kill()
kill pin_put()
mode_t whack-a-mole: chelsio
file->f_path.dentry is pinned down for as long as the file is open...
get rid of lustre_dump_dentry()
gut proc_register() a bit
kill d_validate()
ncpfs: get rid of d_validate() nonsense
selinuxfs: don't open-code d_genocide()
Kmem accounting of memcg is unusable now, because it lacks slab shrinker
support. That means when we hit the limit we will get ENOMEM w/o any
chance to recover. What we should do then is to call shrink_slab, which
would reclaim old inode/dentry caches from this cgroup. This is what
this patch set is intended to do.
Basically, it does two things. First, it introduces the notion of
per-memcg slab shrinker. A shrinker that wants to reclaim objects per
cgroup should mark itself as SHRINKER_MEMCG_AWARE. Then it will be
passed the memory cgroup to scan from in shrink_control->memcg. For
such shrinkers shrink_slab iterates over the whole cgroup subtree under
the target cgroup and calls the shrinker for each kmem-active memory
cgroup.
Secondly, this patch set makes the list_lru structure per-memcg. It's
done transparently to list_lru users - everything they have to do is to
tell list_lru_init that they want memcg-aware list_lru. Then the
list_lru will automatically distribute objects among per-memcg lists
basing on which cgroup the object is accounted to. This way to make FS
shrinkers (icache, dcache) memcg-aware we only need to make them use
memcg-aware list_lru, and this is what this patch set does.
As before, this patch set only enables per-memcg kmem reclaim when the
pressure goes from memory.limit, not from memory.kmem.limit. Handling
memory.kmem.limit is going to be tricky due to GFP_NOFS allocations, and
it is still unclear whether we will have this knob in the unified
hierarchy.
This patch (of 9):
NUMA aware slab shrinkers use the list_lru structure to distribute
objects coming from different NUMA nodes to different lists. Whenever
such a shrinker needs to count or scan objects from a particular node,
it issues commands like this:
count = list_lru_count_node(lru, sc->nid);
freed = list_lru_walk_node(lru, sc->nid, isolate_func,
isolate_arg, &sc->nr_to_scan);
where sc is an instance of the shrink_control structure passed to it
from vmscan.
To simplify this, let's add special list_lru functions to be used by
shrinkers, list_lru_shrink_count() and list_lru_shrink_walk(), which
consolidate the nid and nr_to_scan arguments in the shrink_control
structure.
This will also allow us to avoid patching shrinkers that use list_lru
when we make shrink_slab() per-memcg - all we will have to do is extend
the shrink_control structure to include the target memcg and make
list_lru_shrink_{count,walk} handle this appropriately.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Suggested-by: Dave Chinner <david@fromorbit.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Greg Thelen <gthelen@google.com>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
New pseudo-filesystem: nsfs. Targets of /proc/*/ns/* live there now.
It's not mountable (not even registered, so it's not in /proc/filesystems,
etc.). Files on it *are* bindable - we explicitly permit that in do_loopback().
This stuff lives in fs/nsfs.c now; proc_ns_fget() moved there as well.
get_proc_ns() is a macro now (it's simply returning ->i_private; would
have been an inline, if not for header ordering headache).
proc_ns_inode() is an ex-parrot. The interface used in procfs is
ns_get_path(path, task, ops) and ns_get_name(buf, size, task, ops).
Dentries and inodes are never hashed; a non-counting reference to dentry
is stashed in ns_common (removed by ->d_prune()) and reused by ns_get_path()
if present. See ns_get_path()/ns_prune_dentry/nsfs_evict() for details
of that mechanism.
As the result, proc_ns_follow_link() has stopped poking in nd->path.mnt;
it does nd_jump_link() on a consistent <vfsmount,dentry> pair it gets
from ns_get_path().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
We need to be able to check inode permissions (but not filesystem implied
permissions) for stackable filesystems. Expose this interface for overlayfs.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Pull vfs updates from Al Viro:
"The big thing in this pile is Eric's unmount-on-rmdir series; we
finally have everything we need for that. The final piece of prereqs
is delayed mntput() - now filesystem shutdown always happens on
shallow stack.
Other than that, we have several new primitives for iov_iter (Matt
Wilcox, culled from his XIP-related series) pushing the conversion to
->read_iter()/ ->write_iter() a bit more, a bunch of fs/dcache.c
cleanups and fixes (including the external name refcounting, which
gives consistent behaviour of d_move() wrt procfs symlinks for long
and short names alike) and assorted cleanups and fixes all over the
place.
This is just the first pile; there's a lot of stuff from various
people that ought to go in this window. Starting with
unionmount/overlayfs mess... ;-/"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (60 commits)
fs/file_table.c: Update alloc_file() comment
vfs: Deduplicate code shared by xattr system calls operating on paths
reiserfs: remove pointless forward declaration of struct nameidata
don't need that forward declaration of struct nameidata in dcache.h anymore
take dname_external() into fs/dcache.c
let path_init() failures treated the same way as subsequent link_path_walk()
fix misuses of f_count() in ppp and netlink
ncpfs: use list_for_each_entry() for d_subdirs walk
vfs: move getname() from callers to do_mount()
gfs2_atomic_open(): skip lookups on hashed dentry
[infiniband] remove pointless assignments
gadgetfs: saner API for gadgetfs_create_file()
f_fs: saner API for ffs_sb_create_file()
jfs: don't hash direct inode
[s390] remove pointless assignment of ->f_op in vmlogrdr ->open()
ecryptfs: ->f_op is never NULL
android: ->f_op is never NULL
nouveau: __iomem misannotations
missing annotation in fs/file.c
fs: namespace: suppress 'may be used uninitialized' warnings
...
Add guard_bio_eod() check for mpage code in order to allow us to do IO
even on the odd last sectors of a device, even if the block size is some
multiple of the physical sector size.
Using mpage_readpages() for block device requires this guard check.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The gcc version 4.9.1 compiler complains Even though it isn't possible for
these variables to not get initialized before they are used.
fs/namespace.c: In function ‘SyS_mount’:
fs/namespace.c:2720:8: warning: ‘kernel_dev’ may be used uninitialized in this function [-Wmaybe-uninitialized]
ret = do_mount(kernel_dev, kernel_dir->name, kernel_type, flags,
^
fs/namespace.c:2699:8: note: ‘kernel_dev’ was declared here
char *kernel_dev;
^
fs/namespace.c:2720:8: warning: ‘kernel_type’ may be used uninitialized in this function [-Wmaybe-uninitialized]
ret = do_mount(kernel_dev, kernel_dir->name, kernel_type, flags,
^
fs/namespace.c:2697:8: note: ‘kernel_type’ was declared here
char *kernel_type;
^
Fix the warnings by simplifying copy_mount_string() as suggested by Al Viro.
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
These externs belong in fs/internal.h. Rename (they are not acct-specific
anymore) and move them over there.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The only thing we need it for is alt-sysrq-r (emergency remount r/o)
and these days we can do just as well without going through the
list of files.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
aka br_write_{lock,unlock} of vfsmount_lock. Inlines in fs/mount.h,
vfsmount_lock extern moved over there as well.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Now that the shrinker is passing a node in the scan control structure, we
can pass this to the the generic LRU list code to isolate reclaim to the
lists on matching nodes.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Glauber Costa <glommer@parallels.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Convert superblock shrinker to use the new count/scan API, and propagate
the API changes through to the filesystem callouts. The filesystem
callouts already use a count/scan API, so it's just changing counters to
longs to match the VM API.
This requires the dentry and inode shrinker callouts to be converted to
the count/scan API. This is mainly a mechanical change.
[glommer@openvz.org: use mult_frac for fractional proportions, build fixes]
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Glauber Costa <glommer@openvz.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This series reworks our current object cache shrinking infrastructure in
two main ways:
* Noticing that a lot of users copy and paste their own version of LRU
lists for objects, we put some effort in providing a generic version.
It is modeled after the filesystem users: dentries, inodes, and xfs
(for various tasks), but we expect that other users could benefit in
the near future with little or no modification. Let us know if you
have any issues.
* The underlying list_lru being proposed automatically and
transparently keeps the elements in per-node lists, and is able to
manipulate the node lists individually. Given this infrastructure, we
are able to modify the up-to-now hammer called shrink_slab to proceed
with node-reclaim instead of always searching memory from all over like
it has been doing.
Per-node lru lists are also expected to lead to less contention in the lru
locks on multi-node scans, since we are now no longer fighting for a
global lock. The locks usually disappear from the profilers with this
change.
Although we have no official benchmarks for this version - be our guest to
independently evaluate this - earlier versions of this series were
performance tested (details at
http://permalink.gmane.org/gmane.linux.kernel.mm/100537) yielding no
visible performance regressions while yielding a better qualitative
behavior in NUMA machines.
With this infrastructure in place, we can use the list_lru entry point to
provide memcg isolation and per-memcg targeted reclaim. Historically,
those two pieces of work have been posted together. This version presents
only the infrastructure work, deferring the memcg work for a later time,
so we can focus on getting this part tested. You can see more about the
history of such work at http://lwn.net/Articles/552769/
Dave Chinner (18):
dcache: convert dentry_stat.nr_unused to per-cpu counters
dentry: move to per-sb LRU locks
dcache: remove dentries from LRU before putting on dispose list
mm: new shrinker API
shrinker: convert superblock shrinkers to new API
list: add a new LRU list type
inode: convert inode lru list to generic lru list code.
dcache: convert to use new lru list infrastructure
list_lru: per-node list infrastructure
shrinker: add node awareness
fs: convert inode and dentry shrinking to be node aware
xfs: convert buftarg LRU to generic code
xfs: rework buffer dispose list tracking
xfs: convert dquot cache lru to list_lru
fs: convert fs shrinkers to new scan/count API
drivers: convert shrinkers to new count/scan API
shrinker: convert remaining shrinkers to count/scan API
shrinker: Kill old ->shrink API.
Glauber Costa (7):
fs: bump inode and dentry counters to long
super: fix calculation of shrinkable objects for small numbers
list_lru: per-node API
vmscan: per-node deferred work
i915: bail out earlier when shrinker cannot acquire mutex
hugepage: convert huge zero page shrinker to new shrinker API
list_lru: dynamically adjust node arrays
This patch:
There are situations in very large machines in which we can have a large
quantity of dirty inodes, unused dentries, etc. This is particularly true
when umounting a filesystem, where eventually since every live object will
eventually be discarded.
Dave Chinner reported a problem with this while experimenting with the
shrinker revamp patchset. So we believe it is time for a change. This
patch just moves int to longs. Machines where it matters should have a
big long anyway.
Signed-off-by: Glauber Costa <glommer@openvz.org>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
We check submounts before doing d_drop() on a non-empty directory dentry in
NFS (have_submounts()), but we do not exclude a racing mount. Nor do we
prevent mounts to be added to the disconnected subtree using relative paths
after the d_drop().
This patch fixes these issues by checking for unlinked (unhashed, non-root)
ancestors before proceeding with the mount. This is done with rename
seqlock taken for write and with ->d_lock grabbed on each ancestor in turn,
including our dentry itself. This ensures that the only one of
check_submounts_and_drop() or has_unlinked_ancestor() can succeed.
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
default_file_splice_from() ends up calling vfs_write() (via very convoluted
callchain). It's an overkill, since we already have done rw_verify_area()
in the caller by the time we call vfs_write() we are under set_fs(KERNEL_DS),
so access_ok() is also pointless. Add a new helper (__kernel_write()),
use it instead of kernel_write() in there.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Commit 169ebd9013 ("writeback: Avoid iput() from flusher thread")
removed iget-iput pair from inode writeback. As a side effect, inodes
that are dirty during iput_final() call won't be ever added to inode LRU
(iput_final() doesn't add dirty inodes to LRU and later when the inode
is cleaned there's noone to add the inode there). Thus inodes are
effectively unreclaimable until someone looks them up again.
The practical effect of this bug is limited by the fact that inodes are
pinned by a dentry for long enough that the inode gets cleaned. But
still the bug can have nasty consequences leading up to OOM conditions
under certain circumstances. Following can easily reproduce the
problem:
for (( i = 0; i < 1000; i++ )); do
mkdir $i
for (( j = 0; j < 1000; j++ )); do
touch $i/$j
echo 2 > /proc/sys/vm/drop_caches
done
done
then one needs to run 'sync; ls -lR' to make inodes reclaimable again.
We fix the issue by inserting unused clean inodes into the LRU after
writeback finishes in inode_sync_complete().
Signed-off-by: Jan Kara <jack@suse.cz>
Reported-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: <stable@vger.kernel.org> [3.5+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
...and fix up the callers. For do_file_open_root, just declare a
struct filename on the stack and fill out the .name field. For
do_filp_open, make it also take a struct filename pointer, and fix up its
callers to call it appropriately.
For filp_open, add a variant that takes a struct filename pointer and turn
filp_open into a wrapper around it.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Most of places where we want freeze protection coincides with the places where
we also have remount-ro protection. So make mnt_want_write() and
mnt_drop_write() (and their _file alternative) prevent freezing as well.
For the few cases that are really interested only in remount-ro protection
provide new function variants.
BugLink: https://bugs.launchpad.net/bugs/897421
Tested-by: Kamal Mostafa <kamal@canonical.com>
Tested-by: Peter M. Petrakis <peter.petrakis@canonical.com>
Tested-by: Dann Frazier <dann.frazier@canonical.com>
Tested-by: Massimo Morana <massimo.morana@canonical.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Split inode_permission() into inode- and superblock-dependent parts.
This is aimed at unionmounts where the superblock from the upper layer has to
be checked rather than the superblock from the lower layer as the upper layer
may be writable, thus allowing an unwritable file from the lower layer to be
copied up and modified.
Original-author: Valerie Aurora <vaurora@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com> (Further development)
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Just pass struct file *. Methods are happier that way...
There's no need to return struct file * from finish_open() now,
so let it return int. Next: saner prototypes for parts in
namei.c
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
All users of open intents have been converted to use ->atomic_{open,create}.
This patch gets rid of nd->intent.open and related infrastructure.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Add a new inode operation which is called on the last component of an open.
Using this the filesystem can look up, possibly create and open the file in one
atomic operation. If it cannot perform this (e.g. the file type turned out to
be wrong) it may signal this by returning NULL instead of an open struct file
pointer.
i_op->atomic_open() is only called if the last component is negative or needs
lookup. Handling cached positive dentries here doesn't add much value: these
can be opened using f_op->open(). If the cached file turns out to be invalid,
the open can be retried, this time using ->atomic_open() with a fresh dentry.
For now leave the old way of using open intents in lookup and revalidate in
place. This will be removed once all the users are converted.
David Howells noticed that if ->atomic_open() opens the file but does not create
it, handle_truncate() will be called on it even if it is not a regular file.
Fix this by checking the file type in this case too.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
it's enough to set ->mnt_ns of internal vfsmounts to something
distinct from all struct mnt_namespace out there; then we can
just use the check for ->mnt_ns != NULL in the fast path of
mntput_no_expire()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Split __dentry_open() into two functions:
do_dentry_open() - does most of the actual work, doesn't put file on failure
open_check_o_direct() - after a successful open, checks direct_IO method
This will allow i_op->atomic_open to do just the file initialization and leave
the direct_IO checking to the VFS.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
lglocks and brlocks are currently generated with some complicated macros
in lglock.h. But there's no reason to not just use common utility
functions and put all the data into a common data structure.
Since there are at least two users it makes sense to share this code in a
library. This is also easier maintainable than a macro forest.
This will also make it later possible to dynamically allocate lglocks and
also use them in modules (this would both still need some additional, but
now straightforward, code)
[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Currently remouting superblock read-only is racy in a major way.
With the per mount read-only infrastructure it is now possible to
prevent most races, which this patch attempts.
Before starting the remount read-only, iterate through all mounts
belonging to the superblock and if none of them have any pending
writes, set sb->s_readonly_remount. This indicates that remount is in
progress and no further write requests are allowed. If the remount
succeeds set MS_RDONLY and reset s_readonly_remount.
If the remounting is unsuccessful just reset s_readonly_remount.
This can result in transient EROFS errors, despite the fact the
remount failed. Unfortunately hodling off writes is difficult as
remount itself may touch the filesystem (e.g. through load_nls())
which would deadlock.
A later patch deals with delayed writes due to nlink going to zero.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Tested-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>