The namespace cleanup path leaks a dentry which holds a reference count
on a network namespace. Keeping that network namespace from being freed
when the last user goes away. Leaving things like vlan devices in the
leaked network namespace.
If you use ip netns add for much real work this problem becomes apparent
pretty quickly. It light testing the problem hides because frequently
you simply don't notice the leak.
Use d_set_d_op() so that DCACHE_OP_* flags are set correctly.
This issue exists back to 3.0.
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Reported-by: Justin Pettit <jpettit@nicira.com>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
Cc: David Miller <davem@davemloft.net>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Process accounting applications as top, ps visit some files under
/proc/<pid>. With seq_put_decimal_ull(), we can optimize /proc/<pid>/stat
and /proc/<pid>/statm files.
This patch adds
- seq_put_decimal_ll() for signed values.
- allow delimiter == 0.
- convert seq_printf() to seq_put_decimal_ull/ll in /proc/stat, statm.
Test result on a system with 2000+ procs.
Before patch:
[kamezawa@bluextal test]$ top -b -n 1 | wc -l
2223
[kamezawa@bluextal test]$ time top -b -n 1 > /dev/null
real 0m0.675s
user 0m0.044s
sys 0m0.121s
[kamezawa@bluextal test]$ time ps -elf > /dev/null
real 0m0.236s
user 0m0.056s
sys 0m0.176s
After patch:
kamezawa@bluextal ~]$ time top -b -n 1 > /dev/null
real 0m0.657s
user 0m0.052s
sys 0m0.100s
[kamezawa@bluextal ~]$ time ps -elf > /dev/null
real 0m0.198s
user 0m0.050s
sys 0m0.145s
Considering top, ps tend to scan /proc periodically, this will reduce cpu
consumption by top/ps to some extent.
[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
== stat_check.py
num = 0
with open("/proc/stat") as f:
while num < 1000 :
data = f.read()
f.seek(0, 0)
num = num + 1
==
perf shows
20.39% stat_check.py [kernel.kallsyms] [k] format_decode
13.41% stat_check.py [kernel.kallsyms] [k] number
12.61% stat_check.py [kernel.kallsyms] [k] vsnprintf
10.85% stat_check.py [kernel.kallsyms] [k] memcpy
4.85% stat_check.py [kernel.kallsyms] [k] radix_tree_lookup
4.43% stat_check.py [kernel.kallsyms] [k] seq_printf
This patch removes most of calls to vsnprintf() by adding num_to_str()
and seq_print_decimal_ull(), which prints decimal numbers without rich
functions provided by printf().
On my 8cpu box.
== Before patch ==
[root@bluextal test]# time ./stat_check.py
real 0m0.150s
user 0m0.026s
sys 0m0.121s
== After patch ==
[root@bluextal test]# time ./stat_check.py
real 0m0.055s
user 0m0.022s
sys 0m0.030s
[akpm@linux-foundation.org: remove incorrect comment, use less statck in num_to_str(), move comment from .h to .c, simplify seq_put_decimal_ull()]
[andrea@betterlinux.com: avoid breaking the ABI in /proc/stat]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrea Righi <andrea@betterlinux.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Paul Turner <pjt@google.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On a typical 16 cpus machine, "cat /proc/stat" gives more than 4096 bytes,
and is slow :
# strace -T -o /tmp/STRACE cat /proc/stat | wc -c
5826
# grep "cpu " /tmp/STRACE
read(0, "cpu 1949310 19 2144714 12117253"..., 32768) = 5826 <0.001504>
Thats partly because show_stat() must be called twice since initial
buffer size is too small (4096 bytes for less than 32 possible cpus)
Fix this by :
1) Taking into account nr_irqs in the initial buffer sizing.
2) Using ksize() to allow better filling of initial buffer.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Russell King - ARM Linux <linux@arm.linux.org.uk>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
get_sparsemem_vmemmap_info() is only used inside fs/proc/kcore.c
Signed-off-by: Djalal Harouni <tixxdz@opendz.org>
Reviewed-by: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since we no longer need the VM_ALWAYSDUMP flag, let's use the freed bit
for 'VM_NODUMP' flag. The idea is is to add a new madvise() flag:
MADV_DONTDUMP, which can be set by applications to specifically request
memory regions which should not dump core.
The specific application I have in mind is qemu: we can add a flag there
that wouldn't dump all of guest memory when qemu dumps core. This flag
might also be useful for security sensitive apps that want to absolutely
make sure that parts of memory are not dumped. To clear the flag use:
MADV_DODUMP.
[akpm@linux-foundation.org: s/MADV_NODUMP/MADV_DONTDUMP/, s/MADV_CLEAR_NODUMP/MADV_DODUMP/, per Roland]
[akpm@linux-foundation.org: fix up the architectures which broke]
Signed-off-by: Jason Baron <jbaron@redhat.com>
Acked-by: Roland McGrath <roland@hack.frob.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Helge Deller <deller@gmx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The motivation for this patchset was that I was looking at a way for a
qemu-kvm process, to exclude the guest memory from its core dump, which
can be quite large. There are already a number of filter flags in
/proc/<pid>/coredump_filter, however, these allow one to specify 'types'
of kernel memory, not specific address ranges (which is needed in this
case).
Since there are no more vma flags available, the first patch eliminates
the need for the 'VM_ALWAYSDUMP' flag. The flag is used internally by
the kernel to mark vdso and vsyscall pages. However, it is simple
enough to check if a vma covers a vdso or vsyscall page without the need
for this flag.
The second patch then replaces the 'VM_ALWAYSDUMP' flag with a new
'VM_NODUMP' flag, which can be set by userspace using new madvise flags:
'MADV_DONTDUMP', and unset via 'MADV_DODUMP'. The core dump filters
continue to work the same as before unless 'MADV_DONTDUMP' is set on the
region.
The qemu code which implements this features is at:
http://people.redhat.com/~jbaron/qemu-dump/qemu-dump.patch
In my testing the qemu core dump shrunk from 383MB -> 13MB with this
patch.
I also believe that the 'MADV_DONTDUMP' flag might be useful for
security sensitive apps, which might want to select which areas are
dumped.
This patch:
The VM_ALWAYSDUMP flag is currently used by the coredump code to
indicate that a vma is part of a vsyscall or vdso section. However, we
can determine if a vma is in one these sections by checking it against
the gate_vma and checking for a non-NULL return value from
arch_vma_name(). Thus, freeing a valuable vma bit.
Signed-off-by: Jason Baron <jbaron@redhat.com>
Acked-by: Roland McGrath <roland@hack.frob.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Avi Kivity <avi@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since '*outlen' is initialized to zero, it is currently possible to
create a filename of length (FAT_LFN_LEN + 1) when utf8 is not enabled.
To enforce the FAT_LFN_LEN limit, we must perform one less iteration.
Signed-off-by: Namjae Jeon <linkinjeon@gmail.com>
Signed-off-by: Ravishankar N <cyberax82@gmail.com>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
xlate_to_uni() is called by vfat_build_slots() with sbi->nls_io as the
final argument. nls_io can never be null at this point because the
check is already being done in fat_fill_super() wherein the mount fails
if it is null.
Signed-off-by: Namjae Jeon <linkinjeon@gmail.com>
Signed-off-by: Ravishankar N <cyberax82@gmail.com>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We never use the length variable.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Jason Baron <jbaron@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Looking for a bug in -rt, I stumbled across this code here from: commit
2dfa4eeab0 ("epoll keyed wakeups: teach epoll about hints coming with
the wakeup key"), specifically:
#ifdef CONFIG_DEBUG_LOCK_ALLOC
static inline void ep_wake_up_nested(wait_queue_head_t *wqueue,
unsigned long events, int subclass)
{
unsigned long flags;
spin_lock_irqsave_nested(&wqueue->lock, flags, subclass);
wake_up_locked_poll(wqueue, events);
spin_unlock_irqrestore(&wqueue->lock, flags);
}
#else
static inline void ep_wake_up_nested(wait_queue_head_t *wqueue,
unsigned long events, int subclass)
{
wake_up_poll(wqueue, events);
}
#endif
You change the function of ep_wake_up_nested() depending on whether
CONFIG_DEBUG_LOCK_ALLOC is set or not. This looks awfully suspicious,
and there's no comment to explain why. I initially thought that this
was trying to fool lockdep, and hiding a real bug.
Investigating it, I found the creation of wake_up_nested() (which no
longer exists) but was created for the sole purpose of epoll and its
strange wake ups, as explained in commit 0ccf831cbe ("lockdep:
annotate epoll")
Although the commit message says "annotate epoll" the change log is much
better at explaining what is happening than what is in the actual code.
Thus a comment is really necessary here. And to save the time of other
developers from having to go trudging through the git logs trying to
figure out why this code exists.
I took parts of the change log and placed it into a comment above the
affected code. This will make the description of what is happening more
visible to new developers that have to look at this code for the first
time.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Davide Libenzi <davidel@xmailserver.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In some cases the poll() implementation in a driver has to do different
things depending on the events the caller wants to poll for. An example
is when a driver needs to start a DMA engine if the caller polls for
POLLIN, but doesn't want to do that if POLLIN is not requested but instead
only POLLOUT or POLLPRI is requested. This is something that can happen
in the video4linux subsystem among others.
Unfortunately, the current epoll/poll/select implementation doesn't
provide that information reliably. The poll_table_struct does have it: it
has a key field with the event mask. But once a poll() call matches one
or more bits of that mask any following poll() calls are passed a NULL
poll_table pointer.
Also, the eventpoll implementation always left the key field at ~0 instead
of using the requested events mask.
This was changed in eventpoll.c so the key field now contains the actual
events that should be polled for as set by the caller.
The solution to the NULL poll_table pointer is to set the qproc field to
NULL in poll_table once poll() matches the events, not the poll_table
pointer itself. That way drivers can obtain the mask through a new
poll_requested_events inline.
The poll_table_struct can still be NULL since some kernel code calls it
internally (netfs_state_poll() in ./drivers/staging/pohmelfs/netfs.h). In
that case poll_requested_events() returns ~0 (i.e. all events).
Very rarely drivers might want to know whether poll_wait will actually
wait. If another earlier file descriptor in the set already matched the
events the caller wanted to wait for, then the kernel will return from the
select() call without waiting. This might be useful information in order
to avoid doing expensive work.
A new helper function poll_does_not_wait() is added that drivers can use
to detect this situation. This is now used in sock_poll_wait() in
include/net/sock.h. This was the only place in the kernel that needed
this information.
Drivers should no longer access any of the poll_table internals, but use
the poll_requested_events() and poll_does_not_wait() access functions
instead. In order to enforce that the poll_table fields are now prepended
with an underscore and a comment was added warning against using them
directly.
This required a change in unix_dgram_poll() in unix/af_unix.c which used
the key field to get the requested events. It's been replaced by a call
to poll_requested_events().
For qproc it was especially important to change its name since the
behavior of that field changes with this patch since this function pointer
can now be NULL when that wasn't possible in the past.
Any driver accessing the qproc or key fields directly will now fail to compile.
Some notes regarding the correctness of this patch: the driver's poll()
function is called with a 'struct poll_table_struct *wait' argument. This
pointer may or may not be NULL, drivers can never rely on it being one or
the other as that depends on whether or not an earlier file descriptor in
the select()'s fdset matched the requested events.
There are only three things a driver can do with the wait argument:
1) obtain the key field:
events = wait ? wait->key : ~0;
This will still work although it should be replaced with the new
poll_requested_events() function (which does exactly the same).
This will now even work better, since wait is no longer set to NULL
unnecessarily.
2) use the qproc callback. This could be deadly since qproc can now be
NULL. Renaming qproc should prevent this from happening. There are no
kernel drivers that actually access this callback directly, BTW.
3) test whether wait == NULL to determine whether poll would return without
waiting. This is no longer sufficient as the correct test is now
wait == NULL || wait->_qproc == NULL.
However, the worst that can happen here is a slight performance hit in
the case where wait != NULL and wait->_qproc == NULL. In that case the
driver will assume that poll_wait() will actually add the fd to the set
of waiting file descriptors. Of course, poll_wait() will not do that
since it tests for wait->_qproc. This will not break anything, though.
There is only one place in the whole kernel where this happens
(sock_poll_wait() in include/net/sock.h) and that code will be replaced
by a call to poll_does_not_wait() in the next patch.
Note that even if wait->_qproc != NULL drivers cannot rely on poll_wait()
actually waiting. The next file descriptor from the set might match the
event mask and thus any possible waits will never happen.
Signed-off-by: Hans Verkuil <hans.verkuil@cisco.com>
Reviewed-by: Jonathan Corbet <corbet@lwn.net>
Reviewed-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Cc: Mauro Carvalho Chehab <mchehab@infradead.org>
Cc: David Miller <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Move open-coded filesystem magic numbers into magic.h
- Rearrange magic.h so that the filesystem-related constants are grouped
together.
Signed-off-by: Muthukumar R <muthur@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
cifs.ko has historically been tolerant of options that it does not
recognize. This is not normal behavior for a filesystem however.
Usually, it should only do this if you mount with '-s', and autofs
generally passes -s to the mount command to allow this behavior.
This patch makes cifs handle the option "sloppy" appropriately. If it's
present in the options string, then the client will tolerate options
that it doesn't recognize. If it's not present then the client will
error out in the presence of options that it does not recognize and
throw an error message explaining why.
There is also a companion patch being proposed for mount.cifs to make it
append "sloppy" to the mount options when passed the '-s' flag. This also
should (obviously) be applied on top of Sachin's conversion to the
standard option parser.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-By: Sachin Prabhu <sprabhu@redhat.com>
Use the standard token parser instead of the long if condition to parse
cifs mount options.
This was first proposed by Scott Lovenberg
http://lists.samba.org/archive/linux-cifs-client/2010-May/006079.html
Mount options have been grouped together in terms of their input types.
Aliases for username, password, domain and credentials have been added.
The password parser has been modified to make it easier to read.
Since the patch was first proposed, the following bugs have been fixed
1) Allow blank 'pass' option to be passed by the cifs mount helper when
using sec=none.
2) Do not explicitly set vol->nullauth to 0. This causes a problem
when using sec=none while also using a username.
Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
cifs_update_eof has the potential to be racy if multiple threads are
trying to modify it at the same time. Protect modifications of the
server_eof value with the inode->i_lock.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
We'll need to do something a bit different depending on the caller.
Abstract the code that marshals the page array into an iovec.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Pavel Shilovsky <piastry@etersoft.ru>
Use DIV_ROUND_UP. Also, PAGE_SIZE is more appropriate here since these
aren't pagecache pages.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Pavel Shilovsky <piastry@etersoft.ru>
The gfp flags are currently set to __GPF_HIGHMEM, which doesn't allow
for any reclaim. Make this more resilient by or'ing that with
GFP_KERNEL. Also, get rid of the goto and unify the exit codepath.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Pavel Shilovsky <piastry@etersoft.ru>
We'll need a different set of write completion ops when not writing out
of the pagecache.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Pavel Shilovsky <piastry@etersoft.ru>
We'll need this to handle rwpidforward option correctly when we use
async writes in the aio_write op.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Pavel Shilovsky <piastry@etersoft.ru>
...and convert existing cifs users of system_nrt_wq to use that instead.
Also, make it freezable, and set WQ_MEM_RECLAIM since we use it to
deal with write reply handling.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
While in CIFS/SMB we have 16 bit mid, in SMB2 it is 64 bit.
Convert the existing field to 64 bit and mask off higher bits
for CIFS/SMB.
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Pull XFS updates from Ben Myers:
"Scalability improvements for dquots, log grant code cleanups, plus
bugfixes and cleanups large and small"
Fix up various trivial conflicts that were due to some of the earlier
patches already having been integrated into v3.3 as bugfixes, and then
there were development patches on top of those. Easily merged by just
taking the newer version from the pulled branch.
* 'for-linus' of git://oss.sgi.com/xfs/xfs: (45 commits)
xfs: fallback to vmalloc for large buffers in xfs_getbmap
xfs: fallback to vmalloc for large buffers in xfs_attrmulti_attr_get
xfs: remove remaining scraps of struct xfs_iomap
xfs: fix inode lookup race
xfs: clean up minor sparse warnings
xfs: remove the global xfs_Gqm structure
xfs: remove the per-filesystem list of dquots
xfs: use per-filesystem radix trees for dquot lookup
xfs: per-filesystem dquot LRU lists
xfs: use common code for quota statistics
xfs: reimplement fdatasync support
xfs: split in-core and on-disk inode log item fields
xfs: make xfs_inode_item_size idempotent
xfs: log timestamp updates
xfs: log file size updates at I/O completion time
xfs: log file size updates as part of unwritten extent conversion
xfs: do not require an ioend for new EOF calculation
xfs: use per-filesystem I/O completion workqueues
quota: make Q_XQUOTASYNC a noop
xfs: include reservations in quota reporting
...
Pull CIFS fixes from Steve French
* git://git.samba.org/sfrench/cifs-2.6:
cifs: clean up ordering in exit_cifs
cifs: clean up call to cifs_dfs_release_automount_timer()
CIFS: Delete echo_retries module parm
CIFS: Prepare credits code for a slot reservation
CIFS: Make wait_for_free_request killable
CIFS: Introduce credit-based flow control
CIFS: Simplify inFlight logic
cifs: fix issue mounting of DFS ROOT when redirecting from one domain controller to the next
CIFS: Respect negotiated MaxMpxCount
CIFS: Fix a spurious error in cifs_push_posix_locks
New features include:
- Add NFS client support for containers.
This should enable most of the necessary functionality, including
lockd support, and support for rpc.statd, NFSv4 idmapper and
RPCSEC_GSS upcalls into the correct network namespace from
which the mount system call was issued.
- NFSv4 idmapper scalability improvements
Base the idmapper cache on the keyring interface to allow concurrent
access to idmapper entries. Start the process of migrating users from
the single-threaded daemon-based approach to the multi-threaded
request-key based approach.
- NFSv4.1 implementation id.
Allows the NFSv4.1 client and server to mutually identify each other
for logging and debugging purposes.
- Support the 'vers=4.1' mount option for mounting NFSv4.1 instead of
having to use the more counterintuitive 'vers=4,minorversion=1'.
- SUNRPC tracepoints.
Start the process of adding tracepoints in order to improve debugging
of the RPC layer.
- pNFS object layout support for autologin.
Important bugfixes include:
- Fix a bug in rpc_wake_up/rpc_wake_up_status that caused them to fail
to wake up all tasks when applied to priority waitqueues.
- Ensure that we handle read delegations correctly, when we try to
truncate a file.
- A number of fixes for NFSv4 state manager loops (mostly to do with
delegation recovery).
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIcBAABAgAGBQJPalZbAAoJEGcL54qWCgDyCi4P+QHcmzQhJO7HWx3Pzjs67bFT
xMSYaKHGWS4AJKUBVl5OKBxUExfrMHBNbElV3IKUIwBlDx8RVtnwfptKSe146iki
dn4TrRO5es8nmI4hRDcGMlzJDZq4y0Qg//qiUFmojiNW/Avw0ljfMoVUejJJ09FV
oeDk4EGtcxkEyH+g48ZjYbyspRnG8qtD3atf70Z3lYE0ELdG/B5Dyzw1RDrA5p73
xJX3lqy8p/4ROzw/dmNoxdAXOrr3Q4/T58Bvp/lUglPy/EHyPmWzFoH0MU0C/PFu
5VnAl6QDbNCTcIw9FvJlX/mIyErpNG9eKzUskUc9L9SA+B+J/i4rIap4KATRN3nH
7QhE5qUacPuJnvxml7MPmlQTuft3fkAQ7NhKIWrbRi1QS9FmJC5NxctIb8loqlFn
yIXdKeLfMshB+NyuFS9uzStX7SmV3eMgVd+5ZxRjYxm+PKJLw2KXeudArL6M5mHK
3QeKZpqwaYQ3RfaTNpvAp0doiXHCO5UbWfI0Pe8xQs/QcMCNReffqV2G4IJKFAu6
WpoN2UDQC9LCBifLw2nS7kku8+ZVXLQU8OC1NVl3TG15xD9cNLXuk3/y5llPGq4O
odo52uLFpJohbDaHMj5RTKOfchTQCm2iyuVmxZEeAySypMSiAXmW7COSKHs/HxI1
VBm+EI00Pvmm5+fUjIlp
=LuHE
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-3.4-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client updates for Linux 3.4 from Trond Myklebust:
"New features include:
- Add NFS client support for containers.
This should enable most of the necessary functionality, including
lockd support, and support for rpc.statd, NFSv4 idmapper and
RPCSEC_GSS upcalls into the correct network namespace from which
the mount system call was issued.
- NFSv4 idmapper scalability improvements
Base the idmapper cache on the keyring interface to allow
concurrent access to idmapper entries. Start the process of
migrating users from the single-threaded daemon-based approach to
the multi-threaded request-key based approach.
- NFSv4.1 implementation id.
Allows the NFSv4.1 client and server to mutually identify each
other for logging and debugging purposes.
- Support the 'vers=4.1' mount option for mounting NFSv4.1 instead of
having to use the more counterintuitive 'vers=4,minorversion=1'.
- SUNRPC tracepoints.
Start the process of adding tracepoints in order to improve
debugging of the RPC layer.
- pNFS object layout support for autologin.
Important bugfixes include:
- Fix a bug in rpc_wake_up/rpc_wake_up_status that caused them to
fail to wake up all tasks when applied to priority waitqueues.
- Ensure that we handle read delegations correctly, when we try to
truncate a file.
- A number of fixes for NFSv4 state manager loops (mostly to do with
delegation recovery)."
* tag 'nfs-for-3.4-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (224 commits)
NFS: fix sb->s_id in nfs debug prints
xprtrdma: Remove assumption that each segment is <= PAGE_SIZE
xprtrdma: The transport should not bug-check when a dup reply is received
pnfs-obj: autologin: Add support for protocol autologin
NFS: Remove nfs4_setup_sequence from generic rename code
NFS: Remove nfs4_setup_sequence from generic unlink code
NFS: Remove nfs4_setup_sequence from generic read code
NFS: Remove nfs4_setup_sequence from generic write code
NFS: Fix more NFS debug related build warnings
SUNRPC/LOCKD: Fix build warnings when CONFIG_SUNRPC_DEBUG is undefined
nfs: non void functions must return a value
SUNRPC: Kill compiler warning when RPC_DEBUG is unset
SUNRPC/NFS: Add Kbuild dependencies for NFS_DEBUG/RPC_DEBUG
NFS: Use cond_resched_lock() to reduce latencies in the commit scans
NFSv4: It is not safe to dereference lsp->ls_state in release_lockowner
NFS: ncommit count is being double decremented
SUNRPC: We must not use list_for_each_entry_safe() in rpc_wake_up()
Try using machine credentials for RENEW calls
NFSv4.1: Fix a few issues in filelayout_commit_pagelist
NFSv4.1: Clean ups and bugfixes for the pNFS read/writeback/commit code
...
into debugfs, and use __read_mostly as neccessary.
Also add a MAINTAINER file for cleancache API files.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
iQEcBAABAgAGBQJPZ1sjAAoJEFjIrFwIi8fJCSkIALqBBCkuPSusvdebT59Oq145
EW2eEwJzVft0vlvS/IPl+W37TH5VWhBVCW8frAYMpLaY5X/W1ZwzFpx76T4acAiM
wkXwsyYXs6N13OOFoH5gmf3cwproAioRxbEeALucxeMR4rK5Yw+oZXT+hSu2KaLh
1LtHyx+u+NLb0QOseAuDcmyjY9r6aLBA1HMcVD2z+4UW1n/9NWexQP3ShYP9uMDs
GnyDzZ8vWXTcoc4Auj0rpaNsT5d47ltGegKASZmmvS3QyMHbZ4sk2HECnQY4wSee
alPJwDyb6mOJmQ3e4s940onjfZPgd8/cW/DVEUrveH+dz6Eqjxqz41dVRVXiLz8=
=tUNC
-----END PGP SIGNATURE-----
Merge tag 'stable/for-linus-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/mm
Pull cleancache changes from Konrad Rzeszutek Wilk:
"This has some patches for the cleancache API that should have been
submitted a _long_ time ago. They are basically cleanups:
- rename of flush to invalidate
- moving reporting of statistics into debugfs
- use __read_mostly as necessary.
Oh, and also the MAINTAINERS file change. The files (except the
MAINTAINERS file) have been in #linux-next for months now. The late
addition of MAINTAINERS file is a brain-fart on my side - didn't
realize I needed that just until I was typing this up - and I based
that patch on v3.3 - so the tree is on top of v3.3."
* tag 'stable/for-linus-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/mm:
MAINTAINERS: Adding cleancache API to the list.
mm: cleancache: Use __read_mostly as appropiate.
mm: cleancache: report statistics via debugfs instead of sysfs.
mm: zcache/tmem/cleancache: s/flush/invalidate/
mm: cleancache: s/flush/invalidate/
While doing the fs/namei.c cleanups, I ran sparse on it, and it pointed
out other large integers and a couple of cases of us using '0' instead
of the proper 'NULL'.
Sparse still doesn't understand some of the conditional locking going
on, but that's no excuse for not fixing up the trivial stuff.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In commit commit 1de5b41cd3 ("fs/namei.c: fix warnings on 32-bit")
Andrew said that there must be a tidier way of doing this.
This is that tidier way.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix kernel-doc warnings in fs/dcache.c:
Warning(fs/dcache.c:1743): No description found for parameter 'seqp'
Warning(fs/dcache.c:1743): Excess function parameter 'seq' description in '__d_lookup_rcu'
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We want it to match what hash_name() is doing, which means extra
multiply by 9 in this case...
Reported-and-Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Protect code accessing ctl_table by grabbing the header with grab_header()
and after releasing with sysctl_head_finish(). This is needed if poll()
is called in entries created by modules: currently only hostname and
domainname support poll(), but this bug may be triggered when/if modules
use it and if user called poll() in a file that doesn't support it.
Dave Jones reported the following when using a syscall fuzzer while
hibernating/resuming:
RIP: 0010:[<ffffffff81233e3e>] [<ffffffff81233e3e>] proc_sys_poll+0x4e/0x90
RAX: 0000000000000145 RBX: ffff88020cab6940 RCX: 0000000000000000
RDX: ffffffff81233df0 RSI: 6b6b6b6b6b6b6b6b RDI: ffff88020cab6940
[ ... ]
Code: 00 48 89 fb 48 89 f1 48 8b 40 30 4c 8b 60 e8 b8 45 01 00 00 49 83
7c 24 28 00 74 2e 49 8b 74 24 30 48 85 f6 74 24 48 85 c9 75 32 <8b> 16
b8 45 01 00 00 48 63 d2 49 39 d5 74 10 8b 06 48 98 48 89
If an entry goes away while we are polling() it, ctl_table may not exist
anymore.
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
We currently have significant issues with the amount of stack that
allocation in XFS uses, especially in the writeback path. We can
easily consume 4k of stack between mapping the page, manipulating
the bmap btree and allocating blocks from the free list. Not to
mention btree block readahead and other functionality that issues IO
in the allocation path.
As a result, we can no longer fit allocation in the writeback path
in the stack space provided on x86_64. To alleviate this problem,
introduce an allocation workqueue and move all allocations to a
seperate context. This can be easily added as an interposing layer
into xfs_alloc_vextent(), which takes a single argument structure
and does not return until the allocation is complete or has failed.
To do this, add a work structure and a completion to the allocation
args structure. This allows xfs_alloc_vextent to queue the args onto
the workqueue and wait for it to be completed by the worker. This
can be done completely transparently to the caller.
The worker function needs to ensure that it sets and clears the
PF_TRANS flag appropriately as it is being run in an active
transaction context. Work can also be queued in a memory reclaim
context, so a rescuer is needed for the workqueue.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
Sparse identified some unsafe handling of open flags in the xfs open
by handle ioctl code. Update the code to use the correct access
macros to ensure that we handle the open flags correctly.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
To fix the deadlock caused by repeatedly calling xfs_rtfree_extent
- removed xfs_ilock() and xfs_trans_ijoin() from xfs_rtfree_extent(),
instead added asserts that the inode is locked and has an inode_item
attached to it.
- in xfs_bunmapi() when dealing with an inode with the rt flag
call xfs_ilock() and xfs_trans_ijoin() so that the
reference count is bumped on the inode and attached it to the
transaction before calling into xfs_bmap_del_extent, similar to
what we do in xfs_bmap_rtalloc.
Signed-off-by: Kamal Dasu <kdasu.kdev@gmail.com>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Ben Myers <bpm@sgi.com>
xfs_qm_exit() is called in init_xfs_fs().
Signed-off-by: Gerard Snitselaar <dev@snitselaar.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
Merge first batch of patches from Andrew Morton:
"A few misc things and all the MM queue"
* emailed from Andrew Morton <akpm@linux-foundation.org>: (92 commits)
memcg: avoid THP split in task migration
thp: add HPAGE_PMD_* definitions for !CONFIG_TRANSPARENT_HUGEPAGE
memcg: clean up existing move charge code
mm/memcontrol.c: remove unnecessary 'break' in mem_cgroup_read()
mm/memcontrol.c: remove redundant BUG_ON() in mem_cgroup_usage_unregister_event()
mm/memcontrol.c: s/stealed/stolen/
memcg: fix performance of mem_cgroup_begin_update_page_stat()
memcg: remove PCG_FILE_MAPPED
memcg: use new logic for page stat accounting
memcg: remove PCG_MOVE_LOCK flag from page_cgroup
memcg: simplify move_account() check
memcg: remove EXPORT_SYMBOL(mem_cgroup_update_page_stat)
memcg: kill dead prev_priority stubs
memcg: remove PCG_CACHE page_cgroup flag
memcg: let css_get_next() rely upon rcu_read_lock()
cgroup: revert ss_id_lock to spinlock
idr: make idr_get_next() good for rcu_read_lock()
memcg: remove unnecessary thp check in page stat accounting
memcg: remove redundant returns
memcg: enum lru_list lru
...
In ceph_vxattrcb_file_layout(), there is a check to determine
whether a preferred PG should be formatted into the output buffer.
That check assumes that a preferred PG number of 0 indicates "no
preference," but that is wrong. No preference is indicated by a
negative (specifically, -1) PG number.
In addition, if that condition yields true, the preferred value
is formatted into a sized buffer, but the size consumed by the
earlier snprintf() call is not accounted for, opening up the
possibilty of a buffer overrun.
Finally, in ceph_vxattrcb_dir_rctime() where the nanoseconds part of
the time displayed did not include leading 0's, which led to
erroneous (sub-second portion of) time values being shown.
This fixes these three issues:
http://tracker.newdream.net/issues/2155http://tracker.newdream.net/issues/2156http://tracker.newdream.net/issues/2157
Signed-off-by: Alex Elder <elder@dreamhost.com>
Reviewed-by: Sage Weil <sage@newdream.net>
Many ceph-related Boolean options offer the ability to both enable
and disable a feature. For all those that don't offer this, add
a new option so that they do.
Note that ceph_show_options()--which reports mount options currently
in effect--only reports the option if it is different from the
default value.
Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>