2005-04-16 15:20:36 -07:00
/*
 * sysctl.c: General linux system control interface
 *
 * Begun 24 March 1995, Stephen Tweedie
 * Added /proc support, Dec 1995
 * Added bdflush entry and intvec min/max checking, 2/23/96, Tom Dyas.
 * Added hooks for /proc/sys/net (minor, minor patch), 96/4/1, Mike Shaver.
 * Added kernel/java-{interpreter,appletviewer}, 96/5/10, Mike Shaver.
 * Dynamic registration fixes, Stephen Tweedie.
 * Added kswapd-interval, ctrl-alt-del, printk stuff, 1/8/97, Chris Horn.
 * Made sysctl support optional via CONFIG_SYSCTL, 1/10/97, Chris
 *  Horn.
 * Added proc_doulongvec_ms_jiffies_minmax, 09/08/99, Carlos H. Bauer.
 * Added proc_doulongvec_minmax, 09/08/99, Carlos H. Bauer.
 * Changed linked lists to use list.h instead of lists.h, 02/24/00, Bill
 *  Wendling.
 * The list_for_each() macro wasn't appropriate for the sysctl loop.
 *  Removed it and replaced it with older style, 03/23/00, Bill Wendling
 */

#include <linux/module.h>
|
2015-02-22 08:58:50 -08:00
|
|
|
#include <linux/aio.h>
|
2005-04-16 15:20:36 -07:00
|
|
|
#include <linux/mm.h>
|
|
|
|
#include <linux/swap.h>
|
|
|
|
#include <linux/slab.h>
|
|
|
|
#include <linux/sysctl.h>
|
2012-03-28 14:42:50 -07:00
|
|
|
#include <linux/bitmap.h>
|
2010-03-10 15:23:59 -08:00
|
|
|
#include <linux/signal.h>
|
kptr_restrict for hiding kernel pointers from unprivileged users
Add the %pK printk format specifier and the /proc/sys/kernel/kptr_restrict
sysctl.
The %pK format specifier is designed to hide exposed kernel pointers,
specifically via /proc interfaces. Exposing these pointers provides an
easy target for kernel write vulnerabilities, since they reveal the
locations of writable structures containing easily triggerable function
pointers. The behavior of %pK depends on the kptr_restrict sysctl.
If kptr_restrict is set to 0, %pK behaves exactly like the standard %p.
If kptr_restrict is set to 1 (the default) and the current user
(intended to be a reader via seq_printf(), etc.) does not have CAP_SYSLOG
(currently in the LSM tree), kernel pointers using %pK are printed as 0's.
If kptr_restrict is set to 2, kernel pointers using %pK are printed as
0's regardless of privileges. Replacing with 0's was chosen over the
default "(null)", which cannot be parsed by userland %p, which expects
"(nil)".
[akpm@linux-foundation.org: check for IRQ context when !kptr_restrict, save an indent level, s/WARN/WARN_ONCE/]
[akpm@linux-foundation.org: coding-style fixup]
[randy.dunlap@oracle.com: fix kernel/sysctl.c warning]
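The three kptr_restrict settings described in the message above can be
sketched in plain C. This is a user-space illustration only, not the
kernel's vsprintf code: kptr_visible(), kptr_value() and the
has_cap_syslog flag are hypothetical names standing in for the sysctl
value and the capability check on the reader.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper: decide whether a %pK pointer is shown as-is.
 * kptr_restrict: 0 = behave like %p, 1 = hide unless the reader has
 * CAP_SYSLOG, 2 = always hide. has_cap_syslog stands in for the
 * kernel's capability check on the current user.
 */
static bool kptr_visible(int kptr_restrict, bool has_cap_syslog)
{
	assert(kptr_restrict >= 0 && kptr_restrict <= 2);

	switch (kptr_restrict) {
	case 0:
		return true;
	case 1:
		return has_cap_syslog;
	default:
		return false;
	}
}

/* Value that would be formatted: the pointer itself, or 0 when hidden. */
static unsigned long kptr_value(const void *ptr, int kptr_restrict,
				bool has_cap_syslog)
{
	return kptr_visible(kptr_restrict, has_cap_syslog) ?
		(unsigned long)ptr : 0UL;
}
```

With kptr_restrict set to 1, only a reader holding the capability sees
the real address; with 2, nobody does.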
Signed-off-by: Dan Rosenberg <drosenberg@vsecurity.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: James Morris <jmorris@namei.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Thomas Graf <tgraf@infradead.org>
Cc: Eugene Teo <eugeneteo@kernel.org>
Cc: Kees Cook <kees.cook@canonical.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David S. Miller <davem@davemloft.net>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Eric Paris <eparis@parisplace.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-12 16:59:41 -08:00
|
|
|
#include <linux/printk.h>
|
2005-04-16 15:20:36 -07:00
|
|
|
#include <linux/proc_fs.h>
|
V3 file capabilities: alter behavior of cap_setpcap
The non-filesystem capability meaning of CAP_SETPCAP is that a process, p1,
can change the capabilities of another process, p2. This is not the
meaning that was intended for this capability at all, and this
implementation came about purely because, without filesystem capabilities,
there was no way to use capabilities without one process bestowing them on
another.
Since we now have filesystem support for capabilities we can fix the
implementation of CAP_SETPCAP.
The most significant thing about this change is that, with it in effect, no
process can set the capabilities of another process.
The capabilities of a program are set via the capability convolution
rules:
  pI(post-exec) = pI(pre-exec)
  pP(post-exec) = (X(aka cap_bset) & fP) | (pI(post-exec) & fI)
  pE(post-exec) = fE ? pP(post-exec) : 0
at exec() time. As such, the only influence the pre-exec() program can
have on the post-exec() program's capabilities is through the pI
capability set.
The correct implementation for CAP_SETPCAP (and that enabled by this patch)
is that it can be used to add extra pI capabilities to the current process
- to be picked up by subsequent exec()s when the above convolution rules
are applied.
Here is how it works:
Let's say we have a process, p. It has capability sets, pE, pP and pI.
Generally, p can change the value of its own pI to pI' where
  (pI' & ~pI) & ~pP = 0.
That is, any capabilities in pI' that were not present in pI must
be present in pP.
The role of CAP_SETPCAP is basically to permit changes to pI beyond
the above:
  if (pE & CAP_SETPCAP) {
     pI' = anything; /* ie., even (pI' & ~pI) & ~pP != 0 */
  }
This capability is useful for things like login, which (say, via
pam_cap) might want to raise certain inheritable capabilities for use
by the children of the logged-in user's shell, but those capabilities
are not useful to or needed by the login program itself.
One such use might be to limit who can run ping. You set the
capabilities of the 'ping' program to be "= cap_net_raw+i", and then
only shells that have (pI & CAP_NET_RAW) will be able to run
it. Without CAP_SETPCAP implemented as described above, login(pam_cap)
would have to also have (pP & CAP_NET_RAW) in order to raise this
capability and pass it on through the inheritable set.
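The convolution rules and the pI check above can be written out as a
small, self-contained C sketch. It is an illustration under stated
assumptions, not the kernel's commoncap code: the struct, the function
names and the SKETCH_CAP_* bit positions are all hypothetical, and a
capability set is modelled as a plain 64-bit mask.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t cap_set;	/* one bit per capability (illustrative) */

struct caps {
	cap_set inheritable;	/* pI */
	cap_set permitted;	/* pP */
	cap_set effective;	/* pE */
};

/* Illustrative bit positions, not the kernel's CAP_* numbering. */
#define SKETCH_CAP_SETPCAP	(1ULL << 0)
#define SKETCH_CAP_NET_RAW	(1ULL << 1)

/*
 * Apply the convolution rules at exec() time. bset is X (cap_bset),
 * fI and fP are the file's sets, fE is the file's effective flag.
 */
static struct caps caps_after_exec(struct caps pre, cap_set bset,
				   cap_set fI, cap_set fP, bool fE)
{
	struct caps post;

	post.inheritable = pre.inheritable;
	post.permitted = (bset & fP) | (post.inheritable & fI);
	post.effective = fE ? post.permitted : 0;
	return post;
}

/* May a process with sets p change its inheritable set to new_pI? */
static bool may_set_inheritable(struct caps p, cap_set new_pI)
{
	if (p.effective & SKETCH_CAP_SETPCAP)
		return true;	/* pI' = anything */
	/* Otherwise every newly added bit must already be in pP. */
	return ((new_pI & ~p.inheritable) & ~p.permitted) == 0;
}
```

In the ping example, a login process whose pE contains the SETPCAP bit
may add the NET_RAW bit to pI without holding it in pP, and a later
exec() of a file with fI = NET_RAW then yields pP = NET_RAW.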
Signed-off-by: Andrew Morgan <morgan@kernel.org>
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: James Morris <jmorris@namei.org>
Cc: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-18 03:05:59 -07:00
|
|
|
#include <linux/security.h>
|
2005-04-16 15:20:36 -07:00
|
|
|
#include <linux/ctype.h>
|
2008-04-04 00:51:41 +02:00
|
|
|
#include <linux/kmemcheck.h>
|
2012-07-30 14:42:48 -07:00
|
|
|
#include <linux/kmemleak.h>
|
2007-07-17 04:03:45 -07:00
|
|
|
#include <linux/fs.h>
|
2005-04-16 15:20:36 -07:00
|
|
|
#include <linux/init.h>
|
|
|
|
#include <linux/kernel.h>
|
2005-11-11 05:33:52 +01:00
|
|
|
#include <linux/kobject.h>
|
2005-08-16 02:18:02 -03:00
|
|
|
#include <linux/net.h>
|
2005-04-16 15:20:36 -07:00
|
|
|
#include <linux/sysrq.h>
|
|
|
|
#include <linux/highuid.h>
|
|
|
|
#include <linux/writeback.h>
|
2009-09-22 16:18:09 +02:00
|
|
|
#include <linux/ratelimit.h>
|
2010-05-24 14:32:28 -07:00
|
|
|
#include <linux/compaction.h>
|
2005-04-16 15:20:36 -07:00
|
|
|
#include <linux/hugetlb.h>
|
|
|
|
#include <linux/initrd.h>
|
2008-04-29 01:01:32 -07:00
|
|
|
#include <linux/key.h>
|
2005-04-16 15:20:36 -07:00
|
|
|
#include <linux/times.h>
|
|
|
|
#include <linux/limits.h>
|
|
|
|
#include <linux/dcache.h>
|
2010-01-20 22:27:56 +02:00
|
|
|
#include <linux/dnotify.h>
|
2005-04-16 15:20:36 -07:00
|
|
|
#include <linux/syscalls.h>
|
2008-07-23 21:27:03 -07:00
|
|
|
#include <linux/vmstat.h>
|
2006-02-20 18:27:58 -08:00
|
|
|
#include <linux/nfs_fs.h>
|
|
|
|
#include <linux/acpi.h>
|
2007-07-17 18:37:02 -07:00
|
|
|
#include <linux/reboot.h>
|
2008-05-12 21:20:43 +02:00
|
|
|
#include <linux/ftrace.h>
|
perf: Do the big rename: Performance Counters -> Performance Events
Bye-bye Performance Counters, welcome Performance Events!
In the past few months the perfcounters subsystem has outgrown its
initial role of counting hardware events, and has become (and is
becoming) a much broader generic event enumeration, reporting, logging,
monitoring, analysis facility.
Naming its core object 'perf_counter' and naming the subsystem
'perfcounters' has become more and more of a misnomer. With pending
code like hw-breakpoints support the 'counter' name is less and
less appropriate.
All in all, we've decided to rename the subsystem to 'performance
events' and to propagate this rename through all fields, variables
and API names. (in an ABI compatible fashion)
The word 'event' is also a bit shorter than 'counter' - which makes
it slightly more convenient to write/handle as well.
Thanks go to Stephane Eranian who first observed this misnomer and
suggested a rename.
User-space tooling and ABI compatibility is not affected - this patch
should be function-invariant. (Also, defconfigs were not touched to
keep the size down.)
This patch has been generated via the following script:
  FILES=$(find * -type f | grep -vE 'oprofile|[^K]config')
  sed -i \
    -e 's/PERF_EVENT_/PERF_RECORD_/g' \
    -e 's/PERF_COUNTER/PERF_EVENT/g' \
    -e 's/perf_counter/perf_event/g' \
    -e 's/nb_counters/nb_events/g' \
    -e 's/swcounter/swevent/g' \
    -e 's/tpcounter_event/tp_event/g' \
    $FILES
  for N in $(find . -name perf_counter.[ch]); do
    M=$(echo $N | sed 's/perf_counter/perf_event/g')
    mv $N $M
  done
  FILES=$(find . -name perf_event.*)
  sed -i \
    -e 's/COUNTER_MASK/REG_MASK/g' \
    -e 's/COUNTER/EVENT/g' \
    -e 's/\<event\>/event_id/g' \
    -e 's/counter/event/g' \
    -e 's/Counter/Event/g' \
    $FILES
... to keep it as correct as possible. This script can also be
used by anyone who has pending perfcounters patches - it converts
a Linux kernel tree over to the new naming. We tried to time this
change to the point in time where the amount of pending patches
is the smallest: the end of the merge window.
Namespace clashes were fixed up in a preparatory patch - and some
stylistic fallout will be fixed up in a subsequent patch.
( NOTE: 'counters' are still the proper terminology when we deal
with hardware registers - and these sed scripts are a bit
over-eager in renaming them. I've undone some of that, but
in case there's something left where 'counter' would be
better than 'event' we can undo that on an individual basis
instead of touching an otherwise nicely automated patch. )
Suggested-by: Stephane Eranian <eranian@google.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Paul Mackerras <paulus@samba.org>
Reviewed-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <linux-arch@vger.kernel.org>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-09-21 12:02:48 +02:00
|
|
|
#include <linux/perf_event.h>
|
2010-02-25 08:34:15 -05:00
|
|
|
#include <linux/kprobes.h>
|
2010-05-19 21:03:16 +02:00
|
|
|
#include <linux/pipe_fs_i.h>
|
2010-08-09 17:18:56 -07:00
|
|
|
#include <linux/oom.h>
|
2011-04-01 17:07:50 -04:00
|
|
|
#include <linux/kmod.h>
|
2011-10-31 17:11:20 -07:00
|
|
|
#include <linux/capability.h>
|
2012-02-13 03:58:52 +00:00
|
|
|
#include <linux/binfmts.h>
|
2013-02-07 09:46:59 -06:00
|
|
|
#include <linux/sched/sysctl.h>
|
kexec: add sysctl to disable kexec_load
For general-purpose (i.e. distro) kernel builds it makes sense to build
with CONFIG_KEXEC to allow end users to choose what kind of things they
want to do with kexec. However, in the face of trying to lock down a
system with such a kernel, there needs to be a way to disable kexec_load
(much like module loading can be disabled). Without this, it is too easy
for the root user to modify kernel memory even when CONFIG_STRICT_DEVMEM
and modules_disabled are set. With this change, it is still possible to
load an image for use later, then disable kexec_load so the image (or lack
of image) can't be altered.
This is intended for use in environments where "perfect"
enforcement is hard. Without a verified boot, along with verified
modules, and along with verified kexec, this is trying to give a system a
better chance to defend itself (or at least grow the window of
discoverability) against attack in the face of a privilege escalation.
In my mind, I consider several boot scenarios:
1) Verified boot of read-only verified root fs loading fd-based
verification of kexec images.
2) Secure boot of writable root fs loading signed kexec images.
3) Regular boot loading kexec (e.g. kcrash) image early and locking it.
4) Regular boot with no control of kexec image at all.
1 and 2 don't exist yet, but will soon once the verified kexec series has
landed. 4 is the state of things now. The gap between 2 and 4 is too
large, so this change creates scenario 3, a middle-ground above 4 when 2
and 1 are not possible for a system.
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23 15:55:59 -08:00
|
|
|
#include <linux/kexec.h>
|
bpf: enable non-root eBPF programs
In order to let unprivileged users load and execute eBPF programs,
teach the verifier to prevent pointer leaks.
The verifier will prevent
- any arithmetic on pointers
  (except R10+Imm which is used to compute stack addresses)
- comparison of pointers
  (except if (map_value_ptr == 0) ... )
- passing pointers to helper functions
- indirectly passing pointers in stack to helper functions
- returning pointer from bpf program
- storing pointers into ctx or maps
Spill/fill of pointers into stack is allowed, but mangling
of pointers stored in the stack or reading them byte by byte is not.
Within bpf programs the pointers do exist, since programs need to
be able to access maps, pass skb pointer to LD_ABS insns, etc
but programs cannot pass such pointer values to the outside
or obfuscate them.
Only allow BPF_PROG_TYPE_SOCKET_FILTER unprivileged programs,
so that socket filters (tcpdump), af_packet (quic acceleration)
and future kcm can use it.
tracing and tc cls/act program types still require root permissions,
since tracing actually needs to be able to see all kernel pointers
and tc is for root only.
For example, the following unprivileged socket filter program is allowed:
  int bpf_prog1(struct __sk_buff *skb)
  {
    u32 index = load_byte(skb, ETH_HLEN + offsetof(struct iphdr, protocol));
    u64 *value = bpf_map_lookup_elem(&my_map, &index);
    if (value)
      *value += skb->len;
    return 0;
  }
but the following program is not:
  int bpf_prog1(struct __sk_buff *skb)
  {
    u32 index = load_byte(skb, ETH_HLEN + offsetof(struct iphdr, protocol));
    u64 *value = bpf_map_lookup_elem(&my_map, &index);
    if (value)
      *value += (u64) skb;
    return 0;
  }
since it would leak the kernel address into the map.
Unprivileged socket filter bpf programs have access to the
following helper functions:
- map lookup/update/delete (but they cannot store kernel pointers into them)
- get_random (it's already exposed to unprivileged user space)
- get_smp_processor_id
- tail_call into another socket filter program
- ktime_get_ns
The feature is controlled by sysctl kernel.unprivileged_bpf_disabled.
This toggle defaults to off (0), but can be set true (1). Once true,
bpf programs and maps cannot be accessed from an unprivileged process,
and the toggle cannot be set back to false.
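The one-way nature of this toggle can be illustrated with a minimal
user-space sketch. It is an assumption-laden model, not the handler the
kernel uses for kernel.unprivileged_bpf_disabled: sketch_set_one_way()
is a hypothetical name and the error codes are illustrative choices.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical one-way toggle: the flag may go from 0 to 1, but once
 * it is 1 an attempt to write 0 is refused. Values other than 0 and 1
 * are rejected outright. Returns 0 on success or a negative errno.
 */
static int sketch_set_one_way(int *flag, int new_value)
{
	assert(flag != NULL);

	if (new_value != 0 && new_value != 1)
		return -EINVAL;
	if (*flag == 1 && new_value == 0)
		return -EPERM;	/* cannot be set back to false */
	*flag = new_value;
	return 0;
}
```

Once the flag has been set, the only way to clear it in this model is
to start over with fresh state, which mirrors needing a reboot.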
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-07 22:23:21 -07:00
|
|
|
#include <linux/bpf.h>
|
2005-04-16 15:20:36 -07:00
|
|
|
|
|
|
|
#include <asm/uaccess.h>
|
|
|
|
#include <asm/processor.h>
|
|
|
|
|
2006-09-30 01:47:55 +02:00
|
|
|
#ifdef CONFIG_X86
|
|
|
|
#include <asm/nmi.h>
|
2006-12-07 02:14:11 +01:00
|
|
|
#include <asm/stacktrace.h>
|
2008-01-30 13:30:05 +01:00
|
|
|
#include <asm/io.h>
|
2006-09-30 01:47:55 +02:00
|
|
|
#endif
|
2012-03-28 18:30:03 +01:00
|
|
|
#ifdef CONFIG_SPARC
|
|
|
|
#include <asm/setup.h>
|
|
|
|
#endif
|
2010-03-10 15:24:08 -08:00
|
|
|
#ifdef CONFIG_BSD_PROCESS_ACCT
|
|
|
|
#include <linux/acct.h>
|
|
|
|
#endif
|
2010-03-10 15:24:09 -08:00
|
|
|
#ifdef CONFIG_RT_MUTEXES
|
|
|
|
#include <linux/rtmutex.h>
|
|
|
|
#endif
|
2010-03-10 15:24:10 -08:00
|
|
|
#if defined(CONFIG_PROVE_LOCKING) || defined(CONFIG_LOCK_STAT)
|
|
|
|
#include <linux/lockdep.h>
|
|
|
|
#endif
|
2010-03-10 15:24:07 -08:00
|
|
|
#ifdef CONFIG_CHR_DEV_SG
|
|
|
|
#include <scsi/sg.h>
|
|
|
|
#endif
|
2006-09-30 01:47:55 +02:00
|
|
|
|
2010-05-07 17:11:44 -04:00
|
|
|
#ifdef CONFIG_LOCKUP_DETECTOR
|
2010-02-12 17:19:19 -05:00
|
|
|
#include <linux/nmi.h>
|
|
|
|
#endif
|
|
|
|
|
2005-04-16 15:20:36 -07:00
|
|
|
#if defined(CONFIG_SYSCTL)
|
|
|
|
|
|
|
|
/* External variables not in a header file. */
|
2005-06-23 00:09:43 -07:00
|
|
|
extern int suid_dumpable;
|
2012-10-04 17:15:23 -07:00
|
|
|
#ifdef CONFIG_COREDUMP
|
|
|
|
extern int core_uses_pid;
|
2005-04-16 15:20:36 -07:00
|
|
|
extern char core_pattern[];
|
2009-09-23 15:56:56 -07:00
|
|
|
extern unsigned int core_pipe_limit;
|
2012-10-04 17:15:23 -07:00
|
|
|
#endif
|
2005-04-16 15:20:36 -07:00
|
|
|
extern int pid_max;
|
2011-09-01 15:26:50 -04:00
|
|
|
extern int extra_free_kbytes;
|
2005-04-16 15:20:36 -07:00
|
|
|
extern int pid_max_min, pid_max_max;
|
2006-01-08 01:00:40 -08:00
|
|
|
extern int percpu_pagelist_fraction;
|
2006-06-26 13:56:52 +02:00
|
|
|
extern int compat_log;
|
2008-01-25 21:08:34 +01:00
|
|
|
extern int latencytop_enabled;
|
2008-05-10 10:08:32 -04:00
|
|
|
extern int sysctl_nr_open_min, sysctl_nr_open_max;
|
2009-01-08 12:04:47 +00:00
|
|
|
#ifndef CONFIG_MMU
|
|
|
|
extern int sysctl_nr_trim_pages;
|
|
|
|
#endif
|
2005-04-16 15:20:36 -07:00
|
|
|
|
2007-10-16 23:26:09 -07:00
|
|
|
/* Constants used for minimum and maximum */
|
2010-05-07 17:11:46 -04:00
|
|
|
#ifdef CONFIG_LOCKUP_DETECTOR
|
2007-10-16 23:26:09 -07:00
|
|
|
static int sixty = 60;
|
|
|
|
#endif
|
|
|
|
|
2014-01-20 17:34:13 +00:00
|
|
|
static int __maybe_unused neg_one = -1;
|
|
|
|
|
2007-10-16 23:26:09 -07:00
|
|
|
static int zero;
|
2009-04-06 13:38:46 -07:00
|
|
|
static int __maybe_unused one = 1;
|
|
|
|
static int __maybe_unused two = 2;
|
2016-08-31 16:54:12 -07:00
|
|
|
static int __maybe_unused three = 3;
|
2014-04-03 14:48:19 -07:00
|
|
|
static int __maybe_unused four = 4;
|
2009-02-11 13:04:23 -08:00
|
|
|
static unsigned long one_ul = 1;
|
2007-10-16 23:26:09 -07:00
|
|
|
static int one_hundred = 100;
|
2009-09-22 16:43:33 -07:00
|
|
|
#ifdef CONFIG_PRINTK
|
|
|
|
static int ten_thousand = 10000;
|
|
|
|
#endif
|
2016-09-21 14:51:51 -07:00
|
|
|
#ifdef CONFIG_SCHED_HMP
|
|
|
|
static int one_thousand = 1000;
|
|
|
|
#endif
|
2007-10-16 23:26:09 -07:00
|
|
|
|
2009-04-30 15:08:57 -07:00
|
|
|
/* this is needed for the proc_doulongvec_minmax of vm_dirty_bytes */
|
|
|
|
static unsigned long dirty_bytes_min = 2 * PAGE_SIZE;
|
|
|
|
|
2005-04-16 15:20:36 -07:00
|
|
|
/* this is needed for the proc_dointvec_minmax for [fs_]overflow UID and GID */
|
|
|
|
static int maxolduid = 65535;
|
|
|
|
static int minolduid;
|
|
|
|
|
|
|
|
static int ngroups_max = NGROUPS_MAX;
|
2011-10-31 17:11:20 -07:00
|
|
|
static const int cap_last_cap = CAP_LAST_CAP;
|
2005-04-16 15:20:36 -07:00
|
|
|
|
2014-04-07 15:38:57 -07:00
|
|
|
/* this is needed for proc_doulongvec_minmax of sysctl_hung_task_timeout_secs */
|
|
|
|
#ifdef CONFIG_DETECT_HUNG_TASK
|
|
|
|
static unsigned long hung_task_timeout_max = (LONG_MAX/HZ);
|
|
|
|
#endif
|
|
|
|
|
2010-02-25 20:28:57 -05:00
|
|
|
#ifdef CONFIG_INOTIFY_USER
|
|
|
|
#include <linux/inotify.h>
|
|
|
|
#endif
|
2008-09-11 23:29:54 -07:00
|
|
|
#ifdef CONFIG_SPARC
|
2005-04-16 15:20:36 -07:00
|
|
|
#endif
|
|
|
|
|
|
|
|
#ifdef __hppa__
|
|
|
|
extern int pwrsw_enabled;
|
2013-01-18 15:12:24 +05:30
|
|
|
#endif
|
|
|
|
|
|
|
|
#ifdef CONFIG_SYSCTL_ARCH_UNALIGN_ALLOW
|
2005-04-16 15:20:36 -07:00
|
|
|
extern int unaligned_enabled;
|
|
|
|
#endif
|
|
|
|
|
2006-02-28 09:42:23 -08:00
|
|
|
#ifdef CONFIG_IA64
|
2009-01-15 10:38:56 -08:00
|
|
|
extern int unaligned_dump_stack;
|
2006-02-28 09:42:23 -08:00
|
|
|
#endif
|
|
|
|
|
2013-01-09 20:06:28 +05:30
|
|
|
#ifdef CONFIG_SYSCTL_ARCH_UNALIGN_NO_WARN
|
|
|
|
extern int no_unaligned_warning;
|
|
|
|
#endif
|
|
|
|
|
2006-10-19 23:28:34 -07:00
|
|
|
#ifdef CONFIG_PROC_SYSCTL
|
sysctl: allow for strict write position handling
When writing to a sysctl string, each write, regardless of VFS position,
begins writing the string from the start. This means the contents of
the last write to the sysctl controls the string contents instead of the
first:
  open("/proc/sys/kernel/modprobe", O_WRONLY) = 1
  write(1, "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"..., 4096) = 4096
  write(1, "/bin/true", 9) = 9
  close(1) = 0
  $ cat /proc/sys/kernel/modprobe
  /bin/true
Expected behaviour would be to have the sysctl be "AAAA..." capped at
maxlen (in this case KMOD_PATH_LEN: 256), instead of truncating to the
contents of the second write. Similarly, multiple short writes would
not append to the sysctl.
The old behavior is unlike regular POSIX files enough that doing audits
of software that interacts with sysctls can end up in unexpected or
dangerous situations. For example, "as long as the input starts with a
trusted path" turns out to be an insufficient filter, as what must also
happen is for the input to be entirely contained in a single write
syscall -- not a common consideration, especially for high level tools.
This provides kernel.sysctl_writes_strict as a way to make this behavior
act in a less surprising manner for strings, and disallows non-zero file
position when writing numeric sysctls (similar to what is already done
when reading from non-zero file positions). For now, the default (0) is
to warn about non-zero file position use, but retain the legacy
behavior. Setting this to -1 disables the warning, and setting this to
1 enables the file position respecting behavior.
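The difference between the legacy and the position-respecting string
write can be modelled in user space. The sketch below is a simplified
model, not the kernel's string handler: sketch_string_write() is a
hypothetical name, the buffer stands in for the sysctl's backing
string, and details such as stopping at a newline are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Simulated string sysctl write. data/maxlen model the sysctl's
 * NUL-terminated backing buffer and *pos models the file position.
 * When strict is nonzero the write lands at *pos; otherwise it always
 * starts at offset 0 (the legacy behaviour). Returns the number of
 * bytes consumed from buf.
 */
static size_t sketch_string_write(char *data, size_t maxlen, size_t *pos,
				  const char *buf, size_t count, int strict)
{
	size_t start = 0, n;

	assert(maxlen > 0);
	if (strict) {
		size_t cur = strlen(data);

		if (*pos > cur) {	/* past the end: leave data alone */
			*pos += count;
			return count;
		}
		start = *pos;
	}
	/* Copy what fits, always leaving room for the terminating NUL. */
	n = count;
	if (n > maxlen - 1 - start)
		n = maxlen - 1 - start;
	memcpy(data + start, buf, n);
	data[start + n] = '\0';
	*pos += count;
	return count;
}
```

Replaying the two writes from the trace above against a 256-byte
buffer leaves "/bin/true" in legacy mode and 255 'A' characters in
strict mode, while two short strict writes of "/bin" and "/true"
append to give "/bin/true".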
[akpm@linux-foundation.org: fix build]
[akpm@linux-foundation.org: move misplaced hunk, per Randy]
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06 14:37:19 -07:00
|
|
|
|
|
|
|
#define SYSCTL_WRITES_LEGACY -1
|
|
|
|
#define SYSCTL_WRITES_WARN 0
|
|
|
|
#define SYSCTL_WRITES_STRICT 1
|
|
|
|
|
2016-01-20 15:00:45 -08:00
|
|
|
static int sysctl_writes_strict = SYSCTL_WRITES_STRICT;
|
|
|
|
|
2009-09-23 15:57:19 -07:00
|
|
|
static int proc_do_cad_pid(struct ctl_table *table, int write,
|
2006-10-02 02:19:00 -07:00
|
|
|
void __user *buffer, size_t *lenp, loff_t *ppos);
|
2009-09-23 15:57:19 -07:00
|
|
|
static int proc_taint(struct ctl_table *table, int write,
|
2007-02-10 01:45:24 -08:00
|
|
|
void __user *buffer, size_t *lenp, loff_t *ppos);
|
2006-10-19 23:28:34 -07:00
|
|
|
#endif
|
2006-10-02 02:19:00 -07:00
|
|
|
|
2011-03-23 16:43:11 -07:00
|
|
|
#ifdef CONFIG_PRINTK
|
2012-04-04 11:40:19 -07:00
|
|
|
static int proc_dointvec_minmax_sysadmin(struct ctl_table *table, int write,
|
2011-03-23 16:43:11 -07:00
|
|
|
void __user *buffer, size_t *lenp, loff_t *ppos);
|
|
|
|
#endif
|
|
|
|
|
2012-07-30 14:39:18 -07:00
|
|
|
static int proc_dointvec_minmax_coredump(struct ctl_table *table, int write,
|
|
|
|
void __user *buffer, size_t *lenp, loff_t *ppos);
|
2012-10-04 17:15:23 -07:00
|
|
|
#ifdef CONFIG_COREDUMP
|
2012-07-30 14:39:18 -07:00
|
|
|
static int proc_dostring_coredump(struct ctl_table *table, int write,
|
|
|
|
void __user *buffer, size_t *lenp, loff_t *ppos);
|
2012-10-04 17:15:23 -07:00
|
|
|
#endif
|
2012-07-30 14:39:18 -07:00
|
|
|
|
2010-03-21 22:31:26 -07:00
|
|
|
#ifdef CONFIG_MAGIC_SYSRQ
|
2011-01-24 09:31:38 -08:00
|
|
|
/* Note: sysrq code uses its own private copy */
|
2013-10-07 01:05:46 +01:00
|
|
|
static int __sysrq_enabled = CONFIG_MAGIC_SYSRQ_DEFAULT_ENABLE;
|
2010-03-21 22:31:26 -07:00
|
|
|
|
2014-06-06 14:38:08 -07:00
|
|
|
static int sysrq_sysctl_handler(struct ctl_table *table, int write,
|
2010-03-21 22:31:26 -07:00
				void __user *buffer, size_t *lenp,
				loff_t *ppos)
{
	int error;

	error = proc_dointvec(table, write, buffer, lenp, ppos);
	if (error)
		return error;

	if (write)
		sysrq_toggle_support(__sysrq_enabled);

	return 0;
}

#endif
|
|
|
|
|
2007-10-18 03:05:22 -07:00
|
|
|
static struct ctl_table kern_table[];
|
|
|
|
static struct ctl_table vm_table[];
|
|
|
|
static struct ctl_table fs_table[];
|
|
|
|
static struct ctl_table debug_table[];
|
|
|
|
static struct ctl_table dev_table[];
|
|
|
|
extern struct ctl_table random_table[];
|
epoll: introduce resource usage limits
It has been thought that the per-user file descriptors limit would also
limit the resources that a normal user can request via the epoll
interface. Vegard Nossum reported a very simple program (a modified
version attached) that lets a normal user request a pretty large
amount of kernel memory, well within its maximum number of fds. To
solve this problem, default limits are now imposed, and /proc based
configuration has been introduced. A new directory has been created,
named /proc/sys/fs/epoll/ and inside there, there are two configuration
points:
max_user_instances = Maximum number of devices - per user
max_user_watches = Maximum number of "watched" fds - per user
The current default for "max_user_watches" limits the memory used by epoll
to store "watches" to 1/32 of the amount of low RAM. As an example, a
256MB 32bit machine will have "max_user_watches" set to roughly 90000.
That should be enough to not break existing heavy epoll users. The
default value for "max_user_instances" is set to 128, that should be
enough too.
This also changes the userspace, because a new error code can now come out
from EPOLL_CTL_ADD (-ENOSPC). The EMFILE from epoll_create() was already
listed, so that should be ok.
[akpm@linux-foundation.org: use get_current_user()]
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: <stable@kernel.org>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Reported-by: Vegard Nossum <vegardno@ifi.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-12-01 13:13:55 -08:00
|
|
|
#ifdef CONFIG_EPOLL
|
|
|
|
extern struct ctl_table epoll_table[];
|
|
|
|
#endif
|
2005-04-16 15:20:36 -07:00
|
|
|
|
|
|
|
#ifdef HAVE_ARCH_PICK_MMAP_LAYOUT
|
|
|
|
int sysctl_legacy_va_layout;
|
|
|
|
#endif
|
|
|
|
|
|
|
|
/* The default sysctl tables: */
|
|
|
|
|
2012-01-06 03:34:20 -08:00
|
|
|
static struct ctl_table sysctl_base_table[] = {
|
2005-04-16 15:20:36 -07:00
	{
		.procname	= "kernel",
		.mode		= 0555,
		.child		= kern_table,
	},
	{
		.procname	= "vm",
		.mode		= 0555,
		.child		= vm_table,
	},
	{
		.procname	= "fs",
		.mode		= 0555,
		.child		= fs_table,
	},
	{
		.procname	= "debug",
		.mode		= 0555,
		.child		= debug_table,
	},
	{
		.procname	= "dev",
		.mode		= 0555,
		.child		= dev_table,
	},
2009-04-03 02:30:53 -07:00
|
|
|
{ }
|
2005-04-16 15:20:36 -07:00
|
|
|
};
|
|
|
|
|
2007-07-09 18:52:00 +02:00
|
|
|
#ifdef CONFIG_SCHED_DEBUG
|
2007-12-18 15:21:13 +01:00
|
|
|
static int min_sched_granularity_ns = 100000; /* 100 usecs */
|
|
|
|
static int max_sched_granularity_ns = NSEC_PER_SEC; /* 1 second */
|
|
|
|
static int min_wakeup_granularity_ns; /* 0 usecs */
|
|
|
|
static int max_wakeup_granularity_ns = NSEC_PER_SEC; /* 1 second */
|
2012-10-25 14:16:43 +02:00
|
|
|
#ifdef CONFIG_SMP
|
2009-11-30 12:16:47 +01:00
|
|
|
static int min_sched_tunable_scaling = SCHED_TUNABLESCALING_NONE;
|
|
|
|
static int max_sched_tunable_scaling = SCHED_TUNABLESCALING_END-1;
|
2012-10-25 14:16:43 +02:00
|
|
|
#endif /* CONFIG_SMP */
|
|
|
|
#endif /* CONFIG_SCHED_DEBUG */
|
2007-07-09 18:52:00 +02:00
|
|
|
|
2010-05-24 14:32:31 -07:00
|
|
|
#ifdef CONFIG_COMPACTION
|
|
|
|
static int min_extfrag_threshold;
|
|
|
|
static int max_extfrag_threshold = 1000;
|
|
|
|
#endif
|
|
|
|
|
2007-10-18 03:05:22 -07:00
|
|
|
static struct ctl_table kern_table[] = {
|
2009-09-09 15:41:37 +02:00
|
|
|
{
|
|
|
|
.procname = "sched_child_runs_first",
|
|
|
|
.data = &sysctl_sched_child_runs_first,
|
|
|
|
.maxlen = sizeof(unsigned int),
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2012-10-24 15:00:20 -07:00
|
|
|
},
|
2016-07-28 19:18:08 -07:00
|
|
|
#ifdef CONFIG_SCHED_HMP
	{
		.procname = "sched_freq_reporting_policy",
		.data = &sysctl_sched_freq_reporting_policy,
		.maxlen = sizeof(unsigned int),
		.mode = 0644,
		.proc_handler = proc_dointvec_minmax,
		.extra1 = &zero,
	},
	{
		.procname = "sched_freq_inc_notify",
		.data = &sysctl_sched_freq_inc_notify,
		.maxlen = sizeof(unsigned int),
		.mode = 0644,
		.proc_handler = proc_dointvec_minmax,
		.extra1 = &zero,
	},
	{
		.procname = "sched_freq_dec_notify",
		.data = &sysctl_sched_freq_dec_notify,
		.maxlen = sizeof(unsigned int),
		.mode = 0644,
		.proc_handler = proc_dointvec_minmax,
		.extra1 = &zero,
	},
	{
		.procname = "sched_cpu_high_irqload",
		.data = &sysctl_sched_cpu_high_irqload,
		.maxlen = sizeof(unsigned int),
		.mode = 0644,
		.proc_handler = proc_dointvec,
	},
	{
		.procname = "sched_ravg_hist_size",
		.data = &sysctl_sched_ravg_hist_size,
		.maxlen = sizeof(unsigned int),
		.mode = 0644,
		.proc_handler = sched_window_update_handler,
	},
	{
		.procname = "sched_window_stats_policy",
		.data = &sysctl_sched_window_stats_policy,
		.maxlen = sizeof(unsigned int),
		.mode = 0644,
		.proc_handler = sched_window_update_handler,
	},
	{
		.procname = "sched_spill_load",
		.data = &sysctl_sched_spill_load_pct,
		.maxlen = sizeof(unsigned int),
		.mode = 0644,
		.proc_handler = sched_hmp_proc_update_handler,
		.extra1 = &zero,
		.extra2 = &one_hundred,
	},
	{
		.procname = "sched_spill_nr_run",
		.data = &sysctl_sched_spill_nr_run,
		.maxlen = sizeof(unsigned int),
		.mode = 0644,
		.proc_handler = proc_dointvec_minmax,
		.extra1 = &zero,
	},
	{
		.procname = "sched_upmigrate",
		.data = &sysctl_sched_upmigrate_pct,
		.maxlen = sizeof(unsigned int),
		.mode = 0644,
		.proc_handler = sched_hmp_proc_update_handler,
		.extra1 = &zero,
		.extra2 = &one_hundred,
	},
	{
		.procname = "sched_downmigrate",
		.data = &sysctl_sched_downmigrate_pct,
		.maxlen = sizeof(unsigned int),
		.mode = 0644,
		.proc_handler = sched_hmp_proc_update_handler,
		.extra1 = &zero,
		.extra2 = &one_hundred,
	},
	{
		.procname = "sched_group_upmigrate",
		.data = &sysctl_sched_group_upmigrate_pct,
		.maxlen = sizeof(unsigned int),
		.mode = 0644,
		.proc_handler = sched_hmp_proc_update_handler,
		.extra1 = &zero,
	},
	{
		.procname = "sched_group_downmigrate",
		.data = &sysctl_sched_group_downmigrate_pct,
		.maxlen = sizeof(unsigned int),
		.mode = 0644,
		.proc_handler = sched_hmp_proc_update_handler,
		.extra1 = &zero,
	},
	{
		.procname = "sched_init_task_load",
		.data = &sysctl_sched_init_task_load_pct,
		.maxlen = sizeof(unsigned int),
		.mode = 0644,
		.proc_handler = sched_hmp_proc_update_handler,
		.extra1 = &zero,
		.extra2 = &one_hundred,
	},
	{
		.procname = "sched_select_prev_cpu_us",
		.data = &sysctl_sched_select_prev_cpu_us,
		.maxlen = sizeof(unsigned int),
		.mode = 0644,
		.proc_handler = sched_hmp_proc_update_handler,
		.extra1 = &zero,
	},
	{
		.procname = "sched_restrict_cluster_spill",
		.data = &sysctl_sched_restrict_cluster_spill,
		.maxlen = sizeof(unsigned int),
		.mode = 0644,
		.proc_handler = proc_dointvec_minmax,
		.extra1 = &zero,
		.extra2 = &one,
	},
	{
		.procname = "sched_small_wakee_task_load",
		.data = &sysctl_sched_small_wakee_task_load_pct,
		.maxlen = sizeof(unsigned int),
		.mode = 0644,
		.proc_handler = sched_hmp_proc_update_handler,
		.extra1 = &zero,
		.extra2 = &one_hundred,
	},
	{
		.procname = "sched_big_waker_task_load",
		.data = &sysctl_sched_big_waker_task_load_pct,
		.maxlen = sizeof(unsigned int),
		.mode = 0644,
		.proc_handler = sched_hmp_proc_update_handler,
		.extra1 = &zero,
		.extra2 = &one_hundred,
	},
	{
		.procname = "sched_prefer_sync_wakee_to_waker",
		.data = &sysctl_sched_prefer_sync_wakee_to_waker,
		.maxlen = sizeof(unsigned int),
		.mode = 0644,
		.proc_handler = proc_dointvec_minmax,
		.extra1 = &zero,
		.extra2 = &one,
	},
	{
		.procname = "sched_enable_thread_grouping",
		.data = &sysctl_sched_enable_thread_grouping,
		.maxlen = sizeof(unsigned int),
		.mode = 0644,
		.proc_handler = proc_dointvec,
	},
	{
		.procname = "sched_new_task_windows",
		.data = &sysctl_sched_new_task_windows,
		.maxlen = sizeof(unsigned int),
		.mode = 0644,
		.proc_handler = sched_window_update_handler,
	},
	{
		.procname = "sched_pred_alert_freq",
		.data = &sysctl_sched_pred_alert_freq,
		.maxlen = sizeof(unsigned int),
		.mode = 0644,
		.proc_handler = proc_dointvec_minmax,
		.extra1 = &zero,
	},
	{
		.procname = "sched_freq_aggregate",
		.data = &sysctl_sched_freq_aggregate,
		.maxlen = sizeof(unsigned int),
		.mode = 0644,
		.proc_handler = sched_window_update_handler,
	},
	{
		.procname = "sched_freq_aggregate_threshold",
		.data = &sysctl_sched_freq_aggregate_threshold_pct,
		.maxlen = sizeof(unsigned int),
		.mode = 0644,
		.proc_handler = sched_hmp_proc_update_handler,
		.extra1 = &zero,
		/*
		 * Special handling for sched_freq_aggregate_threshold_pct
		 * which can be greater than 100. Use 1000 as an upper bound
		 * value which works for all practical use cases.
		 */
		.extra2 = &one_thousand,
	},
	{
		.procname = "sched_boost",
		.data = &sysctl_sched_boost,
		.maxlen = sizeof(unsigned int),
		.mode = 0644,
		.proc_handler = sched_boost_handler,
		.extra1 = &zero,
		.extra2 = &three,
	},
	{
		.procname = "sched_short_burst_ns",
		.data = &sysctl_sched_short_burst,
		.maxlen = sizeof(unsigned int),
		.mode = 0644,
		.proc_handler = proc_dointvec,
	},
	{
		.procname = "sched_short_sleep_ns",
		.data = &sysctl_sched_short_sleep,
		.maxlen = sizeof(unsigned int),
		.mode = 0644,
		.proc_handler = proc_dointvec,
	},
#endif /* CONFIG_SCHED_HMP */
#ifdef CONFIG_SCHED_DEBUG
	{
		.procname = "sched_min_granularity_ns",
		.data = &sysctl_sched_min_granularity,
		.maxlen = sizeof(unsigned int),
		.mode = 0644,
		.proc_handler = sched_proc_update_handler,
		.extra1 = &min_sched_granularity_ns,
		.extra2 = &max_sched_granularity_ns,
	},
	{
		.procname = "sched_latency_ns",
		.data = &sysctl_sched_latency,
		.maxlen = sizeof(unsigned int),
		.mode = 0644,
		.proc_handler = sched_proc_update_handler,
		.extra1 = &min_sched_granularity_ns,
		.extra2 = &max_sched_granularity_ns,
	},
	{
Use the new batched user accesses in generic user string handling
Add 'unsafe' user access functions for batched accesses
x86: reorganize SMAP handling in user space accesses
mm: Hardened usercopy
mm: Implement stack frame object validation
mm: Add is_migrate_cma_page
Linux 4.4.19
Documentation/module-signing.txt: Note need for version info if reusing a key
module: Invalidate signatures on force-loaded modules
dm flakey: error READ bios during the down_interval
rtc: s3c: Add s3c_rtc_{enable/disable}_clk in s3c_rtc_setfreq()
lpfc: fix oops in lpfc_sli4_scmd_to_wqidx_distr() from lpfc_send_taskmgmt()
ACPI / EC: Work around method reentrancy limit in ACPICA for _Qxx
x86/platform/intel_mid_pci: Rework IRQ0 workaround
PCI: Mark Atheros AR9485 and QCA9882 to avoid bus reset
MIPS: hpet: Increase HPET_MIN_PROG_DELTA and decrease HPET_MIN_CYCLES
MIPS: Don't register r4k sched clock when CPUFREQ enabled
MIPS: mm: Fix definition of R6 cache instruction
SUNRPC: Don't allocate a full sockaddr_storage for tracing
Input: elan_i2c - properly wake up touchpad on ASUS laptops
target: Fix ordered task CHECK_CONDITION early exception handling
target: Fix max_unmap_lba_count calc overflow
target: Fix race between iscsi-target connection shutdown + ABORT_TASK
target: Fix missing complete during ABORT_TASK + CMD_T_FABRIC_STOP
target: Fix ordered task target_setup_cmd_from_cdb exception hang
iscsi-target: Fix panic when adding second TCP connection to iSCSI session
ubi: Fix race condition between ubi device creation and udev
ubi: Fix early logging
ubi: Make volume resize power cut aware
of: fix memory leak related to safe_name()
IB/mlx4: Fix memory leak if QP creation failed
IB/mlx4: Fix error flow when sending mads under SRIOV
IB/mlx4: Fix the SQ size of an RC QP
IB/IWPM: Fix a potential skb leak
IB/IPoIB: Don't update neigh validity for unresolved entries
IB/SA: Use correct free function
IB/mlx5: Return PORT_ERR in Active to Initializing transition
IB/mlx5: Fix post send fence logic
IB/mlx5: Fix entries check in mlx5_ib_resize_cq
IB/mlx5: Fix returned values of query QP
IB/mlx5: Fix entries checks in mlx5_ib_create_cq
IB/mlx5: Fix MODIFY_QP command input structure
ALSA: hda - Fix headset mic detection problem for two dell machines
ALSA: hda: add AMD Bonaire AZ PCI ID with proper driver caps
ALSA: hda/realtek - Can't adjust speaker's volume on a Dell AIO
ALSA: hda: Fix krealloc() with __GFP_ZERO usage
mm/hugetlb: avoid soft lockup in set_max_huge_pages()
mtd: nand: fix bug writing 1 byte less than page size
block: fix bdi vs gendisk lifetime mismatch
block: add missing group association in bio-cloning functions
metag: Fix __cmpxchg_u32 asm constraint for CMP
ftrace/recordmcount: Work around for addition of metag magic but not relocations
balloon: check the number of available pages in leak balloon
drm/i915/dp: Revert "drm/i915/dp: fall back to 18 bpp when sink capability is unknown"
drm/i915: Never fully mask the EI up rps interrupt on SNB/IVB
drm/edid: Add 6 bpc quirk for display AEO model 0.
drm: Restore double clflush on the last partial cacheline
drm/nouveau/fbcon: fix font width not divisible by 8
drm/nouveau/gr/nv3x: fix instobj write offsets in gr setup
drm/nouveau: check for supported chipset before booting fbdev off the hw
drm/radeon: support backlight control for UNIPHY3
drm/radeon: fix firmware info version checks
drm/radeon: Poll for both connect/disconnect on analog connectors
drm/radeon: add a delay after ATPX dGPU power off
drm/amdgpu/gmc7: add missing mullins case
drm/amdgpu: fix firmware info version checks
drm/amdgpu: Disable RPM helpers while reprobing connectors on resume
drm/amdgpu: support backlight control for UNIPHY3
drm/amdgpu: Poll for both connect/disconnect on analog connectors
drm/amdgpu: add a delay after ATPX dGPU power off
w1:omap_hdq: fix regression
netlabel: add address family checks to netlbl_{sock,req}_delattr()
ARM: dts: sunxi: Add a startup delay for fixed regulator enabled phys
audit: fix a double fetch in audit_log_single_execve_arg()
iommu/amd: Update Alias-DTE in update_device_table()
iommu/amd: Init unity mappings only for dma_ops domains
iommu/amd: Handle IOMMU_DOMAIN_DMA in ops->domain_free call-back
iommu/vt-d: Return error code in domain_context_mapping_one()
iommu/exynos: Suppress unbinding to prevent system failure
drm/i915: Don't complain about lack of ACPI video bios
nfsd: don't return an unhashed lock stateid after taking mutex
nfsd: Fix race between FREE_STATEID and LOCK
nfs: don't create zero-length requests
MIPS: KVM: Propagate kseg0/mapped tlb fault errors
MIPS: KVM: Fix gfn range check in kseg0 tlb faults
MIPS: KVM: Add missing gfn range check
MIPS: KVM: Fix mapped fault broken commpage handling
random: add interrupt callback to VMBus IRQ handler
random: print a warning for the first ten uninitialized random users
random: initialize the non-blocking pool via add_hwgenerator_randomness()
CIFS: Fix a possible invalid memory access in smb2_query_symlink()
cifs: fix crash due to race in hmac(md5) handling
cifs: Check for existing directory when opening file with O_CREAT
fs/cifs: make share unaccessible at root level mountable
jbd2: make journal y2038 safe
ARC: mm: don't lose PTE_SPECIAL in pte_modify()
remoteproc: Fix potential race condition in rproc_add
ovl: disallow overlayfs as upperdir
HID: uhid: fix timeout when probe races with IO
EDAC: Correct channel count limit
Bluetooth: Fix l2cap_sock_setsockopt() with optname BT_RCVMTU
spi: pxa2xx: Clear all RFT bits in reset_sccr1() on Intel Quark
i2c: efm32: fix a failure path in efm32_i2c_probe()
s5p-mfc: Add release callback for memory region devs
s5p-mfc: Set device name for reserved memory region devs
hp-wmi: Fix wifi cannot be hard-unblocked
dm: set DMF_SUSPENDED* _before_ clearing DMF_NOFLUSH_SUSPENDING
sur40: fix occasional oopses on device close
sur40: lower poll interval to fix occasional FPS drops to ~56 FPS
Fix RC5 decoding with Fintek CIR chipset
vb2: core: Skip planes array verification if pb is NULL
videobuf2-v4l2: Verify planes array in buffer dequeueing
media: dvb_ringbuffer: Add memory barriers
media: usbtv: prevent access to free'd resources
mfd: qcom_rpm: Parametrize also ack selector size
mfd: qcom_rpm: Fix offset error for msm8660
intel_pstate: Fix MSR_CONFIG_TDP_x addressing in core_get_max_pstate()
s390/cio: allow to reset channel measurement block
KVM: nVMX: Fix memory corruption when using VMCS shadowing
KVM: VMX: handle PML full VMEXIT that occurs during event delivery
KVM: MTRR: fix kvm_mtrr_check_gfn_range_consistency page fault
KVM: PPC: Book3S HV: Save/restore TM state in H_CEDE
KVM: PPC: Book3S HV: Pull out TM state save/restore into separate procedures
arm64: mm: avoid fdt_check_header() before the FDT is fully mapped
arm64: dts: rockchip: fixes the gic400 2nd region size for rk3368
pinctrl: cherryview: prevent concurrent access to GPIO controllers
Bluetooth: hci_intel: Fix null gpio desc pointer dereference
gpio: intel-mid: Remove potentially harmful code
gpio: pca953x: Fix NBANK calculation for PCA9536
tty/serial: atmel: fix RS485 half duplex with DMA
serial: samsung: Fix ERR pointer dereference on deferred probe
tty: serial: msm: Don't read off end of tx fifo
arm64: Fix incorrect per-cpu usage for boot CPU
arm64: debug: unmask PSTATE.D earlier
arm64: kernel: Save and restore UAO and addr_limit on exception entry
USB: usbfs: fix potential infoleak in devio
usb: renesas_usbhs: fix NULL pointer dereference in xfer_work()
USB: serial: option: add support for Telit LE910 PID 0x1206
usb: dwc3: fix for the isoc transfer EP_BUSY flag
usb: quirks: Add no-lpm quirk for Elan
usb: renesas_usbhs: protect the CFIFOSEL setting in usbhsg_ep_enable()
usb: f_fs: off by one bug in _ffs_func_bind()
usb: gadget: avoid exposing kernel stack
UPSTREAM: usb: gadget: configfs: add mutex lock before unregister gadget
ANDROID: dm-verity: adopt changes made to dm callbacks
UPSTREAM: ecryptfs: fix handling of directory opening
ANDROID: net: core: fix UID-based routing
ANDROID: net: fib: remove duplicate assignment
FROMLIST: proc: Fix timerslack_ns CAP_SYS_NICE check when adjusting self
ANDROID: dm verity fec: pack the fec_header structure
ANDROID: dm: android-verity: Verify header before fetching table
ANDROID: dm: allow adb disable-verity only in userdebug
ANDROID: dm: mount as linear target if eng build
ANDROID: dm: use default verity public key
ANDROID: dm: fix signature verification flag
ANDROID: dm: use name_to_dev_t
ANDROID: dm: rename dm-linear methods for dm-android-verity
ANDROID: dm: Minor cleanup
ANDROID: dm: Mounting root as linear device when verity disabled
ANDROID: dm-android-verity: Rebase on top of 4.1
ANDROID: dm: Add android verity target
ANDROID: dm: fix dm_substitute_devices()
ANDROID: dm: Rebase on top of 4.1
CHROMIUM: dm: boot time specification of dm=
Implement memory_state_time, used by qcom,cpubw
Revert "panic: Add board ID to panic output"
usb: gadget: f_accessory: remove duplicate endpoint alloc
BACKPORT: brcmfmac: defer DPC processing during probe
FROMLIST: proc: Add LSM hook checks to /proc/<tid>/timerslack_ns
FROMLIST: proc: Relax /proc/<tid>/timerslack_ns capability requirements
UPSTREAM: ppp: defer netns reference release for ppp channel
cpuset: Add allow_attach hook for cpusets on android.
UPSTREAM: KEYS: Fix ASN.1 indefinite length object parsing
ANDROID: sdcardfs: fix itnull.cocci warnings
android-recommended.cfg: enable fstack-protector-strong
Linux 4.4.18
mm: memcontrol: fix memcg id ref counter on swap charge move
mm: memcontrol: fix swap counter leak on swapout from offline cgroup
mm: memcontrol: fix cgroup creation failure after many small jobs
ext4: fix reference counting bug on block allocation error
ext4: short-cut orphan cleanup on error
ext4: validate s_reserved_gdt_blocks on mount
ext4: don't call ext4_should_journal_data() on the journal inode
ext4: fix deadlock during page writeback
ext4: check for extents that wrap around
crypto: scatterwalk - Fix test in scatterwalk_done
crypto: gcm - Filter out async ghash if necessary
fs/dcache.c: avoid soft-lockup in dput()
fuse: fix wrong assignment of ->flags in fuse_send_init()
fuse: fuse_flush must check mapping->flags for errors
fuse: fsync() did not return IO errors
sysv, ipc: fix security-layer leaking
block: fix use-after-free in seq file
x86/syscalls/64: Add compat_sys_keyctl for 32-bit userspace
drm/i915: Pretend cursor is always on for ILK-style WM calculations (v2)
x86/mm/pat: Fix BUG_ON() in mmap_mem() on QEMU/i386
x86/pat: Document the PAT initialization sequence
x86/xen, pat: Remove PAT table init code from Xen
x86/mtrr: Fix PAT init handling when MTRR is disabled
x86/mtrr: Fix Xorg crashes in Qemu sessions
x86/mm/pat: Replace cpu_has_pat with boot_cpu_has()
x86/mm/pat: Add pat_disable() interface
x86/mm/pat: Add support of non-default PAT MSR setting
devpts: clean up interface to pty drivers
random: strengthen input validation for RNDADDTOENTCNT
apparmor: fix ref count leak when profile sha1 hash is read
Revert "s390/kdump: Clear subchannel ID to signal non-CCW/SCSI IPL"
KEYS: 64-bit MIPS needs to use compat_sys_keyctl for 32-bit userspace
arm: oabi compat: add missing access checks
cdc_ncm: do not call usbnet_link_change from cdc_ncm_bind
i2c: i801: Allow ACPI SystemIO OpRegion to conflict with PCI BAR
x86/mm/32: Enable full randomization on i386 and X86_32
HID: sony: do not bail out when the sixaxis refuses the output report
PNP: Add Broadwell to Intel MCH size workaround
PNP: Add Haswell-ULT to Intel MCH size workaround
scsi: ignore errors from scsi_dh_add_device()
ipath: Restrict use of the write() interface
tcp: consider recv buf for the initial window scale
qed: Fix setting/clearing bit in completion bitmap
net/irda: fix NULL pointer dereference on memory allocation failure
net: bgmac: Fix infinite loop in bgmac_dma_tx_add()
bonding: set carrier off for devices created through netlink
ipv4: reject RTNH_F_DEAD and RTNH_F_LINKDOWN from user space
tcp: enable per-socket rate limiting of all 'challenge acks'
tcp: make challenge acks less predictable
arm64: relocatable: suppress R_AARCH64_ABS64 relocations in vmlinux
arm64: vmlinux.lds: make __rela_offset and __dynsym_offset ABSOLUTE
Linux 4.4.17
vfs: fix deadlock in file_remove_privs() on overlayfs
intel_th: Fix a deadlock in modprobing
intel_th: pci: Add Kaby Lake PCH-H support
net: mvneta: set real interrupt per packet for tx_done
libceph: apply new_state before new_up_client on incrementals
libata: LITE-ON CX1-JB256-HP needs lower max_sectors
i2c: mux: reg: wrong condition checked for of_address_to_resource return value
posix_cpu_timer: Exit early when process has been reaped
media: fix airspy usb probe error path
ipr: Clear interrupt on croc/crocodile when running with LSI
SCSI: fix new bug in scsi_dev_info_list string matching
RDS: fix rds_tcp_init() error path
can: fix oops caused by wrong rtnl dellink usage
can: fix handling of unmodifiable configuration options fix
can: c_can: Update D_CAN TX and RX functions to 32 bit - fix Altera Cyclone access
can: at91_can: RX queue could get stuck at high bus load
perf/x86: fix PEBS issues on Intel Atom/Core2
ovl: handle ATTR_KILL*
sched/fair: Fix effective_load() to consistently use smoothed load
mmc: block: fix packed command header endianness
block: fix use-after-free in sys_ioprio_get()
qeth: delete napi struct when removing a qeth device
platform/chrome: cros_ec_dev - double fetch bug in ioctl
clk: rockchip: initialize flags of clk_init_data in mmc-phase clock
spi: sun4i: fix FIFO limit
spi: sunxi: fix transfer timeout
namespace: update event counter when umounting a deleted dentry
9p: use file_dentry()
ext4: verify extent header depth
ecryptfs: don't allow mmap when the lower fs doesn't support it
Revert "ecryptfs: forbid opening files without mmap handler"
locks: use file_inode()
power_supply: power_supply_read_temp only if use_cnt > 0
cgroup: set css->id to -1 during init
pinctrl: imx: Do not treat a PIN without MUX register as an error
pinctrl: single: Fix missing flush of posted write for a wakeirq
pvclock: Add CPU barriers to get correct version value
Input: tsc200x - report proper input_dev name
Input: xpad - validate USB endpoint count during probe
Input: wacom_w8001 - w8001_MAX_LENGTH should be 13
Input: xpad - fix oops when attaching an unknown Xbox One gamepad
Input: elantech - add more IC body types to the list
Input: vmmouse - remove port reservation
ALSA: timer: Fix leak in events via snd_timer_user_tinterrupt
ALSA: timer: Fix leak in events via snd_timer_user_ccallback
ALSA: timer: Fix leak in SNDRV_TIMER_IOCTL_PARAMS
xenbus: don't bail early from xenbus_dev_request_and_reply()
xenbus: don't BUG() on user mode induced condition
xen/pciback: Fix conf_space read/write overlap check.
ARC: unwind: ensure that .debug_frame is generated (vs. .eh_frame)
arc: unwind: warn only once if DW2_UNWIND is disabled
kernel/sysrq, watchdog, sched/core: Reset watchdog on all CPUs while processing sysrq-w
pps: do not crash when failed to register
vmlinux.lds: account for destructor sections
mm, meminit: ensure node is online before checking whether pages are uninitialised
mm, meminit: always return a valid node from early_pfn_to_nid
mm, compaction: prevent VM_BUG_ON when terminating freeing scanner
fs/nilfs2: fix potential underflow in call to crc32_le
mm, compaction: abort free scanner if split fails
mm, sl[au]b: add __GFP_ATOMIC to the GFP reclaim mask
dmaengine: at_xdmac: double FIFO flush needed to compute residue
dmaengine: at_xdmac: fix residue corruption
dmaengine: at_xdmac: align descriptors on 64 bits
x86/quirks: Add early quirk to reset Apple AirPort card
x86/quirks: Reintroduce scanning of secondary buses
x86/quirks: Apply nvidia_bugs quirk only on root bus
USB: OHCI: Don't mark EDs as ED_OPER if scheduling fails
Conflicts:
	arch/arm/kernel/topology.c
	arch/arm64/include/asm/arch_gicv3.h
	arch/arm64/kernel/topology.c
	block/bio.c
	drivers/cpufreq/Kconfig
	drivers/md/Makefile
	drivers/media/dvb-core/dvb_ringbuffer.c
	drivers/media/tuners/tuner-xc2028.c
	drivers/misc/Kconfig
	drivers/misc/Makefile
	drivers/mmc/core/host.c
	drivers/scsi/ufs/ufshcd.c
	drivers/scsi/ufs/ufshcd.h
	drivers/usb/dwc3/gadget.c
	drivers/usb/gadget/configfs.c
	fs/ecryptfs/file.c
	include/linux/mmc/core.h
	include/linux/mmc/host.h
	include/linux/mmzone.h
	include/linux/sched.h
	include/linux/sched/sysctl.h
	include/trace/events/power.h
	include/trace/events/sched.h
	init/Kconfig
	kernel/cpuset.c
	kernel/exit.c
	kernel/sched/Makefile
	kernel/sched/core.c
	kernel/sched/cputime.c
	kernel/sched/fair.c
	kernel/sched/features.h
	kernel/sched/rt.c
	kernel/sched/sched.h
	kernel/sched/stop_task.c
	kernel/sched/tune.c
	lib/Kconfig.debug
	mm/Makefile
	mm/vmstat.c
Change-Id: I243a43231ca56a6362076fa6301827e1b0493be5
Signed-off-by: Runmin Wang <runminw@codeaurora.org>
2016-12-12 15:32:39 -08:00
		.procname = "sched_is_big_little",
		.data = &sysctl_sched_is_big_little,
2016-07-22 13:21:15 +01:00
		.maxlen = sizeof(unsigned int),
		.mode = 0644,
		.proc_handler = proc_dointvec,
	},
2016-07-29 14:04:11 +01:00
	{
		.procname = "sched_sync_hint_enable",
		.data = &sysctl_sched_sync_hint_enable,
		.maxlen = sizeof(unsigned int),
		.mode = 0644,
		.proc_handler = proc_dointvec,
	},
2016-03-11 16:44:16 -08:00
	{
		.procname = "sched_initial_task_util",
		.data = &sysctl_sched_initial_task_util,
		.maxlen = sizeof(unsigned int),
		.mode = 0644,
		.proc_handler = proc_dointvec,
	},
2016-07-14 09:57:29 +01:00
	{
		.procname = "sched_cstate_aware",
		.data = &sysctl_sched_cstate_aware,
		.maxlen = sizeof(unsigned int),
		.mode = 0644,
		.proc_handler = proc_dointvec,
	},
2007-07-09 18:52:00 +02:00
	{
		.procname = "sched_wakeup_granularity_ns",
		.data = &sysctl_sched_wakeup_granularity,
		.maxlen = sizeof(unsigned int),
		.mode = 0644,
2009-12-12 11:34:10 -08:00
		.proc_handler = sched_proc_update_handler,
2007-07-09 18:52:00 +02:00
		.extra1 = &min_wakeup_granularity_ns,
		.extra2 = &max_wakeup_granularity_ns,
	},
2012-10-25 14:16:43 +02:00
#ifdef CONFIG_SMP
2009-11-30 12:16:47 +01:00
	{
		.procname = "sched_tunable_scaling",
		.data = &sysctl_sched_tunable_scaling,
		.maxlen = sizeof(enum sched_tunable_scaling),
		.mode = 0644,
2009-12-12 11:34:10 -08:00
		.proc_handler = sched_proc_update_handler,
2009-11-30 12:16:47 +01:00
		.extra1 = &min_sched_tunable_scaling,
		.extra2 = &max_sched_tunable_scaling,
2008-06-27 13:41:35 +02:00
	},
2007-10-15 17:00:18 +02:00
	{
2012-08-16 11:15:30 +09:00
		.procname = "sched_migration_cost_ns",
2007-10-15 17:00:18 +02:00
		.data = &sysctl_sched_migration_cost,
		.maxlen = sizeof(unsigned int),
		.mode = 0644,
2009-11-16 03:11:48 -08:00
		.proc_handler = proc_dointvec,
2007-10-15 17:00:18 +02:00
	},
2007-11-09 22:39:39 +01:00
	{
		.procname = "sched_nr_migrate",
		.data = &sysctl_sched_nr_migrate,
		.maxlen = sizeof(unsigned int),
2008-01-25 21:08:29 +01:00
		.mode = 0644,
2009-11-16 03:11:48 -08:00
		.proc_handler = proc_dointvec,
2008-01-25 21:08:29 +01:00
	},
2009-09-01 10:34:37 +02:00
	{
2012-08-16 11:15:30 +09:00
		.procname = "sched_time_avg_ms",
2009-09-01 10:34:37 +02:00
		.data = &sysctl_sched_time_avg,
		.maxlen = sizeof(unsigned int),
		.mode = 0644,
2017-02-02 14:24:34 +05:30
		.proc_handler = proc_dointvec_minmax,
		.extra1 = &one,
2009-09-01 10:34:37 +02:00
	},
2010-11-15 15:47:06 -08:00
	{
2012-08-16 11:15:30 +09:00
		.procname = "sched_shares_window_ns",
2010-11-15 15:47:06 -08:00
		.data = &sysctl_sched_shares_window,
		.maxlen = sizeof(unsigned int),
		.mode = 0644,
		.proc_handler = proc_dointvec,
	},
2012-10-25 14:16:43 +02:00
#endif /* CONFIG_SMP */
#ifdef CONFIG_NUMA_BALANCING
2012-10-25 14:16:47 +02:00
	{
		.procname = "numa_balancing_scan_delay_ms",
		.data = &sysctl_numa_balancing_scan_delay,
		.maxlen = sizeof(unsigned int),
		.mode = 0644,
		.proc_handler = proc_dointvec,
	},
2012-10-25 14:16:43 +02:00
	{
		.procname = "numa_balancing_scan_period_min_ms",
		.data = &sysctl_numa_balancing_scan_period_min,
		.maxlen = sizeof(unsigned int),
		.mode = 0644,
		.proc_handler = proc_dointvec,
	},
	{
		.procname = "numa_balancing_scan_period_max_ms",
		.data = &sysctl_numa_balancing_scan_period_max,
		.maxlen = sizeof(unsigned int),
		.mode = 0644,
		.proc_handler = proc_dointvec,
	},
mm: sched: numa: Implement constant, per task Working Set Sampling (WSS) rate

Previously, to probe the working set of a task, we'd use
a very simple and crude method: mark all of its address
space PROT_NONE.

That method has various (obvious) disadvantages:

 - it samples the working set at dissimilar rates,
   giving some tasks a sampling quality advantage
   over others.

 - creates performance problems for tasks with very
   large working sets

 - over-samples processes with large address spaces but
   which only very rarely execute

Improve that method by keeping a rotating offset into the
address space that marks the current position of the scan,
and advance it by a constant rate (in a CPU cycles execution
proportional manner). If the offset reaches the last mapped
address of the mm then it starts over at the first
address.

The per-task nature of the working set sampling functionality in this tree
allows such constant rate, per task, execution-weight proportional sampling
of the working set, with an adaptive sampling interval/frequency that
goes from once per 100ms up to just once per 8 seconds. The current
sampling volume is 256 MB per interval.

As tasks mature and converge their working set, so does the
sampling rate slow down to just a trickle, 256 MB per 8
seconds of CPU time executed.

This, beyond being adaptive, also rate-limits rarely
executing systems and does not over-sample on overloaded
systems.

[ In AutoNUMA speak, this patch deals with the effective sampling
  rate of the 'hinting page fault'. AutoNUMA's scanning is
  currently rate-limited, but it is also fundamentally
  single-threaded, executing in the knuma_scand kernel thread,
  so the limit in AutoNUMA is global and does not scale up with
  the number of CPUs, nor does it scan tasks in an execution
  proportional manner.

  So the idea of rate-limiting the scanning was first implemented
  in the AutoNUMA tree via a global rate limit. This patch goes
  beyond that by implementing an execution rate proportional
  working set sampling rate that is not implemented via a single
  global scanning daemon. ]

[ Dan Carpenter pointed out a possible NULL pointer dereference in the
  first version of this patch. ]

Based-on-idea-by: Andrea Arcangeli <aarcange@redhat.com>
Bug-Found-By: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
[ Wrote changelog and fixed bug. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
2012-10-25 14:16:45 +02:00
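The rotating scan offset described in the commit message above can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the kernel implementation: `struct scan_state`, `scan_step()` and the flat `[first, last)` address range are hypothetical names invented here, and the model only tracks the cursor instead of touching page protections.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of one task's address space and its scan cursor. */
struct scan_state {
	uint64_t first;		/* first mapped address */
	uint64_t last;		/* one past the last mapped address */
	uint64_t offset;	/* current scan position */
};

/*
 * Sample at most `chunk` bytes starting at the cursor and report the
 * covered range as [*start, *end). When the range reaches the end of
 * the address space, the cursor starts over at the first address, so
 * every task is sampled at the same constant volume per interval.
 */
void scan_step(struct scan_state *s, uint64_t chunk,
	       uint64_t *start, uint64_t *end)
{
	assert(s->first < s->last && chunk > 0);

	/* A stale or uninitialised cursor restarts at the beginning. */
	if (s->offset < s->first || s->offset >= s->last)
		s->offset = s->first;

	*start = s->offset;
	/* Clip the chunk at the end of the address space. */
	*end = (s->last - s->offset > chunk) ? s->offset + chunk : s->last;

	/* Wrap around once the last mapped address has been reached. */
	s->offset = (*end == s->last) ? s->first : *end;
}
```

Calling `scan_step()` once per sampling interval with a fixed `chunk` (the commit message quotes 256 MB) walks the whole range in equal steps and then wraps, regardless of how large the address space is.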
	{
		.procname = "numa_balancing_scan_size_mb",
		.data = &sysctl_numa_balancing_scan_size,
		.maxlen = sizeof(unsigned int),
		.mode = 0644,
2014-10-16 14:39:37 +04:00
		.proc_handler = proc_dointvec_minmax,
		.extra1 = &one,
2012-10-25 14:16:45 +02:00
	},
2014-01-23 15:53:13 -08:00
	{
		.procname = "numa_balancing",
		.data = NULL, /* filled in by handler */
		.maxlen = sizeof(unsigned int),
		.mode = 0644,
		.proc_handler = sysctl_numa_balancing,
		.extra1 = &zero,
		.extra2 = &one,
	},
2012-10-25 14:16:43 +02:00
#endif /* CONFIG_NUMA_BALANCING */
#endif /* CONFIG_SCHED_DEBUG */
2008-02-13 15:45:39 +01:00
	{
		.procname = "sched_rt_period_us",
		.data = &sysctl_sched_rt_period,
		.maxlen = sizeof(unsigned int),
		.mode = 0644,
2009-11-16 03:11:48 -08:00
		.proc_handler = sched_rt_handler,
2008-02-13 15:45:39 +01:00
	},
	{
		.procname = "sched_rt_runtime_us",
		.data = &sysctl_sched_rt_runtime,
		.maxlen = sizeof(int),
		.mode = 0644,
2009-11-16 03:11:48 -08:00
		.proc_handler = sched_rt_handler,
2008-02-13 15:45:39 +01:00
	},
2013-02-07 09:47:04 -06:00
	{
		.procname = "sched_rr_timeslice_ms",
		.data = &sched_rr_timeslice,
		.maxlen = sizeof(int),
		.mode = 0644,
		.proc_handler = sched_rr_handler,
	},
sched: Add 'autogroup' scheduling feature: automated per session task groups
A recurring complaint from CFS users is that parallel kbuild has
a negative impact on desktop interactivity. This patch
implements an idea from Linus, to automatically create task
groups. Currently, only per session autogroups are implemented,
but the patch leaves the way open for enhancement.
Implementation: each task's signal struct contains an inherited
pointer to a refcounted autogroup struct containing a task group
pointer, the default for all tasks pointing to the
init_task_group. When a task calls setsid(), a new task group
is created, the process is moved into the new task group, and a
reference to the preveious task group is dropped. Child
processes inherit this task group thereafter, and increase it's
refcount. When the last thread of a process exits, the
process's reference is dropped, such that when the last process
referencing an autogroup exits, the autogroup is destroyed.
At runqueue selection time, IFF a task has no cgroup assignment,
its current autogroup is used.
Autogroup bandwidth is controllable via setting it's nice level
through the proc filesystem:
cat /proc/<pid>/autogroup
Displays the task's group and the group's nice level.
echo <nice level> > /proc/<pid>/autogroup
Sets the task group's shares to the weight of nice <level> task.
Setting nice level is rate limited for !admin users due to the
abuse risk of task group locking.
The feature is enabled from boot by default if
CONFIG_SCHED_AUTOGROUP=y is selected, but can be disabled via
the boot option noautogroup, and can also be turned on/off on
the fly via:
echo [01] > /proc/sys/kernel/sched_autogroup_enabled
... which will automatically move tasks to/from the root task group.
Signed-off-by: Mike Galbraith <efault@gmx.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Markus Trippelsdorf <markus@trippelsdorf.de>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Paul Turner <pjt@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
[ Removed the task_group_path() debug code, and fixed !EVENTFD build failure. ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
LKML-Reference: <1290281700.28711.9.camel@maggy.simson.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
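The group-selection rule described in the message above (a cgroup assignment wins, otherwise the autogroup is used when the feature is enabled, otherwise the root task group) can be sketched as a small user-space model. This is only an illustration under stated assumptions: `model_group`, `model_task` and `model_pick_group` are hypothetical names, not the scheduler's own types or functions.

```c
#include <stddef.h>

/* Hypothetical stand-ins for a task group and a task; not kernel types. */
struct model_group {
    const char *name;
};

struct model_task {
    const struct model_group *cgroup;    /* NULL: no cgroup assignment */
    const struct model_group *autogroup; /* per-session group */
};

/*
 * Pick the group a task is scheduled in:
 *  - an explicit cgroup assignment always wins;
 *  - otherwise the session autogroup is used if the feature is enabled;
 *  - otherwise the task stays in the root group.
 */
static const struct model_group *
model_pick_group(const struct model_task *t, int autogroup_enabled,
                 const struct model_group *root)
{
    if (t->cgroup != NULL)
        return t->cgroup;
    if (autogroup_enabled && t->autogroup != NULL)
        return t->autogroup;
    return root;
}
```

In this model, toggling `autogroup_enabled` corresponds to writing 0 or 1 to /proc/sys/kernel/sched_autogroup_enabled: tasks without a cgroup move between their autogroup and the root group.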
2010-11-30 14:18:03 +01:00
|
|
|
#ifdef CONFIG_SCHED_AUTOGROUP
|
|
|
|
{
|
|
|
|
.procname = "sched_autogroup_enabled",
|
|
|
|
.data = &sysctl_sched_autogroup_enabled,
|
|
|
|
.maxlen = sizeof(unsigned int),
|
|
|
|
.mode = 0644,
|
2011-02-20 15:08:12 +08:00
|
|
|
.proc_handler = proc_dointvec_minmax,
|
2010-11-30 14:18:03 +01:00
|
|
|
.extra1 = &zero,
|
|
|
|
.extra2 = &one,
|
|
|
|
},
|
|
|
|
#endif
|
2011-07-21 09:43:30 -07:00
|
|
|
#ifdef CONFIG_CFS_BANDWIDTH
|
|
|
|
{
|
|
|
|
.procname = "sched_cfs_bandwidth_slice_us",
|
|
|
|
.data = &sysctl_sched_cfs_bandwidth_slice,
|
|
|
|
.maxlen = sizeof(unsigned int),
|
|
|
|
.mode = 0644,
|
|
|
|
.proc_handler = proc_dointvec_minmax,
|
|
|
|
.extra1 = &one,
|
|
|
|
},
|
|
|
|
#endif
|
2015-06-22 18:11:44 +01:00
|
|
|
#ifdef CONFIG_SCHED_TUNE
|
|
|
|
{
|
|
|
|
.procname = "sched_cfs_boost",
|
|
|
|
.data = &sysctl_sched_cfs_boost,
|
|
|
|
.maxlen = sizeof(sysctl_sched_cfs_boost),
|
2015-06-23 09:17:54 +01:00
|
|
|
#ifdef CONFIG_CGROUP_SCHEDTUNE
|
|
|
|
.mode = 0444,
|
|
|
|
#else
|
2015-06-22 18:11:44 +01:00
|
|
|
.mode = 0644,
|
2015-06-23 09:17:54 +01:00
|
|
|
#endif
|
2015-06-22 18:11:44 +01:00
|
|
|
.proc_handler = &sysctl_sched_cfs_boost_handler,
|
|
|
|
.extra1 = &zero,
|
|
|
|
.extra2 = &one_hundred,
|
|
|
|
},
|
|
|
|
#endif
|
2007-07-19 01:48:56 -07:00
|
|
|
#ifdef CONFIG_PROVE_LOCKING
|
|
|
|
{
|
|
|
|
.procname = "prove_locking",
|
|
|
|
.data = &prove_locking,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2007-07-19 01:48:56 -07:00
|
|
|
},
|
|
|
|
#endif
|
|
|
|
#ifdef CONFIG_LOCK_STAT
|
|
|
|
{
|
|
|
|
.procname = "lock_stat",
|
|
|
|
.data = &lock_stat,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2007-07-19 01:48:56 -07:00
|
|
|
},
|
2007-07-09 18:52:00 +02:00
|
|
|
#endif
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
|
|
|
.procname = "panic",
|
|
|
|
.data = &panic_timeout,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
2012-10-04 17:15:23 -07:00
|
|
|
#ifdef CONFIG_COREDUMP
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
|
|
|
.procname = "core_uses_pid",
|
|
|
|
.data = &core_uses_pid,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "core_pattern",
|
|
|
|
.data = core_pattern,
|
2007-05-16 22:11:16 -07:00
|
|
|
.maxlen = CORENAME_MAX_SIZE,
|
2005-04-16 15:20:36 -07:00
|
|
|
.mode = 0644,
|
2012-07-30 14:39:18 -07:00
|
|
|
.proc_handler = proc_dostring_coredump,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
2009-09-23 15:56:56 -07:00
|
|
|
{
|
|
|
|
.procname = "core_pipe_limit",
|
|
|
|
.data = &core_pipe_limit,
|
|
|
|
.maxlen = sizeof(unsigned int),
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2009-09-23 15:56:56 -07:00
|
|
|
},
|
2012-10-04 17:15:23 -07:00
|
|
|
#endif
|
2007-02-10 01:45:24 -08:00
|
|
|
#ifdef CONFIG_PROC_SYSCTL
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
|
|
|
.procname = "tainted",
|
2008-10-15 22:01:41 -07:00
|
|
|
.maxlen = sizeof(long),
|
2007-02-10 01:45:24 -08:00
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_taint,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
sysctl: allow for strict write position handling
When writing to a sysctl string, each write, regardless of VFS position,
begins writing the string from the start. This means the contents of
the last write to the sysctl controls the string contents instead of the
first:
open("/proc/sys/kernel/modprobe", O_WRONLY) = 1
write(1, "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"..., 4096) = 4096
write(1, "/bin/true", 9) = 9
close(1) = 0
$ cat /proc/sys/kernel/modprobe
/bin/true
Expected behaviour would be to have the sysctl be "AAAA..." capped at
maxlen (in this case KMOD_PATH_LEN: 256), instead of truncating to the
contents of the second write. Similarly, multiple short writes would
not append to the sysctl.
The old behavior differs enough from regular POSIX files that audits
of software interacting with sysctls can run into unexpected or
dangerous situations. For example, "as long as the input starts with a
trusted path" turns out to be an insufficient filter, as what must also
happen is for the input to be entirely contained in a single write
syscall -- not a common consideration, especially for high level tools.
This provides kernel.sysctl_writes_strict as a way to make this behavior
act in a less surprising manner for strings, and disallows non-zero file
position when writing numeric sysctls (similar to what is already done
when reading from non-zero file positions). For now, the default (0) is
to warn about non-zero file position use, but retain the legacy
behavior. Setting this to -1 disables the warning, and setting this to
1 enables the file position respecting behavior.
[akpm@linux-foundation.org: fix build]
[akpm@linux-foundation.org: move misplaced hunk, per Randy]
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
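The difference between the legacy and the strict write handling described above can be illustrated with a small user-space model. It is a simplified sketch, not the kernel's string handler: `model_sysctl`, `model_write` and `MODEL_MAXLEN` are hypothetical names, and the buffer is much smaller than KMOD_PATH_LEN.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical capacity of the modelled sysctl string, including the NUL. */
enum { MODEL_MAXLEN = 16 };

struct model_sysctl {
    char data[MODEL_MAXLEN]; /* always NUL-terminated */
    size_t pos;              /* file position, advanced by every write */
};

/*
 * Apply one write of len bytes to the modelled sysctl.
 * strict == 0: legacy behaviour, every write starts again at offset 0.
 * strict != 0: the write lands at the current file position, so short
 *              writes append and data past the capacity is dropped.
 */
static void model_write(struct model_sysctl *s, const char *buf,
                        size_t len, int strict)
{
    size_t start, room, n;

    assert(s != NULL && buf != NULL);

    start = strict ? s->pos : 0;
    s->pos += len;

    if (start >= MODEL_MAXLEN - 1)
        return; /* nothing fits beyond the capacity */

    room = MODEL_MAXLEN - 1 - start;
    n = len < room ? len : room;
    memcpy(s->data + start, buf, n);
    s->data[start + n] = '\0';
}
```

Replaying the two writes from the trace above, the legacy mode ends with "/bin/true", while the strict mode keeps the leading run of 'A' characters capped at the buffer capacity.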
2014-06-06 14:37:19 -07:00
|
|
|
{
|
|
|
|
.procname = "sysctl_writes_strict",
|
|
|
|
.data = &sysctl_writes_strict,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
|
|
|
.proc_handler = proc_dointvec_minmax,
|
|
|
|
.extra1 = &neg_one,
|
|
|
|
.extra2 = &one,
|
|
|
|
},
|
2007-02-10 01:45:24 -08:00
|
|
|
#endif
|
2008-01-25 21:08:34 +01:00
|
|
|
#ifdef CONFIG_LATENCYTOP
|
|
|
|
{
|
|
|
|
.procname = "latencytop",
|
|
|
|
.data = &latencytop_enabled,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2008-01-25 21:08:34 +01:00
|
|
|
},
|
|
|
|
#endif
|
2005-04-16 15:20:36 -07:00
|
|
|
#ifdef CONFIG_BLK_DEV_INITRD
|
|
|
|
{
|
|
|
|
.procname = "real-root-dev",
|
|
|
|
.data = &real_root_dev,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
|
|
|
#endif
|
2007-07-15 23:40:10 -07:00
|
|
|
{
|
|
|
|
.procname = "print-fatal-signals",
|
|
|
|
.data = &print_fatal_signals,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2007-07-15 23:40:10 -07:00
|
|
|
},
|
2008-09-11 23:29:54 -07:00
|
|
|
#ifdef CONFIG_SPARC
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
|
|
|
.procname = "reboot-cmd",
|
|
|
|
.data = reboot_command,
|
|
|
|
.maxlen = 256,
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dostring,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "stop-a",
|
|
|
|
.data = &stop_a_enabled,
|
|
|
|
.maxlen = sizeof (int),
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "scons-poweroff",
|
|
|
|
.data = &scons_pwroff,
|
|
|
|
.maxlen = sizeof (int),
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
|
|
|
#endif
|
2008-11-16 23:49:24 -08:00
|
|
|
#ifdef CONFIG_SPARC64
|
|
|
|
{
|
|
|
|
.procname = "tsb-ratio",
|
|
|
|
.data = &sysctl_tsb_ratio,
|
|
|
|
.maxlen = sizeof (int),
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2008-11-16 23:49:24 -08:00
|
|
|
},
|
|
|
|
#endif
|
2005-04-16 15:20:36 -07:00
|
|
|
#ifdef __hppa__
|
|
|
|
{
|
|
|
|
.procname = "soft-power",
|
|
|
|
.data = &pwrsw_enabled,
|
|
|
|
.maxlen = sizeof (int),
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
2013-01-18 15:12:24 +05:30
|
|
|
#endif
|
|
|
|
#ifdef CONFIG_SYSCTL_ARCH_UNALIGN_ALLOW
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
|
|
|
.procname = "unaligned-trap",
|
|
|
|
.data = &unaligned_enabled,
|
|
|
|
.maxlen = sizeof (int),
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
|
|
|
#endif
|
|
|
|
{
|
|
|
|
.procname = "ctrl-alt-del",
|
|
|
|
.data = &C_A_D,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
2008-10-06 19:06:12 -04:00
|
|
|
#ifdef CONFIG_FUNCTION_TRACER
|
2008-05-12 21:20:43 +02:00
|
|
|
{
|
|
|
|
.procname = "ftrace_enabled",
|
|
|
|
.data = &ftrace_enabled,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = ftrace_enable_sysctl,
|
2008-05-12 21:20:43 +02:00
|
|
|
},
|
|
|
|
#endif
|
2008-12-16 23:06:40 -05:00
|
|
|
#ifdef CONFIG_STACK_TRACER
|
|
|
|
{
|
|
|
|
.procname = "stack_tracer_enabled",
|
|
|
|
.data = &stack_tracer_enabled,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = stack_trace_sysctl,
|
2008-12-16 23:06:40 -05:00
|
|
|
},
|
|
|
|
#endif
|
2008-10-23 19:26:08 -04:00
|
|
|
#ifdef CONFIG_TRACING
|
|
|
|
{
|
2008-11-04 11:58:21 +01:00
|
|
|
.procname = "ftrace_dump_on_oops",
|
2008-10-23 19:26:08 -04:00
|
|
|
.data = &ftrace_dump_on_oops,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2008-10-23 19:26:08 -04:00
|
|
|
},
|
2013-06-14 16:21:43 -04:00
|
|
|
{
|
|
|
|
.procname = "traceoff_on_warning",
|
|
|
|
.data = &__disable_trace_on_warning,
|
|
|
|
.maxlen = sizeof(__disable_trace_on_warning),
|
|
|
|
.mode = 0644,
|
|
|
|
.proc_handler = proc_dointvec,
|
|
|
|
},
|
2014-12-12 22:27:10 -05:00
|
|
|
{
|
|
|
|
.procname = "tracepoint_printk",
|
|
|
|
.data = &tracepoint_printk,
|
|
|
|
.maxlen = sizeof(tracepoint_printk),
|
|
|
|
.mode = 0644,
|
|
|
|
.proc_handler = proc_dointvec,
|
|
|
|
},
|
2008-10-23 19:26:08 -04:00
|
|
|
#endif
|
2015-09-09 15:38:55 -07:00
|
|
|
#ifdef CONFIG_KEXEC_CORE
|
kexec: add sysctl to disable kexec_load
For general-purpose (i.e. distro) kernel builds it makes sense to build
with CONFIG_KEXEC to allow end users to choose what kind of things they
want to do with kexec. However, in the face of trying to lock down a
system with such a kernel, there needs to be a way to disable kexec_load
(much like module loading can be disabled). Without this, it is too easy
for the root user to modify kernel memory even when CONFIG_STRICT_DEVMEM
and modules_disabled are set. With this change, it is still possible to
load an image for use later, then disable kexec_load so the image (or lack
of image) can't be altered.
The intention is for using this in environments where "perfect"
enforcement is hard. Without a verified boot, along with verified
modules, and along with verified kexec, this is trying to give a system a
better chance to defend itself (or at least grow the window of
discoverability) against attack in the face of a privilege escalation.
In my mind, I consider several boot scenarios:
1) Verified boot of read-only verified root fs loading fd-based
verification of kexec images.
2) Secure boot of writable root fs loading signed kexec images.
3) Regular boot loading kexec (e.g. kcrash) image early and locking it.
4) Regular boot with no control of kexec image at all.
1 and 2 don't exist yet, but will soon once the verified kexec series has
landed. 4 is the state of things now. The gap between 2 and 4 is too
large, so this change creates scenario 3, a middle-ground above 4 when 2
and 1 are not possible for a system.
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
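The table entry for this sysctl sets both .extra1 and .extra2 to &one, which turns the range check into a one-way switch: once the value is 1, no write can bring it back to 0. A minimal user-space sketch of that idea follows. The helper `model_write_minmax` is a hypothetical name and only models the range check; it assumes an out-of-range write is rejected with -EINVAL.

```c
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical model of a range-checked integer sysctl write.
 * min and max play the role of .extra1 and .extra2; either may be NULL
 * to leave that side unbounded. The stored value only changes when the
 * new value passes the check.
 */
static int model_write_minmax(int *data, int val,
                              const int *min, const int *max)
{
    if ((min != NULL && val < *min) || (max != NULL && val > *max))
        return -EINVAL;
    *data = val;
    return 0;
}
```

With min and max both pointing at 1, writing 0 fails whether the flag is still 0 or already 1, and writing 1 succeeds, so the only possible transition is from the default 0 to 1.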
2014-01-23 15:55:59 -08:00
|
|
|
{
|
|
|
|
.procname = "kexec_load_disabled",
|
|
|
|
.data = &kexec_load_disabled,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
|
|
|
/* only handle a transition from default "0" to "1" */
|
|
|
|
.proc_handler = proc_dointvec_minmax,
|
|
|
|
.extra1 = &one,
|
|
|
|
.extra2 = &one,
|
|
|
|
},
|
|
|
|
#endif
|
2008-07-08 19:00:17 +02:00
|
|
|
#ifdef CONFIG_MODULES
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
|
|
|
.procname = "modprobe",
|
|
|
|
.data = &modprobe_path,
|
|
|
|
.maxlen = KMOD_PATH_LEN,
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dostring,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
2009-04-02 15:49:29 -07:00
|
|
|
{
|
|
|
|
.procname = "modules_disabled",
|
|
|
|
.data = &modules_disabled,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
|
|
|
/* only handle a transition from default "0" to "1" */
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec_minmax,
|
2009-04-02 15:49:29 -07:00
|
|
|
.extra1 = &one,
|
|
|
|
.extra2 = &one,
|
|
|
|
},
|
2005-04-16 15:20:36 -07:00
|
|
|
#endif
|
2014-04-10 14:09:31 -07:00
|
|
|
#ifdef CONFIG_UEVENT_HELPER
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
|
|
|
.procname = "hotplug",
|
2005-11-16 09:00:00 +01:00
|
|
|
.data = &uevent_helper,
|
|
|
|
.maxlen = UEVENT_HELPER_PATH_LEN,
|
2005-04-16 15:20:36 -07:00
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dostring,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
2014-04-10 14:09:31 -07:00
|
|
|
#endif
|
2005-04-16 15:20:36 -07:00
|
|
|
#ifdef CONFIG_CHR_DEV_SG
|
|
|
|
{
|
|
|
|
.procname = "sg-big-buff",
|
|
|
|
.data = &sg_big_buff,
|
|
|
|
.maxlen = sizeof (int),
|
|
|
|
.mode = 0444,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
|
|
|
#endif
|
|
|
|
#ifdef CONFIG_BSD_PROCESS_ACCT
|
|
|
|
{
|
|
|
|
.procname = "acct",
|
|
|
|
.data = &acct_parm,
|
|
|
|
.maxlen = 3*sizeof(int),
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
|
|
|
#endif
|
|
|
|
#ifdef CONFIG_MAGIC_SYSRQ
|
|
|
|
{
|
|
|
|
.procname = "sysrq",
|
2006-12-13 00:34:36 -08:00
|
|
|
.data = &__sysrq_enabled,
|
2005-04-16 15:20:36 -07:00
|
|
|
.maxlen = sizeof (int),
|
|
|
|
.mode = 0644,
|
2010-03-21 22:31:26 -07:00
|
|
|
.proc_handler = sysrq_sysctl_handler,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
|
|
|
#endif
|
2006-10-19 23:28:34 -07:00
|
|
|
#ifdef CONFIG_PROC_SYSCTL
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
|
|
|
.procname = "cad_pid",
|
2006-10-02 02:19:00 -07:00
|
|
|
.data = NULL,
|
2005-04-16 15:20:36 -07:00
|
|
|
.maxlen = sizeof (int),
|
|
|
|
.mode = 0600,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_do_cad_pid,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
2006-10-19 23:28:34 -07:00
|
|
|
#endif
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
|
|
|
.procname = "threads-max",
|
2015-04-16 12:47:50 -07:00
|
|
|
.data = NULL,
|
2005-04-16 15:20:36 -07:00
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2015-04-16 12:47:50 -07:00
|
|
|
.proc_handler = sysctl_max_threads,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "random",
|
|
|
|
.mode = 0555,
|
|
|
|
.child = random_table,
|
|
|
|
},
|
2011-04-01 17:07:50 -04:00
|
|
|
{
|
|
|
|
.procname = "usermodehelper",
|
|
|
|
.mode = 0555,
|
|
|
|
.child = usermodehelper_table,
|
|
|
|
},
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
|
|
|
.procname = "overflowuid",
|
|
|
|
.data = &overflowuid,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec_minmax,
|
2005-04-16 15:20:36 -07:00
|
|
|
.extra1 = &minolduid,
|
|
|
|
.extra2 = &maxolduid,
|
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "overflowgid",
|
|
|
|
.data = &overflowgid,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec_minmax,
|
2005-04-16 15:20:36 -07:00
|
|
|
.extra1 = &minolduid,
|
|
|
|
.extra2 = &maxolduid,
|
|
|
|
},
|
2006-01-06 00:19:28 -08:00
|
|
|
#ifdef CONFIG_S390
|
2005-04-16 15:20:36 -07:00
|
|
|
#ifdef CONFIG_MATHEMU
|
|
|
|
{
|
|
|
|
.procname = "ieee_emulation_warnings",
|
|
|
|
.data = &sysctl_ieee_emulation_warnings,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
|
|
|
#endif
|
|
|
|
{
|
|
|
|
.procname = "userprocess_debug",
|
2010-05-17 10:00:21 +02:00
|
|
|
.data = &show_unhandled_signals,
|
2005-04-16 15:20:36 -07:00
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
|
|
|
#endif
|
|
|
|
{
|
|
|
|
.procname = "pid_max",
|
|
|
|
.data = &pid_max,
|
|
|
|
.maxlen = sizeof (int),
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec_minmax,
|
2005-04-16 15:20:36 -07:00
|
|
|
.extra1 = &pid_max_min,
|
|
|
|
.extra2 = &pid_max_max,
|
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "panic_on_oops",
|
|
|
|
.data = &panic_on_oops,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
2008-02-08 04:21:25 -08:00
|
|
|
#if defined CONFIG_PRINTK
|
|
|
|
{
|
|
|
|
.procname = "printk",
|
|
|
|
.data = &console_loglevel,
|
|
|
|
.maxlen = 4*sizeof(int),
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2008-02-08 04:21:25 -08:00
|
|
|
},
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
|
|
|
.procname = "printk_ratelimit",
|
2008-07-25 01:45:58 -07:00
|
|
|
.data = &printk_ratelimit_state.interval,
|
2005-04-16 15:20:36 -07:00
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec_jiffies,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "printk_ratelimit_burst",
|
2008-07-25 01:45:58 -07:00
|
|
|
.data = &printk_ratelimit_state.burst,
|
2005-04-16 15:20:36 -07:00
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
2009-09-22 16:43:33 -07:00
|
|
|
{
|
|
|
|
.procname = "printk_delay",
|
|
|
|
.data = &printk_delay_msec,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec_minmax,
|
2009-09-22 16:43:33 -07:00
|
|
|
.extra1 = &zero,
|
|
|
|
.extra2 = &ten_thousand,
|
|
|
|
},
|
2010-11-11 14:05:18 -08:00
|
|
|
{
|
|
|
|
.procname = "dmesg_restrict",
|
|
|
|
.data = &dmesg_restrict,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2012-04-04 11:40:19 -07:00
|
|
|
.proc_handler = proc_dointvec_minmax_sysadmin,
|
2010-11-11 14:05:18 -08:00
|
|
|
.extra1 = &zero,
|
|
|
|
.extra2 = &one,
|
|
|
|
},
|
kptr_restrict for hiding kernel pointers from unprivileged users
Add the %pK printk format specifier and the /proc/sys/kernel/kptr_restrict
sysctl.
The %pK format specifier is designed to hide exposed kernel pointers,
specifically via /proc interfaces. Exposing these pointers provides an
easy target for kernel write vulnerabilities, since they reveal the
locations of writable structures containing easily triggerable function
pointers. The behavior of %pK depends on the kptr_restrict sysctl.
If kptr_restrict is set to 0, no deviation from the standard %p behavior
occurs. If kptr_restrict is set to 1 (the default) and the current user
(intended to be a reader via seq_printf(), etc.) does not have CAP_SYSLOG
(currently in the LSM tree), kernel pointers using %pK are printed as 0's.
If kptr_restrict is set to 2, kernel pointers using %pK are printed as
0's regardless of privileges. Replacing with 0's was chosen over the
default "(null)", which cannot be parsed by userland %p, which expects
"(nil)".
[akpm@linux-foundation.org: check for IRQ context when !kptr_restrict, save an indent level, s/WARN/WARN_ONCE/]
[akpm@linux-foundation.org: coding-style fixup]
[randy.dunlap@oracle.com: fix kernel/sysctl.c warning]
Signed-off-by: Dan Rosenberg <drosenberg@vsecurity.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: James Morris <jmorris@namei.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Thomas Graf <tgraf@infradead.org>
Cc: Eugene Teo <eugeneteo@kernel.org>
Cc: Kees Cook <kees.cook@canonical.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David S. Miller <davem@davemloft.net>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Eric Paris <eparis@parisplace.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
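The three kptr_restrict settings described above reduce to a small decision table. The sketch below models only that decision in user space; `model_pK_hidden` is a hypothetical helper, not the printk implementation, and the capability check is passed in as a plain boolean.

```c
#include <stdbool.h>

/*
 * Hypothetical model: returns true when a %pK pointer would be printed
 * as zeros for the given kptr_restrict value and reader privilege.
 *   0: never hidden (same as plain %p)
 *   1: hidden unless the reader has CAP_SYSLOG
 *   2: always hidden, regardless of privileges
 */
static bool model_pK_hidden(int kptr_restrict, bool has_cap_syslog)
{
    switch (kptr_restrict) {
    case 0:
        return false;
    case 1:
        return !has_cap_syslog;
    default:
        return true;
    }
}
```

The sysctl table entry bounds the value to the range 0..2 through .extra1 and .extra2, so the default branch of this model corresponds to the setting 2.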
2011-01-12 16:59:41 -08:00
|
|
|
{
|
|
|
|
.procname = "kptr_restrict",
|
|
|
|
.data = &kptr_restrict,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2012-04-04 11:40:19 -07:00
|
|
|
.proc_handler = proc_dointvec_minmax_sysadmin,
|
2011-01-12 16:59:41 -08:00
|
|
|
.extra1 = &zero,
|
|
|
|
.extra2 = &two,
|
|
|
|
},
|
2010-11-15 21:17:27 -08:00
|
|
|
#endif
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
|
|
|
.procname = "ngroups_max",
|
|
|
|
.data = &ngroups_max,
|
|
|
|
.maxlen = sizeof (int),
|
|
|
|
.mode = 0444,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
2011-10-31 17:11:20 -07:00
|
|
|
{
|
|
|
|
.procname = "cap_last_cap",
|
|
|
|
.data = (void *)&cap_last_cap,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0444,
|
|
|
|
.proc_handler = proc_dointvec,
|
|
|
|
},
|
2010-05-07 17:11:44 -04:00
|
|
|
#if defined(CONFIG_LOCKUP_DETECTOR)
|
2010-02-12 17:19:19 -05:00
|
|
|
{
|
2010-05-07 17:11:44 -04:00
|
|
|
.procname = "watchdog",
|
2013-05-19 20:45:15 +02:00
|
|
|
.data = &watchdog_user_enabled,
|
2010-02-12 17:19:19 -05:00
|
|
|
.maxlen = sizeof (int),
|
|
|
|
.mode = 0644,
|
2015-04-14 15:44:13 -07:00
|
|
|
.proc_handler = proc_watchdog,
|
2011-05-22 22:10:22 -07:00
|
|
|
.extra1 = &zero,
|
|
|
|
.extra2 = &one,
|
2010-05-07 17:11:44 -04:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "watchdog_thresh",
|
2011-05-22 22:10:22 -07:00
|
|
|
.data = &watchdog_thresh,
|
2010-05-07 17:11:44 -04:00
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2015-04-14 15:44:13 -07:00
|
|
|
.proc_handler = proc_watchdog_thresh,
|
2013-05-17 10:31:04 +08:00
|
|
|
.extra1 = &zero,
|
2010-05-07 17:11:44 -04:00
|
|
|
.extra2 = &sixty,
|
2010-02-12 17:19:19 -05:00
|
|
|
},
|
2015-04-14 15:44:13 -07:00
|
|
|
{
|
|
|
|
.procname = "nmi_watchdog",
|
|
|
|
.data = &nmi_watchdog_enabled,
|
|
|
|
.maxlen = sizeof (int),
|
|
|
|
.mode = 0644,
|
|
|
|
.proc_handler = proc_nmi_watchdog,
|
|
|
|
.extra1 = &zero,
|
|
|
|
#if defined(CONFIG_HAVE_NMI_WATCHDOG) || defined(CONFIG_HARDLOCKUP_DETECTOR)
|
|
|
|
.extra2 = &one,
|
|
|
|
#else
|
|
|
|
.extra2 = &zero,
|
|
|
|
#endif
|
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "soft_watchdog",
|
|
|
|
.data = &soft_watchdog_enabled,
|
|
|
|
.maxlen = sizeof (int),
|
|
|
|
.mode = 0644,
|
|
|
|
.proc_handler = proc_soft_watchdog,
|
|
|
|
.extra1 = &zero,
|
|
|
|
.extra2 = &one,
|
|
|
|
},
|
2015-06-24 16:55:45 -07:00
|
|
|
{
|
|
|
|
.procname = "watchdog_cpumask",
|
|
|
|
.data = &watchdog_cpumask_bits,
|
|
|
|
.maxlen = NR_CPUS,
|
|
|
|
.mode = 0644,
|
|
|
|
.proc_handler = proc_watchdog_cpumask,
|
|
|
|
},
|
2010-05-07 17:11:46 -04:00
|
|
|
{
|
|
|
|
.procname = "softlockup_panic",
|
|
|
|
.data = &softlockup_panic,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
|
|
|
.proc_handler = proc_dointvec_minmax,
|
|
|
|
.extra1 = &zero,
|
|
|
|
.extra2 = &one,
|
|
|
|
},
|
2015-11-05 18:44:44 -08:00
|
|
|
#ifdef CONFIG_HARDLOCKUP_DETECTOR
|
|
|
|
{
|
|
|
|
.procname = "hardlockup_panic",
|
|
|
|
.data = &hardlockup_panic,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
|
|
|
.proc_handler = proc_dointvec_minmax,
|
|
|
|
.extra1 = &zero,
|
|
|
|
.extra2 = &one,
|
|
|
|
},
|
|
|
|
#endif
|
2014-06-23 13:22:05 -07:00
|
|
|
#ifdef CONFIG_SMP
|
|
|
|
{
|
|
|
|
.procname = "softlockup_all_cpu_backtrace",
|
|
|
|
.data = &sysctl_softlockup_all_cpu_backtrace,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
|
|
|
.proc_handler = proc_dointvec_minmax,
|
|
|
|
.extra1 = &zero,
|
|
|
|
.extra2 = &one,
|
|
|
|
},
|
2015-11-05 18:44:41 -08:00
|
|
|
{
|
|
|
|
.procname = "hardlockup_all_cpu_backtrace",
|
|
|
|
.data = &sysctl_hardlockup_all_cpu_backtrace,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
|
|
|
.proc_handler = proc_dointvec_minmax,
|
|
|
|
.extra1 = &zero,
|
|
|
|
.extra2 = &one,
|
|
|
|
},
|
2014-06-23 13:22:05 -07:00
|
|
|
#endif /* CONFIG_SMP */
|
2010-11-29 17:07:17 -05:00
|
|
|
#endif
|
|
|
|
#if defined(CONFIG_X86_LOCAL_APIC) && defined(CONFIG_X86)
|
|
|
|
{
|
|
|
|
.procname = "unknown_nmi_panic",
|
|
|
|
.data = &unknown_nmi_panic,
|
|
|
|
.maxlen = sizeof (int),
|
|
|
|
.mode = 0644,
|
|
|
|
.proc_handler = proc_dointvec,
|
|
|
|
},
|
2010-02-12 17:19:19 -05:00
|
|
|
#endif
|
2005-04-16 15:20:36 -07:00
|
|
|
#if defined(CONFIG_X86)
|
2006-09-26 10:52:27 +02:00
|
|
|
{
|
|
|
|
.procname = "panic_on_unrecovered_nmi",
|
|
|
|
.data = &panic_on_unrecovered_nmi,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2006-09-26 10:52:27 +02:00
|
|
|
},
|
2009-06-24 14:32:11 -07:00
|
|
|
{
|
|
|
|
.procname = "panic_on_io_nmi",
|
|
|
|
.data = &panic_on_io_nmi,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2009-06-24 14:32:11 -07:00
|
|
|
},
|
2011-11-29 15:08:36 +09:00
|
|
|
#ifdef CONFIG_DEBUG_STACKOVERFLOW
|
|
|
|
{
|
|
|
|
.procname = "panic_on_stackoverflow",
|
|
|
|
.data = &sysctl_panic_on_stackoverflow,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
|
|
|
.proc_handler = proc_dointvec,
|
|
|
|
},
|
|
|
|
#endif
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
|
|
|
.procname = "bootloader_type",
|
|
|
|
.data = &bootloader_type,
|
|
|
|
.maxlen = sizeof (int),
|
|
|
|
.mode = 0444,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
2009-05-07 16:54:11 -07:00
|
|
|
{
|
|
|
|
.procname = "bootloader_version",
|
|
|
|
.data = &bootloader_version,
|
|
|
|
.maxlen = sizeof (int),
|
|
|
|
.mode = 0444,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2009-05-07 16:54:11 -07:00
|
|
|
},
|
2006-12-07 02:14:11 +01:00
|
|
|
{
|
|
|
|
.procname = "kstack_depth_to_print",
|
|
|
|
.data = &kstack_depth_to_print,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2006-12-07 02:14:11 +01:00
|
|
|
},
|
2008-01-30 13:30:05 +01:00
|
|
|
{
|
|
|
|
.procname = "io_delay_type",
|
|
|
|
.data = &io_delay_type,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2008-01-30 13:30:05 +01:00
|
|
|
},
|
2005-04-16 15:20:36 -07:00
|
|
|
#endif
|
2006-02-20 18:28:07 -08:00
|
|
|
#if defined(CONFIG_MMU)
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
|
|
|
.procname = "randomize_va_space",
|
|
|
|
.data = &randomize_va_space,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
2006-02-20 18:28:07 -08:00
|
|
|
#endif
|
2006-01-14 13:21:00 -08:00
|
|
|
#if defined(CONFIG_S390) && defined(CONFIG_SMP)
|
2005-07-27 11:44:57 -07:00
|
|
|
{
|
|
|
|
.procname = "spin_retry",
|
|
|
|
.data = &spin_retry,
|
|
|
|
.maxlen = sizeof (int),
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2005-07-27 11:44:57 -07:00
|
|
|
},
|
2006-02-20 18:27:58 -08:00
|
|
|
#endif
|
2007-07-28 03:33:16 -04:00
|
|
|
#if defined(CONFIG_ACPI_SLEEP) && defined(CONFIG_X86)
|
2006-02-20 18:27:58 -08:00
|
|
|
{
|
|
|
|
.procname = "acpi_video_flags",
|
2007-07-19 01:47:41 -07:00
|
|
|
.data = &acpi_realmode_flags,
|
2006-02-20 18:27:58 -08:00
|
|
|
.maxlen = sizeof (unsigned long),
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_doulongvec_minmax,
|
2006-02-20 18:27:58 -08:00
|
|
|
},
|
2006-02-28 09:42:23 -08:00
|
|
|
#endif
|
2013-01-09 20:06:28 +05:30
|
|
|
#ifdef CONFIG_SYSCTL_ARCH_UNALIGN_NO_WARN
|
2006-02-28 09:42:23 -08:00
|
|
|
{
|
|
|
|
.procname = "ignore-unaligned-usertrap",
|
|
|
|
.data = &no_unaligned_warning,
|
|
|
|
.maxlen = sizeof (int),
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2006-02-28 09:42:23 -08:00
|
|
|
},
|
2013-01-09 20:06:28 +05:30
|
|
|
#endif
|
|
|
|
#ifdef CONFIG_IA64
|
2009-01-15 10:38:56 -08:00
|
|
|
{
|
|
|
|
.procname = "unaligned-dump-stack",
|
|
|
|
.data = &unaligned_dump_stack,
|
|
|
|
.maxlen = sizeof (int),
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2009-01-15 10:38:56 -08:00
|
|
|
},
|
2006-06-26 13:56:52 +02:00
|
|
|
#endif
|
2009-01-15 11:08:40 -08:00
|
|
|
#ifdef CONFIG_DETECT_HUNG_TASK
|
|
|
|
{
|
|
|
|
.procname = "hung_task_panic",
|
|
|
|
.data = &sysctl_hung_task_panic,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec_minmax,
|
2009-01-15 11:08:40 -08:00
|
|
|
.extra1 = &zero,
|
|
|
|
.extra2 = &one,
|
|
|
|
},
|
2008-01-25 21:08:02 +01:00
|
|
|
{
|
|
|
|
.procname = "hung_task_check_count",
|
|
|
|
.data = &sysctl_hung_task_check_count,
|
2013-09-23 16:43:58 +08:00
|
|
|
.maxlen = sizeof(int),
|
2008-01-25 21:08:02 +01:00
|
|
|
.mode = 0644,
|
2013-09-23 16:43:58 +08:00
|
|
|
.proc_handler = proc_dointvec_minmax,
|
|
|
|
.extra1 = &zero,
|
2008-01-25 21:08:02 +01:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "hung_task_timeout_secs",
|
|
|
|
.data = &sysctl_hung_task_timeout_secs,
|
2008-01-25 21:08:34 +01:00
|
|
|
.maxlen = sizeof(unsigned long),
|
2008-01-25 21:08:02 +01:00
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dohung_task_timeout_secs,
|
2014-04-07 15:38:57 -07:00
|
|
|
.extra2 = &hung_task_timeout_max,
|
2008-01-25 21:08:02 +01:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "hung_task_warnings",
|
|
|
|
.data = &sysctl_hung_task_warnings,
|
2014-01-20 17:34:13 +00:00
|
|
|
.maxlen = sizeof(int),
|
2008-01-25 21:08:02 +01:00
|
|
|
.mode = 0644,
|
2014-01-20 17:34:13 +00:00
|
|
|
.proc_handler = proc_dointvec_minmax,
|
|
|
|
.extra1 = &neg_one,
|
2008-01-25 21:08:02 +01:00
|
|
|
},
|
2007-10-16 23:26:09 -07:00
|
|
|
#endif
|
2006-06-26 13:56:52 +02:00
|
|
|
#ifdef CONFIG_COMPAT
|
|
|
|
{
|
|
|
|
.procname = "compat-log",
|
|
|
|
.data = &compat_log,
|
|
|
|
.maxlen = sizeof (int),
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2006-06-26 13:56:52 +02:00
|
|
|
},
|
2005-07-27 11:44:57 -07:00
|
|
|
#endif
|
2006-06-27 02:54:53 -07:00
|
|
|
#ifdef CONFIG_RT_MUTEXES
|
|
|
|
{
|
|
|
|
.procname = "max_lock_depth",
|
|
|
|
.data = &max_lock_depth,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2006-06-27 02:54:53 -07:00
|
|
|
},
|
2007-05-08 00:26:04 -07:00
|
|
|
#endif
|
2007-07-17 18:37:02 -07:00
|
|
|
{
|
|
|
|
.procname = "poweroff_cmd",
|
|
|
|
.data = &poweroff_cmd,
|
|
|
|
.maxlen = POWEROFF_CMD_PATH_LEN,
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dostring,
|
2007-07-17 18:37:02 -07:00
|
|
|
},
|
2008-04-29 01:01:32 -07:00
|
|
|
#ifdef CONFIG_KEYS
|
|
|
|
{
|
|
|
|
.procname = "keys",
|
|
|
|
.mode = 0555,
|
|
|
|
.child = key_sysctls,
|
|
|
|
},
|
|
|
|
#endif
|
perf: Do the big rename: Performance Counters -> Performance Events
Bye-bye Performance Counters, welcome Performance Events!
In the past few months the perfcounters subsystem has grown out its
initial role of counting hardware events, and has become (and is
becoming) a much broader generic event enumeration, reporting, logging,
monitoring, analysis facility.
Naming its core object 'perf_counter' and naming the subsystem
'perfcounters' has become more and more of a misnomer. With pending
code like hw-breakpoints support the 'counter' name is less and
less appropriate.
All in one, we've decided to rename the subsystem to 'performance
events' and to propagate this rename through all fields, variables
and API names. (in an ABI compatible fashion)
The word 'event' is also a bit shorter than 'counter' - which makes
it slightly more convenient to write/handle as well.
Thanks goes to Stephane Eranian who first observed this misnomer and
suggested a rename.
User-space tooling and ABI compatibility is not affected - this patch
should be function-invariant. (Also, defconfigs were not touched to
keep the size down.)
This patch has been generated via the following script:
  FILES=$(find * -type f | grep -vE 'oprofile|[^K]config')
  sed -i \
    -e 's/PERF_EVENT_/PERF_RECORD_/g' \
    -e 's/PERF_COUNTER/PERF_EVENT/g' \
    -e 's/perf_counter/perf_event/g' \
    -e 's/nb_counters/nb_events/g' \
    -e 's/swcounter/swevent/g' \
    -e 's/tpcounter_event/tp_event/g' \
    $FILES
  for N in $(find . -name perf_counter.[ch]); do
    M=$(echo $N | sed 's/perf_counter/perf_event/g')
    mv $N $M
  done
  FILES=$(find . -name perf_event.*)
  sed -i \
    -e 's/COUNTER_MASK/REG_MASK/g' \
    -e 's/COUNTER/EVENT/g' \
    -e 's/\<event\>/event_id/g' \
    -e 's/counter/event/g' \
    -e 's/Counter/Event/g' \
    $FILES
... to keep it as correct as possible. This script can also be
used by anyone who has pending perfcounters patches - it converts
a Linux kernel tree over to the new naming. We tried to time this
change to the point in time where the amount of pending patches
is the smallest: the end of the merge window.
Namespace clashes were fixed up in a preparatory patch - and some
stylistic fallout will be fixed up in a subsequent patch.
( NOTE: 'counters' are still the proper terminology when we deal
with hardware registers - and these sed scripts are a bit
over-eager in renaming them. I've undone some of that, but
in case there's something left where 'counter' would be
better than 'event' we can undo that on an individual basis
instead of touching an otherwise nicely automated patch. )
Suggested-by: Stephane Eranian <eranian@google.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Paul Mackerras <paulus@samba.org>
Reviewed-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <linux-arch@vger.kernel.org>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-09-21 12:02:48 +02:00
|
|
|
#ifdef CONFIG_PERF_EVENTS
|
2011-06-03 17:54:40 -04:00
|
|
|
/*
|
|
|
|
* User-space scripts rely on the existence of this file
|
|
|
|
* as a feature check for perf_events being enabled.
|
|
|
|
*
|
|
|
|
* So it's an ABI, do not remove!
|
|
|
|
*/
|
2009-04-09 10:53:45 +02:00
|
|
|
{
|
2009-09-21 12:02:48 +02:00
|
|
|
.procname = "perf_event_paranoid",
|
|
|
|
.data = &sysctl_perf_event_paranoid,
|
|
|
|
.maxlen = sizeof(sysctl_perf_event_paranoid),
|
2009-04-09 10:53:45 +02:00
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2009-04-09 10:53:45 +02:00
|
|
|
},
|
2009-05-05 17:50:24 +02:00
|
|
|
{
|
2009-09-21 12:02:48 +02:00
|
|
|
.procname = "perf_event_mlock_kb",
|
|
|
|
.data = &sysctl_perf_event_mlock,
|
|
|
|
.maxlen = sizeof(sysctl_perf_event_mlock),
|
2009-05-05 17:50:24 +02:00
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2009-05-05 17:50:24 +02:00
|
|
|
},
|
2009-05-25 17:39:05 +02:00
|
|
|
{
|
2009-09-21 12:02:48 +02:00
|
|
|
.procname = "perf_event_max_sample_rate",
|
|
|
|
.data = &sysctl_perf_event_sample_rate,
|
|
|
|
.maxlen = sizeof(sysctl_perf_event_sample_rate),
|
2009-05-25 17:39:05 +02:00
|
|
|
.mode = 0644,
|
2011-02-16 11:22:34 +01:00
|
|
|
.proc_handler = perf_proc_update_handler,
|
2013-09-25 14:29:37 +02:00
|
|
|
.extra1 = &one,
|
2009-05-25 17:39:05 +02:00
|
|
|
},
|
2013-06-21 08:51:36 -07:00
|
|
|
{
|
|
|
|
.procname = "perf_cpu_time_max_percent",
|
|
|
|
.data = &sysctl_perf_cpu_time_max_percent,
|
|
|
|
.maxlen = sizeof(sysctl_perf_cpu_time_max_percent),
|
|
|
|
.mode = 0644,
|
|
|
|
.proc_handler = perf_cpu_time_max_percent_handler,
|
|
|
|
.extra1 = &zero,
|
|
|
|
.extra2 = &one_hundred,
|
|
|
|
},
|
2009-04-09 10:53:45 +02:00
|
|
|
#endif
|
2008-04-04 00:51:41 +02:00
|
|
|
#ifdef CONFIG_KMEMCHECK
|
|
|
|
{
|
|
|
|
.procname = "kmemcheck",
|
|
|
|
.data = &kmemcheck_enabled,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2008-04-04 00:51:41 +02:00
|
|
|
},
|
2009-09-15 21:53:11 +02:00
|
|
|
#endif
|
2014-12-10 15:45:50 -08:00
|
|
|
{
|
|
|
|
.procname = "panic_on_warn",
|
|
|
|
.data = &panic_on_warn,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
|
|
|
.proc_handler = proc_dointvec_minmax,
|
|
|
|
.extra1 = &zero,
|
|
|
|
.extra2 = &one,
|
|
|
|
},
|
2015-05-26 22:50:33 +00:00
|
|
|
#if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
|
|
|
|
{
|
|
|
|
.procname = "timer_migration",
|
|
|
|
.data = &sysctl_timer_migration,
|
|
|
|
.maxlen = sizeof(unsigned int),
|
|
|
|
.mode = 0644,
|
|
|
|
.proc_handler = timer_migration_handler,
|
|
|
|
},
|
bpf: enable non-root eBPF programs
In order to let unprivileged users load and execute eBPF programs
teach verifier to prevent pointer leaks.
Verifier will prevent
- any arithmetic on pointers
(except R10+Imm which is used to compute stack addresses)
- comparison of pointers
(except if (map_value_ptr == 0) ... )
- passing pointers to helper functions
- indirectly passing pointers in stack to helper functions
- returning pointer from bpf program
- storing pointers into ctx or maps
Spill/fill of pointers into stack is allowed, but mangling
of pointers stored in the stack or reading them byte by byte is not.
Within bpf programs the pointers do exist, since programs need to
be able to access maps, pass skb pointer to LD_ABS insns, etc
but programs cannot pass such pointer values to the outside
or obfuscate them.
Only allow BPF_PROG_TYPE_SOCKET_FILTER unprivileged programs,
so that socket filters (tcpdump), af_packet (quic acceleration)
and future kcm can use it.
tracing and tc cls/act program types still require root permissions,
since tracing actually needs to be able to see all kernel pointers
and tc is for root only.
For example, the following unprivileged socket filter program is allowed:
int bpf_prog1(struct __sk_buff *skb)
{
	u32 index = load_byte(skb, ETH_HLEN + offsetof(struct iphdr, protocol));
	u64 *value = bpf_map_lookup_elem(&my_map, &index);
	if (value)
		*value += skb->len;
	return 0;
}
but the following program is not:
int bpf_prog1(struct __sk_buff *skb)
{
	u32 index = load_byte(skb, ETH_HLEN + offsetof(struct iphdr, protocol));
	u64 *value = bpf_map_lookup_elem(&my_map, &index);
	if (value)
		*value += (u64) skb;
	return 0;
}
since it would leak the kernel address into the map.
Unprivileged socket filter bpf programs have access to the
following helper functions:
- map lookup/update/delete (but they cannot store kernel pointers into them)
- get_random (it's already exposed to unprivileged user space)
- get_smp_processor_id
- tail_call into another socket filter program
- ktime_get_ns
The feature is controlled by sysctl kernel.unprivileged_bpf_disabled.
This toggle defaults to off (0), but can be set true (1). Once true,
bpf programs and maps cannot be accessed from unprivileged process,
and the toggle cannot be set back to false.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-07 22:23:21 -07:00
|
|
|
#endif
|
|
|
|
#ifdef CONFIG_BPF_SYSCALL
|
|
|
|
{
|
|
|
|
.procname = "unprivileged_bpf_disabled",
|
|
|
|
.data = &sysctl_unprivileged_bpf_disabled,
|
|
|
|
.maxlen = sizeof(sysctl_unprivileged_bpf_disabled),
|
|
|
|
.mode = 0644,
|
|
|
|
/* only handle a transition from default "0" to "1" */
|
|
|
|
.proc_handler = proc_dointvec_minmax,
|
|
|
|
.extra1 = &one,
|
|
|
|
.extra2 = &one,
|
|
|
|
},
|
2015-05-26 22:50:33 +00:00
|
|
|
#endif
|
2014-01-10 14:11:24 -08:00
|
|
|
#if defined(CONFIG_ARM) || defined(CONFIG_ARM64)
|
|
|
|
{
|
|
|
|
.procname = "boot_reason",
|
|
|
|
.data = &boot_reason,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0444,
|
|
|
|
.proc_handler = proc_dointvec,
|
|
|
|
},
|
|
|
|
|
|
|
|
{
|
|
|
|
.procname = "cold_boot",
|
|
|
|
.data = &cold_boot,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0444,
|
|
|
|
.proc_handler = proc_dointvec,
|
|
|
|
},
|
|
|
|
#endif
|
|
|
|
/*
|
|
|
|
* NOTE: do not add new entries to this table unless you have read
|
|
|
|
* Documentation/sysctl/ctl_unnumbered.txt
|
|
|
|
*/
|
2009-04-03 02:30:53 -07:00
|
|
|
{ }
|
2005-04-16 15:20:36 -07:00
|
|
|
};
|
|
|
|
|
2007-10-18 03:05:22 -07:00
|
|
|
static struct ctl_table vm_table[] = {
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
|
|
|
.procname = "overcommit_memory",
|
|
|
|
.data = &sysctl_overcommit_memory,
|
|
|
|
.maxlen = sizeof(sysctl_overcommit_memory),
|
|
|
|
.mode = 0644,
|
2011-03-23 16:43:09 -07:00
|
|
|
.proc_handler = proc_dointvec_minmax,
|
|
|
|
.extra1 = &zero,
|
|
|
|
.extra2 = &two,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
2006-06-23 02:03:13 -07:00
|
|
|
{
|
|
|
|
.procname = "panic_on_oom",
|
|
|
|
.data = &sysctl_panic_on_oom,
|
|
|
|
.maxlen = sizeof(sysctl_panic_on_oom),
|
|
|
|
.mode = 0644,
|
2011-03-23 16:43:09 -07:00
|
|
|
.proc_handler = proc_dointvec_minmax,
|
|
|
|
.extra1 = &zero,
|
|
|
|
.extra2 = &two,
|
2006-06-23 02:03:13 -07:00
|
|
|
},
|
2007-10-16 23:25:56 -07:00
|
|
|
{
|
|
|
|
.procname = "oom_kill_allocating_task",
|
|
|
|
.data = &sysctl_oom_kill_allocating_task,
|
|
|
|
.maxlen = sizeof(sysctl_oom_kill_allocating_task),
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2007-10-16 23:25:56 -07:00
|
|
|
},
|
oom: add sysctl to enable task memory dump
Adds a new sysctl, 'oom_dump_tasks', that enables the kernel to produce a
dump of all system tasks (excluding kernel threads) when performing an
OOM-killing. Information includes pid, uid, tgid, vm size, rss, cpu,
oom_adj score, and name.
This is helpful for determining why there was an OOM condition and which
rogue task caused it.
It is configurable so that large systems, such as those with several
thousand tasks, do not incur a performance penalty associated with dumping
data they may not desire.
If an OOM was triggered as a result of a memory controller, the tasklist
shall be filtered to exclude tasks that are not a member of the same
cgroup.
Cc: Andrea Arcangeli <andrea@suse.de>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-07 00:14:07 -08:00
|
|
|
{
|
|
|
|
.procname = "oom_dump_tasks",
|
|
|
|
.data = &sysctl_oom_dump_tasks,
|
|
|
|
.maxlen = sizeof(sysctl_oom_dump_tasks),
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2008-02-07 00:14:07 -08:00
|
|
|
},
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
|
|
|
.procname = "overcommit_ratio",
|
|
|
|
.data = &sysctl_overcommit_ratio,
|
|
|
|
.maxlen = sizeof(sysctl_overcommit_ratio),
|
|
|
|
.mode = 0644,
|
2014-01-21 15:49:14 -08:00
|
|
|
.proc_handler = overcommit_ratio_handler,
|
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "overcommit_kbytes",
|
|
|
|
.data = &sysctl_overcommit_kbytes,
|
|
|
|
.maxlen = sizeof(sysctl_overcommit_kbytes),
|
|
|
|
.mode = 0644,
|
|
|
|
.proc_handler = overcommit_kbytes_handler,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "page-cluster",
|
|
|
|
.data = &page_cluster,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2011-03-23 16:43:09 -07:00
|
|
|
.proc_handler = proc_dointvec_minmax,
|
|
|
|
.extra1 = &zero,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "dirty_background_ratio",
|
|
|
|
.data = &dirty_background_ratio,
|
|
|
|
.maxlen = sizeof(dirty_background_ratio),
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = dirty_background_ratio_handler,
|
2005-04-16 15:20:36 -07:00
|
|
|
.extra1 = &zero,
|
|
|
|
.extra2 = &one_hundred,
|
|
|
|
},
|
mm: add dirty_background_bytes and dirty_bytes sysctls
This change introduces two new sysctls to /proc/sys/vm:
dirty_background_bytes and dirty_bytes.
dirty_background_bytes is the counterpart to dirty_background_ratio and
dirty_bytes is the counterpart to dirty_ratio.
With growing memory capacities of individual machines, it's no longer
sufficient to specify dirty thresholds as a percentage of the amount of
dirtyable memory over the entire system.
dirty_background_bytes and dirty_bytes specify quantities of memory, in
bytes, that represent the dirty limits for the entire system. If either
of these values is set, its value represents the amount of dirty memory
that is needed to commence either background or direct writeback.
When a `bytes' or `ratio' file is written, its counterpart becomes a
function of the written value. For example, if dirty_bytes is written to
be 8192, 8K of memory is required to commence direct writeback.
dirty_ratio is then functionally equivalent to 8K / the amount of
dirtyable memory:
	dirtyable_memory = free pages + mapped pages + file cache
	dirty_background_bytes = dirty_background_ratio * dirtyable_memory
		-or-
	dirty_background_ratio = dirty_background_bytes / dirtyable_memory
		AND
	dirty_bytes = dirty_ratio * dirtyable_memory
		-or-
	dirty_ratio = dirty_bytes / dirtyable_memory
Only one of dirty_background_bytes and dirty_background_ratio may be
specified at a time, and only one of dirty_bytes and dirty_ratio may be
specified. When one sysctl is written, the other appears as 0 when read.
The `bytes' files operate on a page size granularity since dirty limits
are compared with ZVC values, which are in page units.
Prior to this change, the minimum dirty_ratio was 5 as implemented by
get_dirty_limits() although /proc/sys/vm/dirty_ratio would show any user
written value between 0 and 100. This restriction is maintained, but
dirty_bytes has a lower limit of only one page.
Also prior to this change, the dirty_background_ratio could not equal or
exceed dirty_ratio. This restriction is maintained in addition to
restricting dirty_background_bytes. If either background threshold equals
or exceeds that of the dirty threshold, it is implicitly set to half the
dirty threshold.
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Andrea Righi <righi.andrea@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06 14:39:31 -08:00
|
|
|
{
|
|
|
|
.procname = "dirty_background_bytes",
|
|
|
|
.data = &dirty_background_bytes,
|
|
|
|
.maxlen = sizeof(dirty_background_bytes),
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = dirty_background_bytes_handler,
|
2009-02-11 13:04:23 -08:00
|
|
|
.extra1 = &one_ul,
|
2009-01-06 14:39:31 -08:00
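The threshold rules in the commit message above can be sketched in plain C. This is a minimal userspace illustration, not the kernel's implementation: `struct dirty_knobs`, `struct dirty_limits`, `bytes_to_pages()` and `compute_dirty_limits()` are hypothetical names, and the caller is assumed to supply the dirtyable page count and the page size.

```c
#include <assert.h>

/* Hypothetical container for the four sysctl values (illustrative only). */
struct dirty_knobs {
	unsigned long dirty_bytes;            /* 0 means "use dirty_ratio" */
	unsigned int  dirty_ratio;            /* percent, 0..100 */
	unsigned long dirty_background_bytes; /* 0 means "use the background ratio" */
	unsigned int  dirty_background_ratio; /* percent, 0..100 */
};

/* Resulting thresholds, in pages, since the limits are compared in page units. */
struct dirty_limits {
	unsigned long background_pages;
	unsigned long dirty_pages;
};

/* Round a byte count up to whole pages (page size granularity). */
unsigned long bytes_to_pages(unsigned long bytes, unsigned long page_size)
{
	return (bytes + page_size - 1) / page_size;
}

struct dirty_limits compute_dirty_limits(const struct dirty_knobs *k,
					 unsigned long dirtyable_pages,
					 unsigned long page_size)
{
	struct dirty_limits lim;

	assert(page_size > 0);

	if (k->dirty_bytes) {
		/* A bytes value takes precedence; its lower limit is one page. */
		lim.dirty_pages = bytes_to_pages(k->dirty_bytes, page_size);
	} else {
		/* The ratio keeps its historical minimum of 5 percent. */
		unsigned int ratio = k->dirty_ratio < 5 ? 5 : k->dirty_ratio;

		lim.dirty_pages = ratio * dirtyable_pages / 100;
	}

	if (k->dirty_background_bytes)
		lim.background_pages =
			bytes_to_pages(k->dirty_background_bytes, page_size);
	else
		lim.background_pages =
			k->dirty_background_ratio * dirtyable_pages / 100;

	/* A background threshold at or above the dirty threshold is halved. */
	if (lim.background_pages >= lim.dirty_pages)
		lim.background_pages = lim.dirty_pages / 2;

	return lim;
}
```

With a 4096-byte page, writing 8096 to `dirty_bytes` rounds up to two pages in this sketch, which is the "8K" from the message's example.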
|
|
|
},
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
|
|
|
.procname = "dirty_ratio",
|
|
|
|
.data = &vm_dirty_ratio,
|
|
|
|
.maxlen = sizeof(vm_dirty_ratio),
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = dirty_ratio_handler,
|
2005-04-16 15:20:36 -07:00
|
|
|
.extra1 = &zero,
|
|
|
|
.extra2 = &one_hundred,
|
|
|
|
},
|
2009-01-06 14:39:31 -08:00
|
|
|
{
|
|
|
|
.procname = "dirty_bytes",
|
|
|
|
.data = &vm_dirty_bytes,
|
|
|
|
.maxlen = sizeof(vm_dirty_bytes),
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = dirty_bytes_handler,
|
2009-04-30 15:08:57 -07:00
|
|
|
.extra1 = &dirty_bytes_min,
|
2009-01-06 14:39:31 -08:00
|
|
|
},
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
|
|
|
.procname = "dirty_writeback_centisecs",
|
2006-03-24 03:15:48 -08:00
|
|
|
.data = &dirty_writeback_interval,
|
|
|
|
.maxlen = sizeof(dirty_writeback_interval),
|
2005-04-16 15:20:36 -07:00
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = dirty_writeback_centisecs_handler,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "dirty_expire_centisecs",
|
2006-03-24 03:15:48 -08:00
|
|
|
.data = &dirty_expire_interval,
|
|
|
|
.maxlen = sizeof(dirty_expire_interval),
|
2005-04-16 15:20:36 -07:00
|
|
|
.mode = 0644,
|
2011-03-23 16:43:09 -07:00
|
|
|
.proc_handler = proc_dointvec_minmax,
|
|
|
|
.extra1 = &zero,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
2015-03-17 12:23:32 -04:00
|
|
|
{
|
|
|
|
.procname = "dirtytime_expire_seconds",
|
|
|
|
.data = &dirtytime_expire_interval,
|
|
|
|
.maxlen = sizeof(dirtytime_expire_interval),
|
|
|
|
.mode = 0644,
|
|
|
|
.proc_handler = dirtytime_interval_handler,
|
|
|
|
.extra1 = &zero,
|
|
|
|
},
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
2012-07-31 16:41:52 -07:00
|
|
|
.procname = "nr_pdflush_threads",
|
|
|
|
.mode = 0444 /* read-only */,
|
|
|
|
.proc_handler = pdflush_proc_obsolete,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "swappiness",
|
|
|
|
.data = &vm_swappiness,
|
|
|
|
.maxlen = sizeof(vm_swappiness),
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec_minmax,
|
2005-04-16 15:20:36 -07:00
|
|
|
.extra1 = &zero,
|
|
|
|
.extra2 = &one_hundred,
|
|
|
|
},
|
|
|
|
#ifdef CONFIG_HUGETLB_PAGE
|
hugetlb: derive huge pages nodes allowed from task mempolicy
This patch derives a "nodes_allowed" node mask from the numa mempolicy of
the task modifying the number of persistent huge pages to control the
allocation, freeing and adjusting of surplus huge pages when the pool page
count is modified via the new sysctl or sysfs attribute
"nr_hugepages_mempolicy". The nodes_allowed mask is derived as follows:
* For "default" [NULL] task mempolicy, a NULL nodemask_t pointer
is produced. This will cause the hugetlb subsystem to use
node_online_map as the "nodes_allowed". This preserves the
behavior before this patch.
* For "preferred" mempolicy, including explicit local allocation,
a nodemask with the single preferred node will be produced.
"local" policy will NOT track any internode migrations of the
task adjusting nr_hugepages.
* For "bind" and "interleave" policy, the mempolicy's nodemask
will be used.
* Other than to inform the construction of the nodes_allowed node
mask, the actual mempolicy mode is ignored. That is, all modes
behave like interleave over the resulting nodes_allowed mask
with no "fallback".
See the updated documentation [next patch] for more information
about the implications of this patch.
Examples:
Starting with:
Node 0 HugePages_Total: 0
Node 1 HugePages_Total: 0
Node 2 HugePages_Total: 0
Node 3 HugePages_Total: 0
Default behavior [with or without this patch] balances persistent
hugepage allocation across nodes [with sufficient contiguous memory]:
sysctl vm.nr_hugepages[_mempolicy]=32
yields:
Node 0 HugePages_Total: 8
Node 1 HugePages_Total: 8
Node 2 HugePages_Total: 8
Node 3 HugePages_Total: 8
Of course, we only have nr_hugepages_mempolicy with the patch,
but with default mempolicy, nr_hugepages_mempolicy behaves the
same as nr_hugepages.
Applying mempolicy--e.g., with numactl [using '-m' a.k.a.
'--membind' because it allows multiple nodes to be specified
and it's easy to type]--we can allocate huge pages on
individual nodes or sets of nodes. So, starting from the
condition above, with 8 huge pages per node, add 8 more to
node 2 using:
numactl -m 2 sysctl vm.nr_hugepages_mempolicy=40
This yields:
Node 0 HugePages_Total: 8
Node 1 HugePages_Total: 8
Node 2 HugePages_Total: 16
Node 3 HugePages_Total: 8
The incremental 8 huge pages were restricted to node 2 by the
specified mempolicy.
Similarly, we can use mempolicy to free persistent huge pages
from specified nodes:
numactl -m 0,1 sysctl vm.nr_hugepages_mempolicy=32
yields:
Node 0 HugePages_Total: 4
Node 1 HugePages_Total: 4
Node 2 HugePages_Total: 16
Node 3 HugePages_Total: 8
The 8 huge pages freed were balanced over nodes 0 and 1.
[rientjes@google.com: accommodate reworked NODEMASK_ALLOC]
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Andi Kleen <andi@firstfloor.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: Adam Litke <agl@us.ibm.com>
Cc: Andy Whitcroft <apw@canonical.com>
Cc: Eric Whitney <eric.whitney@hp.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-14 17:58:21 -08:00
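The derivation of the nodes_allowed mask and the four-node example from the commit message above can be modelled in a few lines of C. This is a hedged userspace sketch, not the hugetlb code: `enum policy_mode`, `nodes_allowed()`, `set_pool_size()` and `MAX_NODES` are hypothetical names, node masks are plain bitmasks (bit n set means node n is allowed), and simple round-robin is assumed for both allocation and freeing.

```c
#include <assert.h>

#define MAX_NODES 4	/* illustrative; matches the four-node example */

enum policy_mode {
	POLICY_DEFAULT,
	POLICY_PREFERRED,
	POLICY_BIND,
	POLICY_INTERLEAVE
};

/* Derive the nodes_allowed bitmask from the task's memory policy. */
unsigned int nodes_allowed(enum policy_mode mode, unsigned int policy_mask,
			   int preferred_node, unsigned int online_mask)
{
	switch (mode) {
	case POLICY_PREFERRED:
		/* A mask holding only the single preferred node. */
		assert(preferred_node >= 0 && preferred_node < MAX_NODES);
		return 1u << preferred_node;
	case POLICY_BIND:
	case POLICY_INTERLEAVE:
		/* The policy's own nodemask is used as-is. */
		return policy_mask;
	case POLICY_DEFAULT:
	default:
		/* No policy: fall back to all online nodes. */
		return online_mask;
	}
}

/*
 * Grow or shrink the pool to target_total, touching only allowed nodes and
 * walking them round-robin, regardless of the policy mode that built the mask.
 */
void set_pool_size(unsigned long pages[MAX_NODES], unsigned int allowed,
		   unsigned long target_total)
{
	unsigned long total = 0;
	int node = 0;
	int idle = 0;	/* consecutive nodes visited without progress */

	for (int i = 0; i < MAX_NODES; i++)
		total += pages[i];

	/* Stop once a full pass over the nodes makes no progress. */
	while (total != target_total && idle < MAX_NODES) {
		int grow = total < target_total;

		if ((allowed & (1u << node)) && (grow || pages[node] > 0)) {
			if (grow) {
				pages[node]++;
				total++;
			} else {
				pages[node]--;
				total--;
			}
			idle = 0;
		} else {
			idle++;
		}
		node = (node + 1) % MAX_NODES;
	}
}
```

Replaying the message's sequence (32 pages under the default policy, then 40 with a bind mask of node 2, then 32 with a bind mask of nodes 0 and 1) leaves the per-node totals at 4, 4, 16 and 8 in this model.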
|
|
|
{
|
2005-04-16 15:20:36 -07:00
|
|
|
.procname = "nr_hugepages",
|
2008-07-23 21:27:42 -07:00
|
|
|
.data = NULL,
|
2005-04-16 15:20:36 -07:00
|
|
|
.maxlen = sizeof(unsigned long),
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = hugetlb_sysctl_handler,
|
2009-12-14 17:58:21 -08:00
|
|
|
},
|
|
|
|
#ifdef CONFIG_NUMA
|
|
|
|
{
|
|
|
|
.procname = "nr_hugepages_mempolicy",
|
|
|
|
.data = NULL,
|
|
|
|
.maxlen = sizeof(unsigned long),
|
|
|
|
.mode = 0644,
|
|
|
|
.proc_handler = &hugetlb_mempolicy_sysctl_handler,
|
|
|
|
},
|
|
|
|
#endif
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
|
|
|
.procname = "hugetlb_shm_group",
|
|
|
|
.data = &sysctl_hugetlb_shm_group,
|
|
|
|
.maxlen = sizeof(gid_t),
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
2007-07-17 04:03:13 -07:00
|
|
|
{
|
|
|
|
.procname = "hugepages_treat_as_movable",
|
|
|
|
.data = &hugepages_treat_as_movable,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2013-09-11 14:22:13 -07:00
|
|
|
.proc_handler = proc_dointvec,
|
2007-07-17 04:03:13 -07:00
|
|
|
},
|
hugetlb: introduce nr_overcommit_hugepages sysctl
While examining the code to support /proc/sys/vm/hugetlb_dynamic_pool, I
became convinced that having a boolean sysctl was insufficient:
1) To support per-node control of hugepages, I have previously submitted
patches to add a sysfs attribute related to nr_hugepages. However, with
a boolean global value and per-mount quota enforcement constraining the
dynamic pool, adding corresponding control of the dynamic pool on a
per-node basis seems inconsistent to me.
2) Administration of the hugetlb dynamic pool with multiple hugetlbfs
mount points is, arguably, more arduous than it needs to be. Each quota
would need to be set separately, and the sum would need to be monitored.
To ease the administration, and to help make the way for per-node
control of the static & dynamic hugepage pool, I added a separate
sysctl, nr_overcommit_hugepages. This value serves as a high watermark
for the overall hugepage pool, while nr_hugepages serves as a low
watermark. The boolean sysctl can then be removed, as the condition
nr_overcommit_hugepages > 0
indicates the same administrative setting as
hugetlb_dynamic_pool == 1
Quotas still serve as local enforcement of the size of the pool on a
per-mount basis.
A few caveats:
1) There is a race whereby the global surplus huge page counter is
incremented before a hugepage has been allocated. Another process could
then try to grow the pool, fail to convert a surplus huge page to a normal
huge page, and instead allocate a fresh huge page. I believe this is
benign, as no memory is leaked (the actual pages are still tracked
correctly) and the counters won't go out of sync.
2) Shrinking the static pool while a surplus is in effect will allow the
number of surplus huge pages to exceed the overcommit value. As long as
this condition holds, however, no more surplus huge pages will be
allowed on the system until one of the two sysctls is increased
sufficiently, or the surplus huge pages go out of use and are freed.
Successfully tested on x86_64 with the current libhugetlbfs snapshot,
modified to use the new sysctl.
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Acked-by: Adam Litke <agl@us.ibm.com>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-12-17 16:20:12 -08:00
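The overcommit rule and the second caveat from the commit message above can be illustrated with a small C model. It is a sketch under stated assumptions, not the hugetlb implementation: `struct hugepage_pool` and all three functions are hypothetical, and the model assumes that in-use pages dropped from the static pool are re-accounted as surplus pages.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical pool state (illustrative only). */
struct hugepage_pool {
	unsigned long nr_hugepages;	/* static pool size (low watermark) */
	unsigned long nr_overcommit;	/* limit on surplus huge pages */
	unsigned long surplus;		/* surplus huge pages currently held */
};

/* nr_overcommit_hugepages > 0 replaces the old boolean hugetlb_dynamic_pool. */
bool dynamic_pool_enabled(const struct hugepage_pool *p)
{
	return p->nr_overcommit > 0;
}

/* Allow a new surplus page only while the surplus is below the limit. */
bool try_alloc_surplus(struct hugepage_pool *p)
{
	if (p->surplus >= p->nr_overcommit)
		return false;
	p->surplus++;
	return true;
}

/*
 * Shrink the static pool by n pages that are still in use. In this model
 * they become surplus pages, so the surplus may exceed the limit (caveat 2).
 */
void shrink_static_in_use(struct hugepage_pool *p, unsigned long n)
{
	assert(n <= p->nr_hugepages);
	p->nr_hugepages -= n;
	p->surplus += n;
}
```

Once the surplus exceeds the limit, `try_alloc_surplus()` keeps failing until `nr_overcommit` is raised or the surplus count drops, which mirrors the behaviour the caveat describes.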
|
|
|
{
|
|
|
|
.procname = "nr_overcommit_hugepages",
|
2008-07-23 21:27:42 -07:00
|
|
|
.data = NULL,
|
|
|
|
.maxlen = sizeof(unsigned long),
|
2007-12-17 16:20:12 -08:00
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = hugetlb_overcommit_handler,
|
2007-12-17 16:20:12 -08:00
|
|
|
},
|
2005-04-16 15:20:36 -07:00
|
|
|
#endif
|
|
|
|
{
|
|
|
|
.procname = "lowmem_reserve_ratio",
|
|
|
|
.data = &sysctl_lowmem_reserve_ratio,
|
|
|
|
.maxlen = sizeof(sysctl_lowmem_reserve_ratio),
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = lowmem_reserve_ratio_sysctl_handler,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
2006-01-08 01:00:39 -08:00
|
|
|
{
|
|
|
|
.procname = "drop_caches",
|
|
|
|
.data = &sysctl_drop_caches,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
|
|
|
.proc_handler = drop_caches_sysctl_handler,
|
2011-03-23 16:43:09 -07:00
|
|
|
.extra1 = &one,
|
2014-04-03 14:48:19 -07:00
|
|
|
.extra2 = &four,
|
2006-01-08 01:00:39 -08:00
|
|
|
},
|
2010-05-24 14:32:28 -07:00
|
|
|
#ifdef CONFIG_COMPACTION
|
|
|
|
{
|
|
|
|
.procname = "compact_memory",
|
|
|
|
.data = &sysctl_compact_memory,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0200,
|
|
|
|
.proc_handler = sysctl_compaction_handler,
|
|
|
|
},
|
2010-05-24 14:32:31 -07:00
|
|
|
{
|
|
|
|
.procname = "extfrag_threshold",
|
|
|
|
.data = &sysctl_extfrag_threshold,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
|
|
|
.proc_handler = sysctl_extfrag_handler,
|
|
|
|
.extra1 = &min_extfrag_threshold,
|
|
|
|
.extra2 = &max_extfrag_threshold,
|
|
|
|
},
|
2015-04-15 16:13:20 -07:00
|
|
|
{
|
|
|
|
.procname = "compact_unevictable_allowed",
|
|
|
|
.data = &sysctl_compact_unevictable_allowed,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
|
|
|
.proc_handler = proc_dointvec,
|
|
|
|
.extra1 = &zero,
|
|
|
|
.extra2 = &one,
|
|
|
|
},
|
2010-05-24 14:32:31 -07:00
|
|
|
|
2010-05-24 14:32:28 -07:00
|
|
|
#endif /* CONFIG_COMPACTION */
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
|
|
|
.procname = "min_free_kbytes",
|
|
|
|
.data = &min_free_kbytes,
|
|
|
|
.maxlen = sizeof(min_free_kbytes),
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = min_free_kbytes_sysctl_handler,
|
2005-04-16 15:20:36 -07:00
|
|
|
.extra1 = &zero,
|
|
|
|
},
|
2011-09-01 15:26:50 -04:00
|
|
|
{
|
|
|
|
.procname = "extra_free_kbytes",
|
|
|
|
.data = &extra_free_kbytes,
|
|
|
|
.maxlen = sizeof(extra_free_kbytes),
|
|
|
|
.mode = 0644,
|
|
|
|
.proc_handler = min_free_kbytes_sysctl_handler,
|
|
|
|
.extra1 = &zero,
|
|
|
|
},
|
2006-01-08 01:00:40 -08:00
|
|
|
{
|
|
|
|
.procname = "percpu_pagelist_fraction",
|
|
|
|
.data = &percpu_pagelist_fraction,
|
|
|
|
.maxlen = sizeof(percpu_pagelist_fraction),
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = percpu_pagelist_fraction_sysctl_handler,
|
2014-06-23 13:22:04 -07:00
|
|
|
.extra1 = &zero,
|
2006-01-08 01:00:40 -08:00
|
|
|
},
|
2005-04-16 15:20:36 -07:00
|
|
|
#ifdef CONFIG_MMU
|
|
|
|
{
|
|
|
|
.procname = "max_map_count",
|
|
|
|
.data = &sysctl_max_map_count,
|
|
|
|
.maxlen = sizeof(sysctl_max_map_count),
|
|
|
|
.mode = 0644,
|
2009-12-17 15:27:05 -08:00
|
|
|
.proc_handler = proc_dointvec_minmax,
|
2009-12-14 17:59:52 -08:00
|
|
|
.extra1 = &zero,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
2009-01-08 12:04:47 +00:00
|
|
|
#else
|
|
|
|
{
|
|
|
|
.procname = "nr_trim_pages",
|
|
|
|
.data = &sysctl_nr_trim_pages,
|
|
|
|
.maxlen = sizeof(sysctl_nr_trim_pages),
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec_minmax,
|
2009-01-08 12:04:47 +00:00
|
|
|
.extra1 = &zero,
|
|
|
|
},
|
2005-04-16 15:20:36 -07:00
|
|
|
#endif
|
|
|
|
{
|
|
|
|
.procname = "laptop_mode",
|
|
|
|
.data = &laptop_mode,
|
|
|
|
.maxlen = sizeof(laptop_mode),
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec_jiffies,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "block_dump",
|
|
|
|
.data = &block_dump,
|
|
|
|
.maxlen = sizeof(block_dump),
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2005-04-16 15:20:36 -07:00
|
|
|
.extra1 = &zero,
|
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "vfs_cache_pressure",
|
|
|
|
.data = &sysctl_vfs_cache_pressure,
|
|
|
|
.maxlen = sizeof(sysctl_vfs_cache_pressure),
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2005-04-16 15:20:36 -07:00
|
|
|
.extra1 = &zero,
|
|
|
|
},
|
|
|
|
#ifdef HAVE_ARCH_PICK_MMAP_LAYOUT
|
|
|
|
{
|
|
|
|
.procname = "legacy_va_layout",
|
|
|
|
.data = &sysctl_legacy_va_layout,
|
|
|
|
.maxlen = sizeof(sysctl_legacy_va_layout),
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2005-04-16 15:20:36 -07:00
|
|
|
.extra1 = &zero,
|
|
|
|
},
|
|
|
|
#endif
|
2006-01-18 17:42:32 -08:00
|
|
|
#ifdef CONFIG_NUMA
|
|
|
|
{
|
|
|
|
.procname = "zone_reclaim_mode",
|
|
|
|
.data = &zone_reclaim_mode,
|
|
|
|
.maxlen = sizeof(zone_reclaim_mode),
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2006-02-01 03:05:29 -08:00
|
|
|
.extra1 = &zero,
|
2006-01-18 17:42:32 -08:00
|
|
|
},
|
2006-07-03 00:24:13 -07:00
|
|
|
{
|
|
|
|
.procname = "min_unmapped_ratio",
|
|
|
|
.data = &sysctl_min_unmapped_ratio,
|
|
|
|
.maxlen = sizeof(sysctl_min_unmapped_ratio),
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = sysctl_min_unmapped_ratio_sysctl_handler,
|
2006-07-03 00:24:13 -07:00
|
|
|
.extra1 = &zero,
|
|
|
|
.extra2 = &one_hundred,
|
|
|
|
},
|
2006-09-25 23:31:52 -07:00
|
|
|
{
|
|
|
|
.procname = "min_slab_ratio",
|
|
|
|
.data = &sysctl_min_slab_ratio,
|
|
|
|
.maxlen = sizeof(sysctl_min_slab_ratio),
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = sysctl_min_slab_ratio_sysctl_handler,
|
2006-09-25 23:31:52 -07:00
|
|
|
.extra1 = &zero,
|
|
|
|
.extra2 = &one_hundred,
|
|
|
|
},
|
[PATCH] vdso: randomize the i386 vDSO by moving it into a vma
Move the i386 VDSO down into a vma and thus randomize it.
Besides the security implications, this feature also helps debuggers, which
can COW a vma-backed VDSO just like a normal DSO and can thus do
single-stepping and other debugging features.
It's good for hypervisors (Xen, VMWare) too, which typically live in the same
high-mapped address space as the VDSO, hence whenever the VDSO is used, they
get lots of guest pagefaults and have to fix such guest accesses up - which
slows things down instead of speeding things up (the primary purpose of the
VDSO).
There's a new CONFIG_COMPAT_VDSO (default=y) option, which provides support
for older glibcs that still rely on a prelinked high-mapped VDSO. Newer
distributions (using glibc 2.3.3 or later) can turn this option off. Turning
it off is also recommended for security reasons: attackers cannot use the
predictable high-mapped VDSO page as syscall trampoline anymore.
There is a new vdso=[0|1] boot option as well, and a runtime
/proc/sys/vm/vdso_enabled sysctl switch, that allows the VDSO to be turned
on/off.
(This version of the VDSO-randomization patch also has working ELF
coredumping, the previous patch crashed in the coredumping code.)
This code is a combined work of the exec-shield VDSO randomization
code and Gerd Hoffmann's hypervisor-centric VDSO patch. Rusty Russell
started this patch and I completed it.
[akpm@osdl.org: cleanups]
[akpm@osdl.org: compile fix]
[akpm@osdl.org: compile fix 2]
[akpm@osdl.org: compile fix 3]
[akpm@osdl.org: revert MAXMEM change]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@infradead.org>
Cc: Gerd Hoffmann <kraxel@suse.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Zachary Amsden <zach@vmware.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 02:53:50 -07:00
|
|
|
#endif
|
2007-05-09 02:35:13 -07:00
|
|
|
#ifdef CONFIG_SMP
|
|
|
|
{
|
|
|
|
.procname = "stat_interval",
|
|
|
|
.data = &sysctl_stat_interval,
|
|
|
|
.maxlen = sizeof(sysctl_stat_interval),
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec_jiffies,
|
2007-05-09 02:35:13 -07:00
|
|
|
},
|
|
|
|
#endif
|
2009-12-15 19:27:45 +00:00
|
|
|
#ifdef CONFIG_MMU
|
2007-06-28 15:55:21 -04:00
|
|
|
{
|
|
|
|
.procname = "mmap_min_addr",
|
2009-07-31 12:54:11 -04:00
|
|
|
.data = &dac_mmap_min_addr,
|
|
|
|
.maxlen = sizeof(unsigned long),
|
2007-06-28 15:55:21 -04:00
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = mmap_min_addr_handler,
|
2007-06-28 15:55:21 -04:00
|
|
|
},
|
2009-12-15 19:27:45 +00:00
|
|
|
#endif
|
2007-07-15 23:38:01 -07:00
|
|
|
#ifdef CONFIG_NUMA
|
|
|
|
{
|
|
|
|
.procname = "numa_zonelist_order",
|
|
|
|
.data = &numa_zonelist_order,
|
|
|
|
.maxlen = NUMA_ZONELIST_ORDER_LEN,
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = numa_zonelist_order_handler,
|
2007-07-15 23:38:01 -07:00
|
|
|
},
|
|
|
|
#endif
|
2007-10-13 08:16:04 +01:00
|
|
|
#if (defined(CONFIG_X86_32) && !defined(CONFIG_UML))|| \
|
2007-03-01 10:07:42 +09:00
|
|
|
(defined(CONFIG_SUPERH) && defined(CONFIG_VSYSCALL))
|
2006-06-27 02:53:50 -07:00
|
|
|
{
|
|
|
|
.procname = "vdso_enabled",
|
2014-05-05 12:19:32 -07:00
|
|
|
#ifdef CONFIG_X86_32
|
|
|
|
.data = &vdso32_enabled,
|
|
|
|
.maxlen = sizeof(vdso32_enabled),
|
|
|
|
#else
|
2006-06-27 02:53:50 -07:00
|
|
|
.data = &vdso_enabled,
|
|
|
|
.maxlen = sizeof(vdso_enabled),
|
2014-05-05 12:19:32 -07:00
|
|
|
#endif
|
2006-06-27 02:53:50 -07:00
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2006-06-27 02:53:50 -07:00
|
|
|
.extra1 = &zero,
|
|
|
|
},
|
2005-04-16 15:20:36 -07:00
|
|
|
#endif
|
2008-02-04 22:29:20 -08:00
|
|
|
#ifdef CONFIG_HIGHMEM
|
|
|
|
{
|
|
|
|
.procname = "highmem_is_dirtyable",
|
|
|
|
.data = &vm_highmem_is_dirtyable,
|
|
|
|
.maxlen = sizeof(vm_highmem_is_dirtyable),
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec_minmax,
|
2008-02-04 22:29:20 -08:00
|
|
|
.extra1 = &zero,
|
|
|
|
.extra2 = &one,
|
|
|
|
},
|
|
|
|
#endif
|
2009-09-16 11:50:15 +02:00
|
|
|
#ifdef CONFIG_MEMORY_FAILURE
|
|
|
|
{
|
|
|
|
.procname = "memory_failure_early_kill",
|
|
|
|
.data = &sysctl_memory_failure_early_kill,
|
|
|
|
.maxlen = sizeof(sysctl_memory_failure_early_kill),
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec_minmax,
|
2009-09-16 11:50:15 +02:00
|
|
|
.extra1 = &zero,
|
|
|
|
.extra2 = &one,
|
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "memory_failure_recovery",
|
|
|
|
.data = &sysctl_memory_failure_recovery,
|
|
|
|
.maxlen = sizeof(sysctl_memory_failure_recovery),
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec_minmax,
|
2009-09-16 11:50:15 +02:00
|
|
|
.extra1 = &zero,
|
|
|
|
.extra2 = &one,
|
|
|
|
},
|
|
|
|
#endif
|
mm: limit growth of 3% hardcoded other user reserve
Add user_reserve_kbytes knob.
Limit the growth of the memory reserved for other user processes to
min(3% current process size, user_reserve_pages). Only about 8MB is
necessary to enable recovery in the default mode, and only a few hundred
MB are required even when overcommit is disabled.
user_reserve_pages defaults to min(3% free pages, 128MB)
I arrived at 128MB by taking the max VSZ of sshd, login, bash, and top ...
then adding the RSS of each.
This only affects OVERCOMMIT_NEVER mode.
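The limit described above can be sketched in userspace C. This is an illustration only, not the kernel's code: the function name `user_reserve_in_pages`, its parameters, and the use of a divide-by-32 as an approximation of 3% are assumptions made for the example.

```c
#include <assert.h>

/*
 * Hypothetical helper (not a kernel function): number of pages held back
 * for other user processes in overcommit 'never' mode.
 *
 * process_pages  - size of the current process, in pages
 * reserve_kbytes - value of the user_reserve_kbytes tunable, in kB
 * page_shift     - log2 of the page size (12 for 4 kB pages)
 */
static unsigned long user_reserve_in_pages(unsigned long process_pages,
                                           unsigned long reserve_kbytes,
                                           unsigned int page_shift)
{
    /* Pages are assumed to be at least 1 kB so the shift is valid. */
    assert(page_shift >= 10);

    /* Convert the tunable from kB to pages. */
    unsigned long cap_pages = reserve_kbytes >> (page_shift - 10);

    /* Roughly 3% of the process size (1/32 is about 3.1%). */
    unsigned long three_percent = process_pages / 32;

    /* The reserve grows with the process but never past the cap. */
    return three_percent < cap_pages ? three_percent : cap_pages;
}
```

With 4 kB pages and a 128MB cap, a 32GB process (8388608 pages) is charged the cap of 32768 pages rather than roughly 1GB, while a small process is still charged about 3% of its own size.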
Background
1. user reserve
__vm_enough_memory reserves a hardcoded 3% of the current process size for
other applications when overcommit is disabled. This was done so that a
user could recover if they launched a memory hogging process. Without the
reserve, a user would easily run into a message such as:
bash: fork: Cannot allocate memory
2. admin reserve
Additionally, a hardcoded 3% of free memory is reserved for root in both
overcommit 'guess' and 'never' modes. This was intended to prevent a
scenario where root cannot log in and perform recovery operations.
Note that this reserve shrinks, and doesn't guarantee a useful reserve.
Motivation
The two hardcoded memory reserves should be updated to account for current
memory sizes.
Also, the admin reserve would be more useful if it didn't shrink too much.
When the current code was originally written, 1GB was considered
"enterprise". Now the 3% reserve can grow to multiple GB on large memory
systems, and it only needs to be a few hundred MB at most to enable a user
or admin to recover a system with an unwanted memory hogging process.
I've found that reducing these reserves is especially beneficial for a
specific type of application load:
* single application system
* one or few processes (e.g. one per core)
* allocating all available memory
* not initializing every page immediately
* long running
I've run scientific clusters with this sort of load. A long running job
sometimes failed many hours (weeks of CPU time) into a calculation. They
weren't initializing all of their memory immediately, and they weren't
using calloc, so I put systems into overcommit 'never' mode. These
clusters run diskless and have no swap.
However, with the current reserves, a user wishing to allocate as much
memory as possible to one process may be prevented from using, for
example, almost 2GB out of 32GB.
The effect is less, but still significant when a user starts a job with
one process per core. I have repeatedly seen a set of processes
requesting the same amount of memory fail because one of them could not
allocate the amount of memory a user would expect to be able to allocate.
For example, Message Passing Interface (MPI) processes, one per core. And
it is similar for other parallel programming frameworks.
Changing this reserve code will make the overcommit never mode more useful
by allowing applications to allocate nearly all of the available memory.
Also, the new admin_reserve_kbytes will be safer than the current behavior
since the hardcoded 3% of available memory reserve can shrink to something
useless in the case where applications have grabbed all available memory.
Risks
* "bash: fork: Cannot allocate memory"
The downside of the first patch-- which creates a tunable user reserve
that is only used in overcommit 'never' mode--is that an admin can set
it so low that a user may not be able to kill their process, even if
they already have a shell prompt.
Of course, a user can get in the same predicament with the current 3%
reserve--they just have to launch processes until 3% becomes negligible.
* root-cant-log-in problem
The second patch, adding the tunable rootuser_reserve_pages, allows
the admin to shoot themselves in the foot by setting it too small. They
can easily get the system into a state where root-can't-log-in.
However, the new admin_reserve_kbytes will be safer than the current
behavior since the hardcoded 3% of available memory reserve can shrink
to something useless in the case where applications have grabbed all
available memory.
Alternatives
* Memory cgroups provide a more flexible way to limit application memory.
Not everyone wants to set up cgroups or deal with their overhead.
* We could create a fourth overcommit mode which provides smaller reserves.
The size of useful reserves may be drastically different depending
on whether the system is embedded or enterprise.
* Force users to initialize all of their memory or use calloc.
Some users don't want/expect the system to overcommit when they malloc.
Overcommit 'never' mode is for this scenario, and it should work well.
The new user and admin reserve tunables are simple to use, with low
overhead compared to cgroups. The patches preserve current behavior where
3% of memory is less than 128MB, except that the admin reserve doesn't
shrink to an unusable size under pressure. The code allows admins to tune
for embedded and enterprise usage.
FAQ
* How is the root-cant-login problem addressed?
What happens if admin_reserve_pages is set to 0?
Root is free to shoot themselves in the foot by setting
admin_reserve_kbytes too low.
On x86_64, the minimum useful reserve is:
8MB for overcommit 'guess'
128MB for overcommit 'never'
admin_reserve_pages defaults to min(3% free memory, 8MB)
So, anyone switching to 'never' mode needs to adjust
admin_reserve_pages.
* How do you calculate a minimum useful reserve?
A user or the admin needs enough memory to login and perform
recovery operations, which includes, at a minimum:
sshd or login + bash (or some other shell) + top (or ps, kill, etc.)
For overcommit 'guess', we can sum resident set sizes (RSS)
because we only need enough memory to handle what the recovery
programs will typically use. On x86_64 this is about 8MB.
For overcommit 'never', we can take the max of their virtual sizes (VSZ)
and add the sum of their RSS. We use VSZ instead of RSS because this mode
forces us to ensure we can fulfill all of the requested memory allocations--
even if the programs only use a fraction of what they ask for.
On x86_64 this is about 128MB.
When swap is enabled, reserves are useful even when they are as
small as 10MB, regardless of overcommit mode.
When both swap and overcommit are disabled, then the admin should
tune the reserves higher to be absolutely safe. Over 230MB each
was safest in my testing.
* What happens if user_reserve_pages is set to 0?
Note, this only affects overcommit 'never' mode.
Then a user will be able to allocate all available memory minus
admin_reserve_kbytes.
However, they will easily see a message such as:
"bash: fork: Cannot allocate memory"
And they won't be able to recover/kill their application.
The admin should be able to recover the system if
admin_reserve_kbytes is set appropriately.
* What's the difference between overcommit 'guess' and 'never'?
"Guess" allows an allocation if there are enough free + reclaimable
pages. It has a hardcoded 3% of free pages reserved for root.
"Never" allows an allocation if there is enough swap + a configurable
percentage (default is 50) of physical RAM. It has a hardcoded 3% of
free pages reserved for root, like "Guess" mode. It also has a
hardcoded 3% of the current process size reserved for additional
applications.
* Why is overcommit 'guess' not suitable even when an app eventually
writes to every page? It takes free pages, file pages, available
swap pages, reclaimable slab pages into consideration. In other words,
these are all pages available, then why isn't overcommit suitable?
Because it only looks at the present state of the system. It
does not take into account the memory that other applications have
malloced, but haven't initialized yet. It overcommits the system.
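The "minimum useful reserve" rule from the FAQ above can be written out as a small C sketch. The struct, the function names, and any footprint numbers fed to it are hypothetical; only the two rules (sum of RSS for 'guess', largest VSZ plus sum of RSS for 'never') come from the text.

```c
#include <stddef.h>

/* Hypothetical record for one recovery tool (sshd, bash, top, ...). */
struct proc_footprint {
    unsigned long vsz_kb; /* virtual size, kB */
    unsigned long rss_kb; /* resident set size, kB */
};

/* Overcommit 'guess': only what the tools typically touch, i.e. sum of RSS. */
static unsigned long min_reserve_guess_kb(const struct proc_footprint *p,
                                          size_t n)
{
    unsigned long sum_rss = 0;

    for (size_t i = 0; i < n; i++)
        sum_rss += p[i].rss_kb;
    return sum_rss;
}

/* Overcommit 'never': largest VSZ plus the sum of RSS, since every
 * requested allocation must be backed even if it is never used. */
static unsigned long min_reserve_never_kb(const struct proc_footprint *p,
                                          size_t n)
{
    unsigned long max_vsz = 0;

    for (size_t i = 0; i < n; i++)
        if (p[i].vsz_kb > max_vsz)
            max_vsz = p[i].vsz_kb;
    return max_vsz + min_reserve_guess_kb(p, n);
}
```

Measured values for the tools on a given system would replace any made-up inputs; the commit's own figures (about 8MB and 128MB on x86_64) came from such measurements.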
Test Summary
There was little change in behavior in the default overcommit 'guess'
mode with swap enabled before and after the patch. This was expected.
Systems run most predictably (i.e. no oom kills) in overcommit 'never'
mode with swap enabled. This also allowed the most memory to be allocated
to a user application.
Overcommit 'guess' mode without swap is a bad idea. It is easy to
crash the system. None of the other tested combinations crashed.
This matches my experience on the Roadrunner supercomputer.
Without the tunable user reserve, a system in overcommit 'never' mode
and without swap does not allow the user to recover, although the
admin can.
With the new tunable reserves, a system in overcommit 'never' mode
and without swap can be configured to:
1. maximize user-allocatable memory, running close to the edge of
recoverability
2. maximize recoverability, sacrificing allocatable memory to
ensure that a user cannot take down a system
Test Description
Fedora 18 VM - 4 x86_64 cores, 5725MB RAM, 4GB Swap
System is booted into multiuser console mode, with unnecessary services
turned off. Caches were dropped before each test.
Hogs are user memtester processes that attempt to allocate all free memory
as reported by /proc/meminfo
In overcommit 'never' mode, memory_ratio=100
Test Results
3.9.0-rc1-mm1
Overcommit | Swap | Hogs | MB Got/Wanted | OOMs | User Recovery | Admin Recovery
----------   ----   ----   -------------   ----   -------------   --------------
guess        yes    1      5432/5432       no     yes             yes
guess        yes    4      5444/5444       1      yes             yes
guess        no     1      5302/5449       no     yes             yes
guess        no     4      -               crash  no              no
never        yes    1      5460/5460       1      yes             yes
never        yes    4      5460/5460       1      yes             yes
never        no     1      5218/5432       no     no              yes
never        no     4      5203/5448       no     no              yes
3.9.0-rc1-mm1-tunablereserves
User and Admin Recovery show their respective reserves, if applicable.
Overcommit | Swap | Hogs | MB Got/Wanted | OOMs | User Recovery | Admin Recovery
----------   ----   ----   -------------   ----   -------------   --------------
guess        yes    1      5419/5419       no     - yes           8MB yes
guess        yes    4      5436/5436       1      - yes           8MB yes
guess        no     1      5440/5440       *      - yes           8MB yes
guess        no     4      -               crash  - no            8MB no
* process would successfully mlock, then the oom killer would pick it
never        yes    1      5446/5446       no     10MB yes        20MB yes
never        yes    4      5456/5456       no     10MB yes        20MB yes
never        no     1      5387/5429       no     128MB no        8MB barely
never        no     1      5323/5428       no     226MB barely    8MB barely
never        no     1      5323/5428       no     226MB barely    8MB barely
never        no     1      5359/5448       no     10MB no         10MB barely
never        no     1      5323/5428       no     0MB no          10MB barely
never        no     1      5332/5428       no     0MB no          50MB yes
never        no     1      5293/5429       no     0MB no          90MB yes
never        no     1      5001/5427       no     230MB yes       338MB yes
never        no     4*     4998/5424       no     230MB yes       338MB yes
* more memtesters were launched, able to allocate approximately another 100MB
Future Work
- Test larger memory systems.
- Test an embedded image.
- Test other architectures.
- Time malloc microbenchmarks.
- Would it be useful to be able to set overcommit policy for
each memory cgroup?
- Some lines are slightly above 80 chars.
Perhaps define a macro to convert between pages and kb?
Other places in the kernel do this.
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: make init_user_reserve() static]
Signed-off-by: Andrew Shewmaker <agshew@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:08:10 -07:00
|
|
|
{
|
|
|
|
.procname = "user_reserve_kbytes",
|
|
|
|
.data = &sysctl_user_reserve_kbytes,
|
|
|
|
.maxlen = sizeof(sysctl_user_reserve_kbytes),
|
|
|
|
.mode = 0644,
|
|
|
|
.proc_handler = proc_doulongvec_minmax,
|
|
|
|
},
|
2013-04-29 15:08:11 -07:00
|
|
|
{
|
|
|
|
.procname = "admin_reserve_kbytes",
|
|
|
|
.data = &sysctl_admin_reserve_kbytes,
|
|
|
|
.maxlen = sizeof(sysctl_admin_reserve_kbytes),
|
|
|
|
.mode = 0644,
|
|
|
|
.proc_handler = proc_doulongvec_minmax,
|
|
|
|
},
|
2016-01-12 09:18:57 -08:00
|
|
|
#ifdef CONFIG_HAVE_ARCH_MMAP_RND_BITS
|
|
|
|
{
|
|
|
|
.procname = "mmap_rnd_bits",
|
|
|
|
.data = &mmap_rnd_bits,
|
|
|
|
.maxlen = sizeof(mmap_rnd_bits),
|
|
|
|
.mode = 0600,
|
|
|
|
.proc_handler = proc_dointvec_minmax,
|
|
|
|
.extra1 = (void *)&mmap_rnd_bits_min,
|
|
|
|
.extra2 = (void *)&mmap_rnd_bits_max,
|
|
|
|
},
|
|
|
|
#endif
|
|
|
|
#ifdef CONFIG_HAVE_ARCH_MMAP_RND_COMPAT_BITS
|
|
|
|
{
|
|
|
|
.procname = "mmap_rnd_compat_bits",
|
|
|
|
.data = &mmap_rnd_compat_bits,
|
|
|
|
.maxlen = sizeof(mmap_rnd_compat_bits),
|
|
|
|
.mode = 0600,
|
|
|
|
.proc_handler = proc_dointvec_minmax,
|
|
|
|
.extra1 = (void *)&mmap_rnd_compat_bits_min,
|
|
|
|
.extra2 = (void *)&mmap_rnd_compat_bits_max,
|
|
|
|
},
|
2015-12-21 13:00:58 +05:30
|
|
|
#endif
|
|
|
|
#ifdef CONFIG_SWAP
|
|
|
|
{
|
|
|
|
.procname = "swap_ratio",
|
|
|
|
.data = &sysctl_swap_ratio,
|
|
|
|
.maxlen = sizeof(sysctl_swap_ratio),
|
|
|
|
.mode = 0644,
|
|
|
|
.proc_handler = proc_dointvec_minmax,
|
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "swap_ratio_enable",
|
|
|
|
.data = &sysctl_swap_ratio_enable,
|
|
|
|
.maxlen = sizeof(sysctl_swap_ratio_enable),
|
|
|
|
.mode = 0644,
|
|
|
|
.proc_handler = proc_dointvec_minmax,
|
|
|
|
},
|
2016-01-12 09:18:57 -08:00
|
|
|
#endif
|
2009-04-03 02:30:53 -07:00
|
|
|
{ }
|
2005-04-16 15:20:36 -07:00
|
|
|
};
|
|
|
|
|
2007-10-18 03:05:22 -07:00
|
|
|
static struct ctl_table fs_table[] = {
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
|
|
|
.procname = "inode-nr",
|
|
|
|
.data = &inodes_stat,
|
fs: bump inode and dentry counters to long
This series reworks our current object cache shrinking infrastructure in
two main ways:
* Noticing that a lot of users copy and paste their own version of LRU
lists for objects, we put some effort in providing a generic version.
It is modeled after the filesystem users: dentries, inodes, and xfs
(for various tasks), but we expect that other users could benefit in
the near future with little or no modification. Let us know if you
have any issues.
* The underlying list_lru being proposed automatically and
transparently keeps the elements in per-node lists, and is able to
manipulate the node lists individually. Given this infrastructure, we
are able to modify the up-to-now hammer called shrink_slab to proceed
with node-reclaim instead of always searching memory from all over like
it has been doing.
Per-node lru lists are also expected to lead to less contention in the lru
locks on multi-node scans, since we are now no longer fighting for a
global lock. The locks usually disappear from the profilers with this
change.
Although we have no official benchmarks for this version - be our guest to
independently evaluate this - earlier versions of this series were
performance tested (details at
http://permalink.gmane.org/gmane.linux.kernel.mm/100537) yielding no
visible performance regressions while yielding a better qualitative
behavior in NUMA machines.
With this infrastructure in place, we can use the list_lru entry point to
provide memcg isolation and per-memcg targeted reclaim. Historically,
those two pieces of work have been posted together. This version presents
only the infrastructure work, deferring the memcg work for a later time,
so we can focus on getting this part tested. You can see more about the
history of such work at http://lwn.net/Articles/552769/
Dave Chinner (18):
dcache: convert dentry_stat.nr_unused to per-cpu counters
dentry: move to per-sb LRU locks
dcache: remove dentries from LRU before putting on dispose list
mm: new shrinker API
shrinker: convert superblock shrinkers to new API
list: add a new LRU list type
inode: convert inode lru list to generic lru list code.
dcache: convert to use new lru list infrastructure
list_lru: per-node list infrastructure
shrinker: add node awareness
fs: convert inode and dentry shrinking to be node aware
xfs: convert buftarg LRU to generic code
xfs: rework buffer dispose list tracking
xfs: convert dquot cache lru to list_lru
fs: convert fs shrinkers to new scan/count API
drivers: convert shrinkers to new count/scan API
shrinker: convert remaining shrinkers to count/scan API
shrinker: Kill old ->shrink API.
Glauber Costa (7):
fs: bump inode and dentry counters to long
super: fix calculation of shrinkable objects for small numbers
list_lru: per-node API
vmscan: per-node deferred work
i915: bail out earlier when shrinker cannot acquire mutex
hugepage: convert huge zero page shrinker to new shrinker API
list_lru: dynamically adjust node arrays
This patch:
There are situations in very large machines in which we can have a large
quantity of dirty inodes, unused dentries, etc. This is particularly true
when unmounting a filesystem, since every live object will eventually be
discarded.
Dave Chinner reported a problem with this while experimenting with the
shrinker revamp patchset. So we believe it is time for a change. This
patch just moves int to longs. Machines where it matters should have a
big long anyway.
Signed-off-by: Glauber Costa <glommer@openvz.org>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-08-28 10:17:53 +10:00
|
|
|
.maxlen = 2*sizeof(long),
|
2005-04-16 15:20:36 -07:00
|
|
|
.mode = 0444,
|
2010-10-23 05:03:02 -04:00
|
|
|
.proc_handler = proc_nr_inodes,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "inode-state",
|
|
|
|
.data = &inodes_stat,
|
fs: bump inode and dentry counters to long
This series reworks our current object cache shrinking infrastructure in
two main ways:
* Noticing that a lot of users copy and paste their own version of LRU
lists for objects, we put some effort in providing a generic version.
It is modeled after the filesystem users: dentries, inodes, and xfs
(for various tasks), but we expect that other users could benefit in
the near future with little or no modification. Let us know if you
have any issues.
* The underlying list_lru being proposed automatically and
transparently keeps the elements in per-node lists, and is able to
manipulate the node lists individually. Given this infrastructure, we
are able to modify the up-to-now hammer called shrink_slab to proceed
with node-reclaim instead of always searching memory from all over like
it has been doing.
Per-node lru lists are also expected to lead to less contention in the lru
locks on multi-node scans, since we are now no longer fighting for a
global lock. The locks usually disappear from the profilers with this
change.
Although we have no official benchmarks for this version - be our guest to
independently evaluate this - earlier versions of this series were
performance tested (details at
http://permalink.gmane.org/gmane.linux.kernel.mm/100537) yielding no
visible performance regressions while yielding a better qualitative
behavior in NUMA machines.
With this infrastructure in place, we can use the list_lru entry point to
provide memcg isolation and per-memcg targeted reclaim. Historically,
those two pieces of work have been posted together. This version presents
only the infrastructure work, deferring the memcg work for a later time,
so we can focus on getting this part tested. You can see more about the
history of such work at http://lwn.net/Articles/552769/
Dave Chinner (18):
dcache: convert dentry_stat.nr_unused to per-cpu counters
dentry: move to per-sb LRU locks
dcache: remove dentries from LRU before putting on dispose list
mm: new shrinker API
shrinker: convert superblock shrinkers to new API
list: add a new LRU list type
inode: convert inode lru list to generic lru list code.
dcache: convert to use new lru list infrastructure
list_lru: per-node list infrastructure
shrinker: add node awareness
fs: convert inode and dentry shrinking to be node aware
xfs: convert buftarg LRU to generic code
xfs: rework buffer dispose list tracking
xfs: convert dquot cache lru to list_lru
fs: convert fs shrinkers to new scan/count API
drivers: convert shrinkers to new count/scan API
shrinker: convert remaining shrinkers to count/scan API
shrinker: Kill old ->shrink API.
Glauber Costa (7):
fs: bump inode and dentry counters to long
super: fix calculation of shrinkable objects for small numbers
list_lru: per-node API
vmscan: per-node deferred work
i915: bail out earlier when shrinker cannot acquire mutex
hugepage: convert huge zero page shrinker to new shrinker API
list_lru: dynamically adjust node arrays
This patch:
There are situations in very large machines in which we can have a large
quantity of dirty inodes, unused dentries, etc. This is particularly true
when umounting a filesystem, since every live object will eventually be
discarded.
Dave Chinner reported a problem with this while experimenting with the
shrinker revamp patchset. So we believe it is time for a change. This
patch just changes the counters from int to long. Machines where it
matters should have a big long anyway.
Signed-off-by: Glauber Costa <glommer@openvz.org>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-08-28 10:17:53 +10:00
|
|
|
.maxlen = 7*sizeof(long),
|
2005-04-16 15:20:36 -07:00
|
|
|
.mode = 0444,
|
2010-10-23 05:03:02 -04:00
|
|
|
.proc_handler = proc_nr_inodes,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "file-nr",
|
|
|
|
.data = &files_stat,
|
2010-10-26 14:22:44 -07:00
|
|
|
.maxlen = sizeof(files_stat),
|
2005-04-16 15:20:36 -07:00
|
|
|
.mode = 0444,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_nr_files,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "file-max",
|
|
|
|
.data = &files_stat.max_files,
|
2010-10-26 14:22:44 -07:00
|
|
|
.maxlen = sizeof(files_stat.max_files),
|
2005-04-16 15:20:36 -07:00
|
|
|
.mode = 0644,
|
2010-10-26 14:22:44 -07:00
|
|
|
.proc_handler = proc_doulongvec_minmax,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
2008-02-06 01:37:16 -08:00
|
|
|
{
|
|
|
|
.procname = "nr_open",
|
|
|
|
.data = &sysctl_nr_open,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec_minmax,
|
2008-05-10 10:08:32 -04:00
|
|
|
.extra1 = &sysctl_nr_open_min,
|
|
|
|
.extra2 = &sysctl_nr_open_max,
|
2008-02-06 01:37:16 -08:00
|
|
|
},
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
|
|
|
.procname = "dentry-state",
|
|
|
|
.data = &dentry_stat,
|
2013-08-28 10:17:53 +10:00
|
|
|
.maxlen = 6*sizeof(long),
|
2005-04-16 15:20:36 -07:00
|
|
|
.mode = 0444,
|
2010-10-10 05:36:23 -04:00
|
|
|
.proc_handler = proc_nr_dentry,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "overflowuid",
|
|
|
|
.data = &fs_overflowuid,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec_minmax,
|
2005-04-16 15:20:36 -07:00
|
|
|
.extra1 = &minolduid,
|
|
|
|
.extra2 = &maxolduid,
|
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "overflowgid",
|
|
|
|
.data = &fs_overflowgid,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec_minmax,
|
2005-04-16 15:20:36 -07:00
|
|
|
.extra1 = &minolduid,
|
|
|
|
.extra2 = &maxolduid,
|
|
|
|
},
|
2008-08-06 15:12:22 +02:00
|
|
|
#ifdef CONFIG_FILE_LOCKING
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
|
|
|
.procname = "leases-enable",
|
|
|
|
.data = &leases_enable,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
2008-08-06 15:12:22 +02:00
|
|
|
#endif
|
2005-04-16 15:20:36 -07:00
|
|
|
#ifdef CONFIG_DNOTIFY
|
|
|
|
{
|
|
|
|
.procname = "dir-notify-enable",
|
|
|
|
.data = &dir_notify_enable,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
|
|
|
#endif
|
|
|
|
#ifdef CONFIG_MMU
|
2008-08-06 15:12:22 +02:00
|
|
|
#ifdef CONFIG_FILE_LOCKING
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
|
|
|
.procname = "lease-break-time",
|
|
|
|
.data = &lease_break_time,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_dointvec,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
2008-08-06 15:12:22 +02:00
|
|
|
#endif
|
2008-10-15 22:05:12 -07:00
|
|
|
#ifdef CONFIG_AIO
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
|
|
|
.procname = "aio-nr",
|
|
|
|
.data = &aio_nr,
|
|
|
|
.maxlen = sizeof(aio_nr),
|
|
|
|
.mode = 0444,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_doulongvec_minmax,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "aio-max-nr",
|
|
|
|
.data = &aio_max_nr,
|
|
|
|
.maxlen = sizeof(aio_max_nr),
|
|
|
|
.mode = 0644,
|
2009-11-16 03:11:48 -08:00
|
|
|
.proc_handler = proc_doulongvec_minmax,
|
2005-04-16 15:20:36 -07:00
|
|
|
},
|
2008-10-15 22:05:12 -07:00
|
|
|
#endif /* CONFIG_AIO */
|
2006-06-01 13:10:59 -07:00
|
|
|
#ifdef CONFIG_INOTIFY_USER
|
2005-07-13 12:38:18 -04:00
|
|
|
{
|
|
|
|
.procname = "inotify",
|
|
|
|
.mode = 0555,
|
|
|
|
.child = inotify_table,
|
|
|
|
},
|
|
|
|
#endif
|
epoll: introduce resource usage limits
It has been thought that the per-user file descriptors limit would also
limit the resources that a normal user can request via the epoll
interface. Vegard Nossum reported a very simple program (a modified
version attached) that lets a normal user request a pretty large
amount of kernel memory while staying well within its maximum number of
fds. To solve this problem, default limits are now imposed, and /proc based
configuration has been introduced. A new directory has been created,
named /proc/sys/fs/epoll/ and inside there, there are two configuration
points:
max_user_instances = Maximum number of devices - per user
max_user_watches = Maximum number of "watched" fds - per user
The current default for "max_user_watches" limits the memory used by epoll
to store "watches" to 1/32 of the amount of low RAM. As an example, a
256MB 32-bit machine will have "max_user_watches" set to roughly 90000.
That should be enough to not break existing heavy epoll users. The
default value for "max_user_instances" is set to 128, which should be
enough too.
This also changes the userspace, because a new error code can now come out
from EPOLL_CTL_ADD (-ENOSPC). The EMFILE from epoll_create() was already
listed, so that should be ok.
[akpm@linux-foundation.org: use get_current_user()]
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: <stable@kernel.org>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Reported-by: Vegard Nossum <vegardno@ifi.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-12-01 13:13:55 -08:00
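The sizing rule in the message above (watch memory capped at 1/32 of low RAM) can be sketched as plain arithmetic. This is an illustration only: the function name is made up, and per_watch_bytes is an assumed per-watch bookkeeping cost that the message does not state.

```c
#include <stdint.h>

/*
 * Illustration only: estimate a default "max_user_watches" from the amount
 * of low RAM, following the 1/32 rule quoted in the commit message.
 * per_watch_bytes is an ASSUMED cost per watch; the real figure depends on
 * kernel structure sizes and is not given in the message.
 */
static uint64_t estimate_max_user_watches(uint64_t low_ram_bytes,
					  uint64_t per_watch_bytes)
{
	if (per_watch_bytes == 0)
		return 0;	/* avoid dividing by zero on bad input */

	/* Budget 1/32 of low RAM, then count how many watches fit in it. */
	return (low_ram_bytes / 32) / per_watch_bytes;
}
```

With 256MB of low RAM and an assumed 90 bytes per watch this yields 93206, which is in line with the "roughly 90000" figure quoted above.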
|
|
|
#ifdef CONFIG_EPOLL
|
|
|
|
{
|
|
|
|
.procname = "epoll",
|
|
|
|
.mode = 0555,
|
|
|
|
.child = epoll_table,
|
|
|
|
},
|
|
|
|
#endif
|
2005-04-16 15:20:36 -07:00
|
|
|
#endif
|
fs: add link restrictions
This adds symlink and hardlink restrictions to the Linux VFS.
Symlinks:
A long-standing class of security issues is the symlink-based
time-of-check-time-of-use race, most commonly seen in world-writable
directories like /tmp. The common method of exploitation of this flaw
is to cross privilege boundaries when following a given symlink (i.e. a
root process follows a symlink belonging to another user). For a likely
incomplete list of hundreds of examples across the years, please see:
http://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=/tmp
The solution is to permit symlinks to only be followed when outside
a sticky world-writable directory, or when the uid of the symlink and
follower match, or when the directory owner matches the symlink's owner.
Some pointers to the history of earlier discussion that I could find:
1996 Aug, Zygo Blaxell
http://marc.info/?l=bugtraq&m=87602167419830&w=2
1996 Oct, Andrew Tridgell
http://lkml.indiana.edu/hypermail/linux/kernel/9610.2/0086.html
1997 Dec, Albert D Cahalan
http://lkml.org/lkml/1997/12/16/4
2005 Feb, Lorenzo Hernández García-Hierro
http://lkml.indiana.edu/hypermail/linux/kernel/0502.0/1896.html
2010 May, Kees Cook
https://lkml.org/lkml/2010/5/30/144
Past objections and rebuttals could be summarized as:
- Violates POSIX.
- POSIX didn't consider this situation and it's not useful to follow
a broken specification at the cost of security.
- Might break unknown applications that use this feature.
- Applications that break because of the change are easy to spot and
fix. Applications that are vulnerable to symlink ToCToU by not having
the change aren't. Additionally, no applications have yet been found
that rely on this behavior.
- Applications should just use mkstemp() or O_CREAT|O_EXCL.
- True, but applications are not perfect, and new software is written
all the time that makes these mistakes; blocking this flaw at the
kernel is a single solution to the entire class of vulnerability.
- This should live in the core VFS.
- This should live in an LSM. (https://lkml.org/lkml/2010/5/31/135)
- This should live in an LSM.
- This should live in the core VFS. (https://lkml.org/lkml/2010/8/2/188)
Hardlinks:
On systems that have user-writable directories on the same partition
as system files, a long-standing class of security issues is the
hardlink-based time-of-check-time-of-use race, most commonly seen in
world-writable directories like /tmp. The common method of exploitation
of this flaw is to cross privilege boundaries when following a given
hardlink (i.e. a root process follows a hardlink created by another
user). Additionally, an issue exists where users can "pin" a potentially
vulnerable setuid/setgid file so that an administrator will not actually
upgrade a system fully.
The solution is to permit hardlinks to only be created when the user is
already the existing file's owner, or if they already have read/write
access to the existing file.
Many Linux users are surprised when they learn they can link to files
they have no access to, so this change appears to follow the doctrine
of "least surprise". Additionally, this change does not violate POSIX,
which states "the implementation may require that the calling process
has permission to access the existing file"[1].
This change is known to break some implementations of the "at" daemon,
though the version used by Fedora and Ubuntu has been fixed[2] for
a while. Otherwise, the change has been undisruptive while in use in
Ubuntu for the last 1.5 years.
[1] http://pubs.opengroup.org/onlinepubs/9699919799/functions/linkat.html
[2] http://anonscm.debian.org/gitweb/?p=collab-maint/at.git;a=commitdiff;h=f4114656c3a6c6f6070e315ffdf940a49eda3279
This patch is based on the patches in Openwall and grsecurity, along with
suggestions from Al Viro. I have added a sysctl to enable the protected
behavior, and documentation.
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-25 17:29:07 -07:00
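The two rules stated in the message above can be written down as small predicates. This is a userspace sketch, not the VFS code: the helper names and the MODEL_* constants are hypothetical (the constants mirror the usual octal values of the sticky and other-write mode bits), and the sysctl on/off switch is not modelled.

```c
#include <stdbool.h>

/* Assumed mode bits, mirroring the conventional S_ISVTX and S_IWOTH values. */
#define MODEL_STICKY		01000u
#define MODEL_WORLD_WRITABLE	00002u

/* Hypothetical helper: may a process with fsuid follow this symlink? */
static bool may_follow_symlink(unsigned int fsuid, unsigned int link_uid,
			       unsigned int dir_mode, unsigned int dir_uid)
{
	unsigned int both = MODEL_STICKY | MODEL_WORLD_WRITABLE;

	/* The restriction only applies in sticky world-writable directories. */
	if ((dir_mode & both) != both)
		return true;
	/* Follower owns the symlink. */
	if (fsuid == link_uid)
		return true;
	/* Directory owner owns the symlink. */
	if (dir_uid == link_uid)
		return true;
	return false;
}

/* Hypothetical helper: may a process hardlink to an existing file? */
static bool may_create_hardlink(unsigned int fsuid, unsigned int file_uid,
				bool has_read_write_access)
{
	return fsuid == file_uid || has_read_write_access;
}
```

For example, root (uid 0) following a symlink owned by uid 1000 inside a root-owned directory with mode 01777 is refused, while the same symlink in a mode 0755 directory is followed as before.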
|
|
|
{
|
|
|
|
.procname = "protected_symlinks",
|
|
|
|
.data = &sysctl_protected_symlinks,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0600,
|
|
|
|
.proc_handler = proc_dointvec_minmax,
|
|
|
|
.extra1 = &zero,
|
|
|
|
.extra2 = &one,
|
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "protected_hardlinks",
|
|
|
|
.data = &sysctl_protected_hardlinks,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0600,
|
|
|
|
.proc_handler = proc_dointvec_minmax,
|
|
|
|
.extra1 = &zero,
|
|
|
|
.extra2 = &one,
|
|
|
|
},
|
2005-06-23 00:09:43 -07:00
|
|
|
{
|
|
|
|
.procname = "suid_dumpable",
|
|
|
|
.data = &suid_dumpable,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2012-07-30 14:39:18 -07:00
|
|
|
.proc_handler = proc_dointvec_minmax_coredump,
|
2009-04-02 16:58:33 -07:00
|
|
|
.extra1 = &zero,
|
|
|
|
.extra2 = &two,
|
2005-06-23 00:09:43 -07:00
|
|
|
},
|
2007-02-14 00:34:07 -08:00
|
|
|
#if defined(CONFIG_BINFMT_MISC) || defined(CONFIG_BINFMT_MISC_MODULE)
|
|
|
|
{
|
|
|
|
.procname = "binfmt_misc",
|
|
|
|
.mode = 0555,
|
2015-05-09 22:09:14 -05:00
|
|
|
.child = sysctl_mount_point,
|
2007-02-14 00:34:07 -08:00
|
|
|
},
|
|
|
|
#endif
|
2010-05-19 21:03:16 +02:00
|
|
|
{
|
2010-06-03 14:54:39 +02:00
|
|
|
.procname = "pipe-max-size",
|
|
|
|
.data = &pipe_max_size,
|
2010-05-19 21:03:16 +02:00
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2010-06-03 14:54:39 +02:00
|
|
|
.proc_handler = &pipe_proc_fn,
|
|
|
|
.extra1 = &pipe_min_size,
|
2010-05-19 21:03:16 +02:00
|
|
|
},
|
2016-01-18 16:36:09 +01:00
|
|
|
{
|
|
|
|
.procname = "pipe-user-pages-hard",
|
|
|
|
.data = &pipe_user_pages_hard,
|
|
|
|
.maxlen = sizeof(pipe_user_pages_hard),
|
|
|
|
.mode = 0644,
|
|
|
|
.proc_handler = proc_doulongvec_minmax,
|
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "pipe-user-pages-soft",
|
|
|
|
.data = &pipe_user_pages_soft,
|
|
|
|
.maxlen = sizeof(pipe_user_pages_soft),
|
|
|
|
.mode = 0644,
|
|
|
|
.proc_handler = proc_doulongvec_minmax,
|
|
|
|
},
|
2009-04-03 02:30:53 -07:00
|
|
|
{ }
|
2005-04-16 15:20:36 -07:00
|
|
|
};
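Several entries in the table above pair proc_dointvec_minmax with .extra1 and .extra2 bounds (for example protected_symlinks with &zero and &one). A minimal userspace sketch of that bounds check follows. struct mini_ctl and mini_write_minmax() are hypothetical stand-ins, not kernel API, and the sketch covers a single int only.

```c
#include <errno.h>
#include <stddef.h>

/* Simplified stand-in for the struct ctl_table fields used by the check. */
struct mini_ctl {
	int *data;		/* the tunable itself, like .data */
	const int *min;		/* lower bound, like .extra1 (may be NULL) */
	const int *max;		/* upper bound, like .extra2 (may be NULL) */
};

/* Store new_val only if it lies within [min, max]; otherwise reject it. */
static int mini_write_minmax(const struct mini_ctl *ctl, int new_val)
{
	if (ctl->min && new_val < *ctl->min)
		return -EINVAL;
	if (ctl->max && new_val > *ctl->max)
		return -EINVAL;

	*ctl->data = new_val;
	return 0;
}
```

An out-of-range write leaves the stored value untouched, so a table entry bounded by zero and one can only ever hold 0 or 1.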
|
|
|
|
|
2007-10-18 03:05:22 -07:00
|
|
|
static struct ctl_table debug_table[] = {
|
2012-10-08 16:28:16 -07:00
|
|
|
#ifdef CONFIG_SYSCTL_EXCEPTION_TRACE
|
2007-07-22 11:12:28 +02:00
|
|
|
{
|
|
|
|
.procname = "exception-trace",
|
|
|
|
.data = &show_unhandled_signals,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
|
|
|
.proc_handler = proc_dointvec
|
|
|
|
},
|
2010-02-25 08:34:15 -05:00
|
|
|
#endif
|
|
|
|
#if defined(CONFIG_OPTPROBES)
|
|
|
|
{
|
|
|
|
.procname = "kprobes-optimization",
|
|
|
|
.data = &sysctl_kprobes_optimization,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
|
|
|
.proc_handler = proc_kprobes_optimization_handler,
|
|
|
|
.extra1 = &zero,
|
|
|
|
.extra2 = &one,
|
|
|
|
},
|
2007-07-22 11:12:28 +02:00
|
|
|
#endif
|
2009-04-03 02:30:53 -07:00
|
|
|
{ }
|
2005-04-16 15:20:36 -07:00
|
|
|
};
|
|
|
|
|
2007-10-18 03:05:22 -07:00
|
|
|
static struct ctl_table dev_table[] = {
|
2009-04-03 02:30:53 -07:00
|
|
|
{ }
|
[PATCH] inotify
inotify is intended to correct the deficiencies of dnotify, particularly
its inability to scale and its terrible user interface:
* dnotify requires the opening of one fd per each directory
that you intend to watch. This quickly results in too many
open files and pins removable media, preventing unmount.
* dnotify is directory-based. You only learn about changes to
directories. Sure, a change to a file in a directory affects
the directory, but you are then forced to keep a cache of
stat structures.
* dnotify's interface to user-space is awful. Signals?
inotify provides a more usable, simple, powerful solution to file change
notification:
* inotify's interface is a system call that returns a fd, not SIGIO.
You get a single fd, which is select()-able.
* inotify has an event that says "the filesystem that the item
you were watching is on was unmounted."
* inotify can watch directories or files.
Inotify is currently used by Beagle (a desktop search infrastructure),
Gamin (a FAM replacement), and other projects.
See Documentation/filesystems/inotify.txt.
Signed-off-by: Robert Love <rml@novell.com>
Cc: John McCutchan <ttb@tentacle.dhs.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-12 17:06:03 -04:00
|
|
|
};
|
2005-04-16 15:20:36 -07:00
|
|
|
|
2012-01-06 03:34:20 -08:00
|
|
|
int __init sysctl_init(void)
|
2007-02-14 00:34:13 -08:00
|
|
|
{
|
2012-07-30 14:42:48 -07:00
|
|
|
struct ctl_table_header *hdr;
|
|
|
|
|
|
|
|
hdr = register_sysctl_table(sysctl_base_table);
|
|
|
|
kmemleak_not_leak(hdr);
|
2007-02-14 00:34:13 -08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2006-09-27 01:51:04 -07:00
|
|
|
#endif /* CONFIG_SYSCTL */
|
|
|
|
|
2005-04-16 15:20:36 -07:00
|
|
|
/*
|
|
|
|
* /proc/sys support
|
|
|
|
*/
|
|
|
|
|
2006-09-27 01:51:04 -07:00
|
|
|
#ifdef CONFIG_PROC_SYSCTL
|
2005-04-16 15:20:36 -07:00
|
|
|
|
2014-06-06 14:37:17 -07:00
|
|
|
static int _proc_do_string(char *data, int maxlen, int write,
|
|
|
|
char __user *buffer,
|
2006-10-02 02:18:05 -07:00
|
|
|
size_t *lenp, loff_t *ppos)
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
|
|
|
size_t len;
|
|
|
|
char __user *p;
|
|
|
|
char c;
|
2007-02-10 01:46:38 -08:00
|
|
|
|
|
|
|
if (!data || !maxlen || !*lenp) {
|
2005-04-16 15:20:36 -07:00
|
|
|
*lenp = 0;
|
|
|
|
return 0;
|
|
|
|
}
|
2007-02-10 01:46:38 -08:00
|
|
|
|
2005-04-16 15:20:36 -07:00
|
|
|
if (write) {
|
sysctl: allow for strict write position handling
When writing to a sysctl string, each write, regardless of VFS position,
begins writing the string from the start. This means the contents of
the last write to the sysctl controls the string contents instead of the
first:
open("/proc/sys/kernel/modprobe", O_WRONLY) = 1
write(1, "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"..., 4096) = 4096
write(1, "/bin/true", 9) = 9
close(1) = 0
$ cat /proc/sys/kernel/modprobe
/bin/true
Expected behaviour would be to have the sysctl be "AAAA..." capped at
maxlen (in this case KMOD_PATH_LEN: 256), instead of truncating to the
contents of the second write. Similarly, multiple short writes would
not append to the sysctl.
The old behavior is unlike regular POSIX files enough that doing audits
of software that interact with sysctls can end up in unexpected or
dangerous situations. For example, "as long as the input starts with a
trusted path" turns out to be an insufficient filter, as what must also
happen is for the input to be entirely contained in a single write
syscall -- not a common consideration, especially for high level tools.
This provides kernel.sysctl_writes_strict as a way to make this behavior
act in a less surprising manner for strings, and disallows non-zero file
position when writing numeric sysctls (similar to what is already done
when reading from non-zero file positions). For now, the default (0) is
to warn about non-zero file position use, but retain the legacy
behavior. Setting this to -1 disables the warning, and setting this to
1 enables the file position respecting behavior.
[akpm@linux-foundation.org: fix build]
[akpm@linux-foundation.org: move misplaced hunk, per Randy]
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06 14:37:19 -07:00
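The difference between the legacy and the strict write behaviour described above can be modelled in userspace. This is a sketch under assumptions: model_sysctl_write() is a hypothetical name, plain memory stands in for the user buffer, and the return value (bytes stored) is a convenience of the model rather than what the kernel function returns.

```c
#include <stddef.h>
#include <string.h>

/*
 * Userspace model of the string write path. data/maxlen play the role of
 * the sysctl's kernel buffer and *pos models the file position. A non-zero
 * strict selects the position-respecting behaviour.
 * Returns the number of bytes stored by this call.
 */
static size_t model_sysctl_write(char *data, size_t maxlen,
				 const char *buf, size_t count,
				 size_t *pos, int strict)
{
	size_t len = 0, start, i = 0;

	if (maxlen == 0)
		return 0;

	if (strict) {
		len = strlen(data);
		if (len > maxlen - 1)
			len = maxlen - 1;
		/* A write that starts past the end of the string is ignored. */
		if (*pos > len)
			return 0;
		len = *pos;
	}
	/* In legacy mode len stays 0: every write restarts the string. */
	start = len;

	*pos += count;
	while (i < count && len < maxlen - 1) {
		char c = buf[i++];

		if (c == '\0' || c == '\n')
			break;
		data[len++] = c;
	}
	data[len] = '\0';
	return len - start;
}
```

Writing "AAAA" and then "/bin/true" through the same position leaves "/bin/true" in legacy mode and "AAAA/bin/true" in strict mode. If the first write overflows the buffer, the second strict write starts past the end of the string and is dropped.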
|
|
|
if (sysctl_writes_strict == SYSCTL_WRITES_STRICT) {
|
|
|
|
/* Only continue writes not past the end of buffer. */
|
|
|
|
len = strlen(data);
|
|
|
|
if (len > maxlen - 1)
|
|
|
|
len = maxlen - 1;
|
|
|
|
|
|
|
|
if (*ppos > len)
|
|
|
|
return 0;
|
|
|
|
len = *ppos;
|
|
|
|
} else {
|
|
|
|
/* Start writing from beginning of buffer. */
|
|
|
|
len = 0;
|
|
|
|
}
|
|
|
|
|
2014-06-06 14:37:18 -07:00
|
|
|
*ppos += *lenp;
|
2005-04-16 15:20:36 -07:00
|
|
|
p = buffer;
|
2014-06-06 14:37:18 -07:00
|
|
|
while ((p - buffer) < *lenp && len < maxlen - 1) {
|
2005-04-16 15:20:36 -07:00
|
|
|
if (get_user(c, p++))
|
|
|
|
return -EFAULT;
|
|
|
|
if (c == 0 || c == '\n')
|
|
|
|
break;
|
2014-06-06 14:37:18 -07:00
|
|
|
data[len++] = c;
|
2005-04-16 15:20:36 -07:00
|
|
|
}
|
2014-06-06 14:37:17 -07:00
|
|
|
data[len] = 0;
|
2005-04-16 15:20:36 -07:00
|
|
|
} else {
|
2006-10-02 02:18:04 -07:00
|
|
|
len = strlen(data);
|
|
|
|
if (len > maxlen)
|
|
|
|
len = maxlen;
|
2007-02-10 01:46:38 -08:00
|
|
|
|
|
|
|
if (*ppos > len) {
|
|
|
|
*lenp = 0;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
data += *ppos;
|
|
|
|
len -= *ppos;
|
|
|
|
|
2005-04-16 15:20:36 -07:00
|
|
|
if (len > *lenp)
|
|
|
|
len = *lenp;
|
|
|
|
if (len)
|
2014-06-06 14:37:17 -07:00
|
|
|
if (copy_to_user(buffer, data, len))
|
2005-04-16 15:20:36 -07:00
|
|
|
return -EFAULT;
|
|
|
|
if (len < *lenp) {
|
2014-06-06 14:37:17 -07:00
|
|
|
if (put_user('\n', buffer + len))
|
2005-04-16 15:20:36 -07:00
|
|
|
return -EFAULT;
|
|
|
|
len++;
|
|
|
|
}
|
|
|
|
*lenp = len;
|
|
|
|
*ppos += len;
|
|
|
|
}
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2014-06-06 14:37:19 -07:00
|
|
|
static void warn_sysctl_write(struct ctl_table *table)
|
|
|
|
{
|
|
|
|
pr_warn_once("%s wrote to %s when file position was not 0!\n"
|
|
|
|
"This will not be supported in the future. To silence this\n"
|
|
|
|
"warning, set kernel.sysctl_writes_strict = -1\n",
|
|
|
|
current->comm, table->procname);
|
|
|
|
}
|
|
|
|
|
2006-10-02 02:18:04 -07:00
|
|
|
/**
|
|
|
|
* proc_dostring - read a string sysctl
|
|
|
|
* @table: the sysctl table
|
|
|
|
* @write: %TRUE if this is a write to the sysctl file
|
|
|
|
* @buffer: the user buffer
|
|
|
|
* @lenp: the size of the user buffer
|
|
|
|
* @ppos: file position
|
|
|
|
*
|
|
|
|
* Reads/writes a string from/to the user buffer. If the kernel
|
|
|
|
* buffer provided is not large enough to hold the string, the
|
|
|
|
* string is truncated. The copied string is %NULL-terminated.
|
|
|
|
* If the string is being read by the user process, it is copied
|
|
|
|
* and a newline '\n' is added. It is truncated if the buffer is
|
|
|
|
* not large enough.
|
|
|
|
*
|
|
|
|
* Returns 0 on success.
|
|
|
|
*/
|
2009-09-23 15:57:19 -07:00
|
|
|
int proc_dostring(struct ctl_table *table, int write,
|
2006-10-02 02:18:04 -07:00
|
|
|
void __user *buffer, size_t *lenp, loff_t *ppos)
|
|
|
|
{
|
2014-06-06 14:37:19 -07:00
|
|
|
if (write && *ppos && sysctl_writes_strict == SYSCTL_WRITES_WARN)
|
|
|
|
warn_sysctl_write(table);
|
|
|
|
|
2014-06-06 14:37:17 -07:00
|
|
|
return _proc_do_string((char *)(table->data), table->maxlen, write,
|
|
|
|
(char __user *)buffer, lenp, ppos);
|
2006-10-02 02:18:04 -07:00
|
|
|
}
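The read side documented above (copy the string, append a newline if there is room, truncate otherwise) can be modelled the same way. model_sysctl_read() is a hypothetical userspace stand-in: memcpy() replaces copy_to_user(), and the function returns the byte count instead of updating a length pointer.

```c
#include <stddef.h>
#include <string.h>

/*
 * Userspace model of reading a string sysctl. data/maxlen describe the
 * stored string, out/out_size the reader's buffer, *pos the file position.
 * Returns the number of bytes placed in out.
 */
static size_t model_sysctl_read(const char *data, size_t maxlen,
				char *out, size_t out_size, size_t *pos)
{
	size_t len = strlen(data);

	if (len > maxlen)
		len = maxlen;
	if (*pos > len)
		return 0;		/* already read past the string */

	data += *pos;
	len -= *pos;
	if (len > out_size)
		len = out_size;		/* truncate to the reader's buffer */
	memcpy(out, data, len);
	if (len < out_size)
		out[len++] = '\n';	/* room left: finish with a newline */
	*pos += len;
	return len;
}
```

Reading "hello" into a 16-byte buffer returns the 6 bytes "hello\n", and a second read at the advanced position returns 0. With a 3-byte buffer the string arrives as "hel" followed by "lo\n".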
|
|
|
|
|
2010-05-05 00:26:45 +00:00
|
|
|
static size_t proc_skip_spaces(char **buf)
|
|
|
|
{
|
|
|
|
size_t ret;
|
|
|
|
char *tmp = skip_spaces(*buf);
|
|
|
|
ret = tmp - *buf;
|
|
|
|
*buf = tmp;
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2010-05-05 00:26:55 +00:00
|
|
|
static void proc_skip_char(char **buf, size_t *size, const char v)
|
|
|
|
{
|
|
|
|
while (*size) {
|
|
|
|
if (**buf != v)
|
|
|
|
break;
|
|
|
|
(*size)--;
|
|
|
|
(*buf)++;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2010-05-05 00:26:45 +00:00
|
|
|
#define TMPBUFLEN 22
|
|
|
|
/**
|
2010-05-21 11:29:53 -07:00
|
|
|
* proc_get_long - reads an ASCII formatted integer from a user buffer
|
2010-05-05 00:26:45 +00:00
|
|
|
*
|
2010-05-21 11:29:53 -07:00
|
|
|
* @buf: a kernel buffer
|
|
|
|
* @size: size of the kernel buffer
|
|
|
|
* @val: this is where the number will be stored
|
|
|
|
* @neg: set to %TRUE if number is negative
|
|
|
|
* @perm_tr: a vector which contains the allowed trailers
|
|
|
|
* @perm_tr_len: size of the perm_tr vector
|
|
|
|
* @tr: pointer to store the trailer character
|
2010-05-05 00:26:45 +00:00
|
|
|
*
|
2010-05-21 11:29:53 -07:00
|
|
|
* In case of success %0 is returned and @buf and @size are updated with
|
|
|
|
* the amount of bytes read. If @tr is non-NULL and a trailing
|
|
|
|
* character exists (size is non-zero after returning from this
|
|
|
|
* function), @tr is updated with the trailing character.
|
2010-05-05 00:26:45 +00:00
|
|
|
*/
|
|
|
|
static int proc_get_long(char **buf, size_t *size,
|
|
|
|
unsigned long *val, bool *neg,
|
|
|
|
const char *perm_tr, unsigned perm_tr_len, char *tr)
|
|
|
|
{
|
|
|
|
int len;
|
|
|
|
char *p, tmp[TMPBUFLEN];
|
|
|
|
|
|
|
|
if (!*size)
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
len = *size;
|
|
|
|
if (len > TMPBUFLEN - 1)
|
|
|
|
len = TMPBUFLEN - 1;
|
|
|
|
|
|
|
|
memcpy(tmp, *buf, len);
|
|
|
|
|
|
|
|
tmp[len] = 0;
|
|
|
|
p = tmp;
|
|
|
|
if (*p == '-' && *size > 1) {
|
|
|
|
*neg = true;
|
|
|
|
p++;
|
|
|
|
} else
|
|
|
|
*neg = false;
|
|
|
|
if (!isdigit(*p))
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
*val = simple_strtoul(p, &p, 0);
|
|
|
|
|
|
|
|
len = p - tmp;
|
|
|
|
|
|
|
|
/* We don't know if the next char is whitespace thus we may accept
|
|
|
|
* invalid integers (e.g. 1234...a) or two integers instead of one
|
|
|
|
* (e.g. 123...1). So let's not allow such large numbers. */
|
|
|
|
if (len == TMPBUFLEN - 1)
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
if (len < *size && perm_tr_len && !memchr(perm_tr, *p, perm_tr_len))
|
|
|
|
return -EINVAL;
|
2005-04-16 15:20:36 -07:00
|
|
|
|
2010-05-05 00:26:45 +00:00
|
|
|
if (tr && (len < *size))
|
|
|
|
*tr = *p;
|
|
|
|
|
|
|
|
*buf += len;
|
|
|
|
*size -= len;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
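The parsing rules of proc_get_long() can be tried outside the kernel with a minimal userspace sketch. Everything named sketch_* or SKETCH_* is hypothetical and not part of sysctl.c. The C library's strtoul() stands in for simple_strtoul(), and the @tr output is left out for brevity.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_TMPBUFLEN 22

/* Hypothetical userspace analogue of proc_get_long(); not kernel code. */
static int sketch_get_long(const char **buf, size_t *size,
			   unsigned long *val, bool *neg,
			   const char *perm_tr, size_t perm_tr_len)
{
	char tmp[SKETCH_TMPBUFLEN], *p;
	size_t len;

	assert(buf && *buf && size && val && neg);
	if (!*size)
		return -EINVAL;

	/* Work on a bounded, NUL-terminated copy of the input. */
	len = *size;
	if (len > SKETCH_TMPBUFLEN - 1)
		len = SKETCH_TMPBUFLEN - 1;
	memcpy(tmp, *buf, len);
	tmp[len] = '\0';

	p = tmp;
	*neg = false;
	if (*p == '-' && *size > 1) {
		*neg = true;
		p++;
	}
	if (!isdigit((unsigned char)*p))
		return -EINVAL;

	/* Base 0 accepts decimal, 0x-prefixed hex and 0-prefixed octal. */
	*val = strtoul(p, &p, 0);
	len = (size_t)(p - tmp);

	/* The number filled the whole copy, so it may have been cut short. */
	if (len == SKETCH_TMPBUFLEN - 1)
		return -EINVAL;

	/* A character following the number must be a permitted trailer. */
	if (len < *size && perm_tr_len && !memchr(perm_tr, *p, perm_tr_len))
		return -EINVAL;

	/* Advance the caller's cursor past the consumed characters. */
	*buf += len;
	*size -= len;
	return 0;
}
```

Calling it on "-42 7" with the whitespace trailers should yield 42 with the sign flag set and leave the cursor on the space, while "12x" should be rejected because 'x' is not an allowed trailer.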
|
|
|
|
|
|
|
|
/**
|
2010-05-21 11:29:53 -07:00
|
|
|
* proc_put_long - converts an integer to a decimal ASCII formatted string
|
2010-05-05 00:26:45 +00:00
|
|
|
*
|
2010-05-21 11:29:53 -07:00
|
|
|
* @buf: the user buffer
|
|
|
|
* @size: the size of the user buffer
|
|
|
|
* @val: the integer to be converted
|
|
|
|
* @neg: sign of the number, %TRUE for negative
|
2010-05-05 00:26:45 +00:00
|
|
|
*
|
2010-05-21 11:29:53 -07:00
|
|
|
* In case of success %0 is returned and @buf and @size are updated with
|
|
|
|
* the amount of bytes written.
|
2010-05-05 00:26:45 +00:00
|
|
|
*/
|
|
|
|
static int proc_put_long(void __user **buf, size_t *size, unsigned long val,
|
|
|
|
bool neg)
|
|
|
|
{
|
|
|
|
int len;
|
|
|
|
char tmp[TMPBUFLEN], *p = tmp;
|
|
|
|
|
|
|
|
sprintf(p, "%s%lu", neg ? "-" : "", val);
|
|
|
|
len = strlen(tmp);
|
|
|
|
if (len > *size)
|
|
|
|
len = *size;
|
|
|
|
if (copy_to_user(*buf, tmp, len))
|
|
|
|
return -EFAULT;
|
|
|
|
*size -= len;
|
|
|
|
*buf += len;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
#undef TMPBUFLEN
|
|
|
|
|
|
|
|
static int proc_put_char(void __user **buf, size_t *size, char c)
|
|
|
|
{
|
|
|
|
if (*size) {
|
|
|
|
char __user **buffer = (char __user **)buf;
|
|
|
|
if (put_user(c, *buffer))
|
|
|
|
return -EFAULT;
|
|
|
|
(*size)--, (*buffer)++;
|
|
|
|
*buf = *buffer;
|
|
|
|
}
|
|
|
|
return 0;
|
|
|
|
}
|
2005-04-16 15:20:36 -07:00
|
|
|
|
2010-05-05 00:26:45 +00:00
|
|
|
static int do_proc_dointvec_conv(bool *negp, unsigned long *lvalp,
|
2005-04-16 15:20:36 -07:00
|
|
|
int *valp,
|
|
|
|
int write, void *data)
|
|
|
|
{
|
|
|
|
if (write) {
|
2016-06-14 14:58:43 -06:00
|
|
|
*valp = *negp ? -*lvalp : *lvalp;
|
2005-04-16 15:20:36 -07:00
|
|
|
} else {
|
|
|
|
int val = *valp;
|
|
|
|
if (val < 0) {
|
2010-05-05 00:26:45 +00:00
|
|
|
*negp = true;
|
2015-09-09 15:39:06 -07:00
|
|
|
*lvalp = -(unsigned long)val;
|
2005-04-16 15:20:36 -07:00
|
|
|
} else {
|
2010-05-05 00:26:45 +00:00
|
|
|
*negp = false;
|
2005-04-16 15:20:36 -07:00
|
|
|
*lvalp = (unsigned long)val;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
return 0;
|
|
|
|
}
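The read branch above splits a signed int into a sign flag and an unsigned magnitude. A small standalone sketch (the helper name sketch_magnitude is hypothetical) shows why the cast comes before the negation: negating the int directly would overflow for INT_MIN, whereas unsigned arithmetic is well defined.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical helper: split a signed int into sign flag and magnitude. */
static unsigned long sketch_magnitude(int val, bool *neg)
{
	*neg = val < 0;
	/*
	 * Widen to unsigned long first, then negate. Unsigned negation
	 * wraps modulo 2^N, so INT_MIN maps to INT_MAX + 1 without
	 * signed overflow.
	 */
	return *neg ? -(unsigned long)val : (unsigned long)val;
}
```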
|
|
|
|
|
2010-05-05 00:26:45 +00:00
|
|
|
static const char proc_wspace_sep[] = { ' ', '\t', '\n' };
|
|
|
|
|
2007-10-18 03:05:22 -07:00
|
|
|
static int __do_proc_dointvec(void *tbl_data, struct ctl_table *table,
|
2009-09-23 15:57:19 -07:00
|
|
|
int write, void __user *buffer,
|
2006-10-02 02:18:23 -07:00
|
|
|
size_t *lenp, loff_t *ppos,
|
2010-05-05 00:26:45 +00:00
|
|
|
int (*conv)(bool *negp, unsigned long *lvalp, int *valp,
|
2005-04-16 15:20:36 -07:00
|
|
|
int write, void *data),
|
|
|
|
void *data)
|
|
|
|
{
|
2010-05-05 00:26:45 +00:00
|
|
|
int *i, vleft, first = 1, err = 0;
|
|
|
|
unsigned long page = 0;
|
|
|
|
size_t left;
|
|
|
|
char *kbuf;
|
2005-04-16 15:20:36 -07:00
|
|
|
|
2010-05-05 00:26:45 +00:00
|
|
|
if (!tbl_data || !table->maxlen || !*lenp || (*ppos && !write)) {
|
2005-04-16 15:20:36 -07:00
|
|
|
*lenp = 0;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2006-10-02 02:18:23 -07:00
|
|
|
i = (int *) tbl_data;
|
2005-04-16 15:20:36 -07:00
|
|
|
vleft = table->maxlen / sizeof(*i);
|
|
|
|
left = *lenp;
|
|
|
|
|
|
|
|
if (!conv)
|
|
|
|
conv = do_proc_dointvec_conv;
|
|
|
|
|
2010-05-05 00:26:45 +00:00
|
|
|
if (write) {
|
2014-06-06 14:37:19 -07:00
|
|
|
if (*ppos) {
|
|
|
|
switch (sysctl_writes_strict) {
|
|
|
|
case SYSCTL_WRITES_STRICT:
|
|
|
|
goto out;
|
|
|
|
case SYSCTL_WRITES_WARN:
|
|
|
|
warn_sysctl_write(table);
|
|
|
|
break;
|
|
|
|
default:
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2010-05-05 00:26:45 +00:00
|
|
|
if (left > PAGE_SIZE - 1)
|
|
|
|
left = PAGE_SIZE - 1;
|
|
|
|
page = __get_free_page(GFP_TEMPORARY);
|
|
|
|
kbuf = (char *) page;
|
|
|
|
if (!kbuf)
|
|
|
|
return -ENOMEM;
|
|
|
|
if (copy_from_user(kbuf, buffer, left)) {
|
|
|
|
err = -EFAULT;
|
|
|
|
goto free;
|
|
|
|
}
|
|
|
|
kbuf[left] = 0;
|
|
|
|
}
|
|
|
|
|
2005-04-16 15:20:36 -07:00
|
|
|
for (; left && vleft--; i++, first=0) {
|
2010-05-05 00:26:45 +00:00
|
|
|
unsigned long lval;
|
|
|
|
bool neg;
|
2005-04-16 15:20:36 -07:00
|
|
|
|
2010-05-05 00:26:45 +00:00
|
|
|
if (write) {
|
|
|
|
left -= proc_skip_spaces(&kbuf);
|
2005-04-16 15:20:36 -07:00
|
|
|
|
2010-05-25 16:10:14 -07:00
|
|
|
if (!left)
|
|
|
|
break;
|
2010-05-05 00:26:45 +00:00
|
|
|
err = proc_get_long(&kbuf, &left, &lval, &neg,
|
|
|
|
proc_wspace_sep,
|
|
|
|
sizeof(proc_wspace_sep), NULL);
|
|
|
|
if (err)
|
2005-04-16 15:20:36 -07:00
|
|
|
break;
|
2010-05-05 00:26:45 +00:00
|
|
|
if (conv(&neg, &lval, i, 1, data)) {
|
|
|
|
err = -EINVAL;
|
2005-04-16 15:20:36 -07:00
|
|
|
break;
|
2010-05-05 00:26:45 +00:00
|
|
|
}
|
2005-04-16 15:20:36 -07:00
|
|
|
} else {
|
2010-05-05 00:26:45 +00:00
|
|
|
if (conv(&neg, &lval, i, 0, data)) {
|
|
|
|
err = -EINVAL;
|
|
|
|
break;
|
|
|
|
}
|
2005-04-16 15:20:36 -07:00
|
|
|
if (!first)
|
2010-05-05 00:26:45 +00:00
|
|
|
err = proc_put_char(&buffer, &left, '\t');
|
|
|
|
if (err)
|
|
|
|
break;
|
|
|
|
err = proc_put_long(&buffer, &left, lval, neg);
|
|
|
|
if (err)
|
2005-04-16 15:20:36 -07:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2010-05-05 00:26:45 +00:00
|
|
|
if (!write && !first && left && !err)
|
|
|
|
err = proc_put_char(&buffer, &left, '\n');
|
2010-05-25 16:10:14 -07:00
|
|
|
if (write && !err && left)
|
2010-05-05 00:26:45 +00:00
|
|
|
left -= proc_skip_spaces(&kbuf);
|
|
|
|
free:
|
2005-04-16 15:20:36 -07:00
|
|
|
if (write) {
|
2010-05-05 00:26:45 +00:00
|
|
|
free_page(page);
|
|
|
|
if (first)
|
|
|
|
return err ? : -EINVAL;
|
2005-04-16 15:20:36 -07:00
|
|
|
}
|
|
|
|
*lenp -= left;
|
2014-06-06 14:37:19 -07:00
|
|
|
out:
|
2005-04-16 15:20:36 -07:00
|
|
|
*ppos += *lenp;
|
2010-05-05 00:26:45 +00:00
|
|
|
return err;
|
2005-04-16 15:20:36 -07:00
|
|
|
}
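The handling of a non-zero file position on numeric writes, as done by the switch on sysctl_writes_strict in the function above, can be summarised as a small decision function. This is only a sketch: the sketch_* names and the enum are hypothetical, and the numeric values follow the commit description (-1 legacy without warning, 0 warn, 1 strict).

```c
#include <stdbool.h>

/* Hypothetical mirror of the three sysctl_writes_strict settings. */
enum sketch_writes_mode {
	SKETCH_WRITES_LEGACY = -1,	/* accept the write silently */
	SKETCH_WRITES_WARN   = 0,	/* accept the write, but warn */
	SKETCH_WRITES_STRICT = 1,	/* skip parsing altogether */
};

struct sketch_decision {
	bool parse;	/* should the buffer be parsed and stored? */
	bool warn;	/* should a warning be emitted? */
};

/* Decide how a numeric sysctl write at file position @pos is treated. */
static struct sketch_decision sketch_numeric_write(int mode, long long pos)
{
	struct sketch_decision d = { .parse = true, .warn = false };

	if (!pos)
		return d;	/* position 0 is always a plain write */

	switch (mode) {
	case SKETCH_WRITES_STRICT:
		d.parse = false;	/* corresponds to "goto out" */
		break;
	case SKETCH_WRITES_WARN:
		d.warn = true;
		break;
	default:
		break;
	}
	return d;
}
```

Note that in strict mode the kernel function still advances the file position by the full length and returns success, so the caller sees the write as consumed even though nothing was stored.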
|
|
|
|
|
2009-09-23 15:57:19 -07:00
|
|
|
static int do_proc_dointvec(struct ctl_table *table, int write,
|
2006-10-02 02:18:23 -07:00
|
|
|
void __user *buffer, size_t *lenp, loff_t *ppos,
|
2010-05-05 00:26:45 +00:00
|
|
|
int (*conv)(bool *negp, unsigned long *lvalp, int *valp,
|
2006-10-02 02:18:23 -07:00
|
|
|
int write, void *data),
|
|
|
|
void *data)
|
|
|
|
{
|
2009-09-23 15:57:19 -07:00
|
|
|
return __do_proc_dointvec(table->data, table, write,
|
2006-10-02 02:18:23 -07:00
|
|
|
buffer, lenp, ppos, conv, data);
|
|
|
|
}
|
|
|
|
|
2005-04-16 15:20:36 -07:00
|
|
|
/**
|
|
|
|
* proc_dointvec - read a vector of integers
|
|
|
|
* @table: the sysctl table
|
|
|
|
* @write: %TRUE if this is a write to the sysctl file
|
|
|
|
* @buffer: the user buffer
|
|
|
|
* @lenp: the size of the user buffer
|
|
|
|
* @ppos: file position
|
|
|
|
*
|
|
|
|
* Reads/writes up to table->maxlen/sizeof(unsigned int) integer
|
|
|
|
* values from/to the user buffer, treated as an ASCII string.
|
|
|
|
*
|
|
|
|
* Returns 0 on success.
|
|
|
|
*/
|
2009-09-23 15:57:19 -07:00
|
|
|
int proc_dointvec(struct ctl_table *table, int write,
|
2005-04-16 15:20:36 -07:00
|
|
|
void __user *buffer, size_t *lenp, loff_t *ppos)
|
|
|
|
{
|
2009-09-23 15:57:19 -07:00
|
|
|
return do_proc_dointvec(table,write,buffer,lenp,ppos,
|
2005-04-16 15:20:36 -07:00
|
|
|
NULL,NULL);
|
|
|
|
}
|
|
|
|
|
2007-02-10 01:45:24 -08:00
|
|
|
/*
|
2008-10-15 22:01:41 -07:00
|
|
|
* Taint values can only be increased.
|
|
|
|
* This means we can safely use a temporary.
|
2007-02-10 01:45:24 -08:00
|
|
|
*/
|
2009-09-23 15:57:19 -07:00
|
|
|
static int proc_taint(struct ctl_table *table, int write,
|
2007-02-10 01:45:24 -08:00
|
|
|
void __user *buffer, size_t *lenp, loff_t *ppos)
|
|
|
|
{
|
2008-10-15 22:01:41 -07:00
|
|
|
struct ctl_table t;
|
|
|
|
unsigned long tmptaint = get_taint();
|
|
|
|
int err;
|
2007-02-10 01:45:24 -08:00
|
|
|
|
2007-04-23 14:41:14 -07:00
|
|
|
if (write && !capable(CAP_SYS_ADMIN))
|
2007-02-10 01:45:24 -08:00
|
|
|
return -EPERM;
|
|
|
|
|
2008-10-15 22:01:41 -07:00
|
|
|
t = *table;
|
|
|
|
t.data = &tmptaint;
|
2009-09-23 15:57:19 -07:00
|
|
|
err = proc_doulongvec_minmax(&t, write, buffer, lenp, ppos);
|
2008-10-15 22:01:41 -07:00
|
|
|
if (err < 0)
|
|
|
|
return err;
|
|
|
|
|
|
|
|
if (write) {
|
|
|
|
/*
|
|
|
|
* Poor man's atomic or. Not worth adding a primitive
|
|
|
|
* to everyone's atomic.h for this
|
|
|
|
*/
|
|
|
|
int i;
|
|
|
|
for (i = 0; i < BITS_PER_LONG && tmptaint >> i; i++) {
|
|
|
|
if ((tmptaint >> i) & 1)
|
2013-01-21 17:17:39 +10:30
|
|
|
add_taint(i, LOCKDEP_STILL_OK);
|
2008-10-15 22:01:41 -07:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
return err;
|
2007-02-10 01:45:24 -08:00
|
|
|
}
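The "poor man's atomic or" loop in proc_taint() walks the requested mask one bit at a time and stops as soon as no higher bits remain. A standalone sketch of the same loop shape follows; sketch_or_bits and SKETCH_BITS_PER_LONG are hypothetical names, and a plain bitwise OR stands in for the add_taint() call.

```c
#include <limits.h>

#define SKETCH_BITS_PER_LONG ((unsigned int)(sizeof(unsigned long) * CHAR_BIT))

/* Hypothetical helper: merge @requested into @flags one bit at a time. */
static unsigned long sketch_or_bits(unsigned long flags, unsigned long requested)
{
	unsigned int i;

	/* Stop early once no higher bits remain set in the request. */
	for (i = 0; i < SKETCH_BITS_PER_LONG && requested >> i; i++) {
		if ((requested >> i) & 1)
			flags |= 1UL << i;	/* stands in for add_taint(i, ...) */
	}
	return flags;
}
```

Because bits are only ever set, a request can add flags but never clear one, which matches the comment that taint values can only be increased.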
|
|
|
|
|
2011-03-23 16:43:11 -07:00
|
|
|
#ifdef CONFIG_PRINTK
|
2012-04-04 11:40:19 -07:00
|
|
|
static int proc_dointvec_minmax_sysadmin(struct ctl_table *table, int write,
|
2011-03-23 16:43:11 -07:00
|
|
|
void __user *buffer, size_t *lenp, loff_t *ppos)
|
|
|
|
{
|
|
|
|
if (write && !capable(CAP_SYS_ADMIN))
|
|
|
|
return -EPERM;
|
|
|
|
|
|
|
|
return proc_dointvec_minmax(table, write, buffer, lenp, ppos);
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2005-04-16 15:20:36 -07:00
|
|
|
struct do_proc_dointvec_minmax_conv_param {
|
|
|
|
int *min;
|
|
|
|
int *max;
|
|
|
|
};
|
|
|
|
|
2010-05-05 00:26:45 +00:00
|
|
|
static int do_proc_dointvec_minmax_conv(bool *negp, unsigned long *lvalp,
|
|
|
|
int *valp,
|
2005-04-16 15:20:36 -07:00
|
|
|
int write, void *data)
|
|
|
|
{
|
|
|
|
struct do_proc_dointvec_minmax_conv_param *param = data;
|
|
|
|
if (write) {
|
|
|
|
int val = *negp ? -*lvalp : *lvalp;
|
|
|
|
if ((param->min && *param->min > val) ||
|
|
|
|
(param->max && *param->max < val))
|
|
|
|
return -EINVAL;
|
|
|
|
*valp = val;
|
|
|
|
} else {
|
|
|
|
int val = *valp;
|
|
|
|
if (val < 0) {
|
2010-05-05 00:26:45 +00:00
|
|
|
*negp = true;
|
2015-09-09 15:39:06 -07:00
|
|
|
*lvalp = -(unsigned long)val;
|
2005-04-16 15:20:36 -07:00
|
|
|
} else {
|
2010-05-05 00:26:45 +00:00
|
|
|
*negp = false;
|
2005-04-16 15:20:36 -07:00
|
|
|
*lvalp = (unsigned long)val;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* proc_dointvec_minmax - read a vector of integers with min/max values
|
|
|
|
* @table: the sysctl table
|
|
|
|
* @write: %TRUE if this is a write to the sysctl file
|
|
|
|
* @buffer: the user buffer
|
|
|
|
* @lenp: the size of the user buffer
|
|
|
|
* @ppos: file position
|
|
|
|
*
|
|
|
|
* Reads/writes up to table->maxlen/sizeof(unsigned int) integer
|
|
|
|
* values from/to the user buffer, treated as an ASCII string.
|
|
|
|
*
|
|
|
|
* This routine will ensure the values are within the range specified by
|
|
|
|
* table->extra1 (min) and table->extra2 (max).
|
|
|
|
*
|
|
|
|
* Returns 0 on success.
|
|
|
|
*/
|
2009-09-23 15:57:19 -07:00
|
|
|
int proc_dointvec_minmax(struct ctl_table *table, int write,
|
2005-04-16 15:20:36 -07:00
|
|
|
void __user *buffer, size_t *lenp, loff_t *ppos)
|
|
|
|
{
|
|
|
|
struct do_proc_dointvec_minmax_conv_param param = {
|
|
|
|
.min = (int *) table->extra1,
|
|
|
|
.max = (int *) table->extra2,
|
|
|
|
};
|
2009-09-23 15:57:19 -07:00
|
|
|
return do_proc_dointvec(table, write, buffer, lenp, ppos,
|
2005-04-16 15:20:36 -07:00
|
|
|
do_proc_dointvec_minmax_conv, &param);
|
|
|
|
}
|
|
|
|
|
2012-07-30 14:39:18 -07:00
|
|
|
static void validate_coredump_safety(void)
|
|
|
|
{
|
2012-10-04 17:15:23 -07:00
|
|
|
#ifdef CONFIG_COREDUMP
|
2013-02-27 17:03:15 -08:00
|
|
|
if (suid_dumpable == SUID_DUMP_ROOT &&
|
2012-07-30 14:39:18 -07:00
|
|
|
core_pattern[0] != '/' && core_pattern[0] != '|') {
|
|
|
|
printk(KERN_WARNING "Unsafe core_pattern used with "\
|
|
|
|
"suid_dumpable=2. Pipe handler or fully qualified "\
|
|
|
|
"core dump path required.\n");
|
|
|
|
}
|
2012-10-04 17:15:23 -07:00
|
|
|
#endif
|
2012-07-30 14:39:18 -07:00
|
|
|
}
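The check in validate_coredump_safety() reduces to a predicate on the first character of the core pattern. A minimal sketch, assuming a hypothetical helper name and a caller that has already established suid_dumpable == 2:

```c
#include <stdbool.h>

/*
 * Hypothetical predicate: a core pattern is treated as safe when it is
 * a fully qualified path ('/') or a pipe handler ('|').
 */
static bool sketch_core_pattern_is_safe(const char *pattern)
{
	return pattern[0] == '/' || pattern[0] == '|';
}
```

A relative pattern such as "core" fails the predicate, which is the case the kernel warning is meant to flag.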
|
|
|
|
|
|
|
|
static int proc_dointvec_minmax_coredump(struct ctl_table *table, int write,
|
|
|
|
void __user *buffer, size_t *lenp, loff_t *ppos)
|
|
|
|
{
|
|
|
|
int error = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
|
|
|
|
if (!error)
|
|
|
|
validate_coredump_safety();
|
|
|
|
return error;
|
|
|
|
}
|
|
|
|
|
2012-10-04 17:15:23 -07:00
|
|
|
#ifdef CONFIG_COREDUMP
|
2012-07-30 14:39:18 -07:00
|
|
|
static int proc_dostring_coredump(struct ctl_table *table, int write,
|
|
|
|
void __user *buffer, size_t *lenp, loff_t *ppos)
|
|
|
|
{
|
|
|
|
int error = proc_dostring(table, write, buffer, lenp, ppos);
|
|
|
|
if (!error)
|
|
|
|
validate_coredump_safety();
|
|
|
|
return error;
|
|
|
|
}
|
2012-10-04 17:15:23 -07:00
|
|
|
#endif
|
2012-07-30 14:39:18 -07:00
|
|
|
|
2007-10-18 03:05:22 -07:00
|
|
|
static int __do_proc_doulongvec_minmax(void *data, struct ctl_table *table, int write,
|
2005-04-16 15:20:36 -07:00
|
|
|
void __user *buffer,
|
|
|
|
size_t *lenp, loff_t *ppos,
|
|
|
|
unsigned long convmul,
|
|
|
|
unsigned long convdiv)
|
|
|
|
{
|
2010-05-05 00:26:45 +00:00
|
|
|
unsigned long *i, *min, *max;
|
|
|
|
int vleft, first = 1, err = 0;
|
|
|
|
unsigned long page = 0;
|
|
|
|
size_t left;
|
|
|
|
char *kbuf;
|
|
|
|
|
|
|
|
if (!data || !table->maxlen || !*lenp || (*ppos && !write)) {
|
2005-04-16 15:20:36 -07:00
|
|
|
*lenp = 0;
|
|
|
|
return 0;
|
|
|
|
}
|
2010-05-05 00:26:45 +00:00
|
|
|
|
2006-10-02 02:18:23 -07:00
|
|
|
i = (unsigned long *) data;
|
2005-04-16 15:20:36 -07:00
|
|
|
min = (unsigned long *) table->extra1;
|
|
|
|
max = (unsigned long *) table->extra2;
|
|
|
|
vleft = table->maxlen / sizeof(unsigned long);
|
|
|
|
left = *lenp;
|
2010-05-05 00:26:45 +00:00
|
|
|
|
|
|
|
if (write) {
|
2014-06-06 14:37:19 -07:00
|
|
|
if (*ppos) {
|
|
|
|
switch (sysctl_writes_strict) {
|
|
|
|
case SYSCTL_WRITES_STRICT:
|
|
|
|
goto out;
|
|
|
|
case SYSCTL_WRITES_WARN:
|
|
|
|
warn_sysctl_write(table);
|
|
|
|
break;
|
|
|
|
default:
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2010-05-05 00:26:45 +00:00
|
|
|
if (left > PAGE_SIZE - 1)
|
|
|
|
left = PAGE_SIZE - 1;
|
|
|
|
page = __get_free_page(GFP_TEMPORARY);
|
|
|
|
kbuf = (char *) page;
|
|
|
|
if (!kbuf)
|
|
|
|
return -ENOMEM;
|
|
|
|
if (copy_from_user(kbuf, buffer, left)) {
|
|
|
|
err = -EFAULT;
|
|
|
|
goto free;
|
|
|
|
}
|
|
|
|
kbuf[left] = 0;
|
|
|
|
}
|
|
|
|
|
2010-10-07 12:59:29 -07:00
|
|
|
for (; left && vleft--; i++, first = 0) {
|
2010-05-05 00:26:45 +00:00
|
|
|
unsigned long val;
|
|
|
|
|
2005-04-16 15:20:36 -07:00
|
|
|
if (write) {
|
2010-05-05 00:26:45 +00:00
|
|
|
bool neg;
|
|
|
|
|
|
|
|
left -= proc_skip_spaces(&kbuf);
|
|
|
|
|
|
|
|
err = proc_get_long(&kbuf, &left, &val, &neg,
|
|
|
|
proc_wspace_sep,
|
|
|
|
sizeof(proc_wspace_sep), NULL);
|
|
|
|
if (err)
|
2005-04-16 15:20:36 -07:00
|
|
|
break;
|
|
|
|
if (neg)
|
|
|
|
continue;
val = convmul * val / convdiv;
|
|
|
|
if ((min && val < *min) || (max && val > *max))
|
|
|
|
continue;
|
|
|
|
*i = val;
|
|
|
|
} else {
|
2010-05-05 00:26:45 +00:00
|
|
|
val = convdiv * (*i) / convmul;
|
2013-11-12 15:11:21 -08:00
|
|
|
if (!first) {
|
2010-05-05 00:26:45 +00:00
|
|
|
err = proc_put_char(&buffer, &left, '\t');
|
2013-11-12 15:11:21 -08:00
|
|
|
if (err)
|
|
|
|
break;
|
|
|
|
}
|
2010-05-05 00:26:45 +00:00
|
|
|
err = proc_put_long(&buffer, &left, val, false);
|
|
|
|
if (err)
|
|
|
|
break;
|
2005-04-16 15:20:36 -07:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2010-05-05 00:26:45 +00:00
|
|
|
if (!write && !first && left && !err)
|
|
|
|
err = proc_put_char(&buffer, &left, '\n');
|
|
|
|
if (write && !err)
|
|
|
|
left -= proc_skip_spaces(&kbuf);
|
|
|
|
free:
|
2005-04-16 15:20:36 -07:00
|
|
|
if (write) {
|
2010-05-05 00:26:45 +00:00
|
|
|
free_page(page);
|
|
|
|
if (first)
|
|
|
|
return err ? : -EINVAL;
|
2005-04-16 15:20:36 -07:00
|
|
|
}
|
|
|
|
*lenp -= left;
|
2014-06-06 14:37:19 -07:00
|
|
|
out:
|
2005-04-16 15:20:36 -07:00
|
|
|
*ppos += *lenp;
|
2010-05-05 00:26:45 +00:00
|
|
|
return err;
|
2005-04-16 15:20:36 -07:00
|
|
|
}
|
|
|
|
|
2007-10-18 03:05:22 -07:00
|
|
|
static int do_proc_doulongvec_minmax(struct ctl_table *table, int write,
|
2006-10-02 02:18:23 -07:00
|
|
|
void __user *buffer,
|
|
|
|
size_t *lenp, loff_t *ppos,
|
|
|
|
unsigned long convmul,
|
|
|
|
unsigned long convdiv)
|
|
|
|
{
|
|
|
|
return __do_proc_doulongvec_minmax(table->data, table, write,
|
2009-09-23 15:57:19 -07:00
|
|
|
buffer, lenp, ppos, convmul, convdiv);
|
2006-10-02 02:18:23 -07:00
|
|
|
}
|
|
|
|
|
2005-04-16 15:20:36 -07:00
|
|
|
/**
|
|
|
|
* proc_doulongvec_minmax - read a vector of long integers with min/max values
|
|
|
|
* @table: the sysctl table
|
|
|
|
* @write: %TRUE if this is a write to the sysctl file
|
|
|
|
* @buffer: the user buffer
|
|
|
|
* @lenp: the size of the user buffer
|
|
|
|
* @ppos: file position
|
|
|
|
*
|
|
|
|
* Reads/writes up to table->maxlen/sizeof(unsigned long) unsigned long
|
|
|
|
* values from/to the user buffer, treated as an ASCII string.
|
|
|
|
*
|
|
|
|
* This routine will ensure the values are within the range specified by
|
|
|
|
* table->extra1 (min) and table->extra2 (max).
|
|
|
|
*
|
|
|
|
* Returns 0 on success.
|
|
|
|
*/
|
2009-09-23 15:57:19 -07:00
|
|
|
int proc_doulongvec_minmax(struct ctl_table *table, int write,
|
2005-04-16 15:20:36 -07:00
|
|
|
void __user *buffer, size_t *lenp, loff_t *ppos)
|
|
|
|
{
|
2009-09-23 15:57:19 -07:00
|
|
|
return do_proc_doulongvec_minmax(table, write, buffer, lenp, ppos, 1l, 1l);
|
2005-04-16 15:20:36 -07:00
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* proc_doulongvec_ms_jiffies_minmax - read a vector of millisecond values with min/max values
|
|
|
|
* @table: the sysctl table
|
|
|
|
* @write: %TRUE if this is a write to the sysctl file
|
|
|
|
* @buffer: the user buffer
|
|
|
|
* @lenp: the size of the user buffer
|
|
|
|
* @ppos: file position
|
|
|
|
*
|
|
|
|
* Reads/writes up to table->maxlen/sizeof(unsigned long) unsigned long
|
|
|
|
* values from/to the user buffer, treated as an ASCII string. The values
|
|
|
|
* are treated as milliseconds, and converted to jiffies when they are stored.
|
|
|
|
*
|
|
|
|
* This routine will ensure the values are within the range specified by
|
|
|
|
* table->extra1 (min) and table->extra2 (max).
|
|
|
|
*
|
|
|
|
* Returns 0 on success.
|
|
|
|
*/
|
2007-10-18 03:05:22 -07:00
|
|
|
int proc_doulongvec_ms_jiffies_minmax(struct ctl_table *table, int write,
|
2005-04-16 15:20:36 -07:00
|
|
|
void __user *buffer,
|
|
|
|
size_t *lenp, loff_t *ppos)
|
|
|
|
{
|
2009-09-23 15:57:19 -07:00
|
|
|
return do_proc_doulongvec_minmax(table, write, buffer,
|
2005-04-16 15:20:36 -07:00
|
|
|
lenp, ppos, HZ, 1000l);
|
|
|
|
}
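The convmul/convdiv pair passed above (HZ and 1000) amounts to a simple integer scaling between milliseconds and jiffies. The sketch below shows that arithmetic in isolation; the function names are hypothetical and the tick rate is passed as a parameter instead of using the kernel's HZ.

```c
/* Hypothetical helpers illustrating the convmul/convdiv scaling. */

/* User value in milliseconds to stored ticks: convmul * val / convdiv. */
static unsigned long sketch_ms_to_ticks(unsigned long ms, unsigned long hz)
{
	return hz * ms / 1000UL;
}

/* Stored ticks back to milliseconds: convdiv * val / convmul. */
static unsigned long sketch_ticks_to_ms(unsigned long ticks, unsigned long hz)
{
	return 1000UL * ticks / hz;
}
```

With a tick rate of 250, 2000 ms corresponds to 500 ticks, and values below one tick (for example 3 ms) truncate to 0 because the division is integral.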
|
|
|
|
|
|
|
|
|
2010-05-05 00:26:45 +00:00
|
|
|
static int do_proc_dointvec_jiffies_conv(bool *negp, unsigned long *lvalp,
|
2005-04-16 15:20:36 -07:00
|
|
|
int *valp,
|
|
|
|
int write, void *data)
|
|
|
|
{
|
|
|
|
if (write) {
|
2006-03-24 03:15:50 -08:00
|
|
|
if (*lvalp > LONG_MAX / HZ)
|
|
|
|
return 1;
|
2005-04-16 15:20:36 -07:00
|
|
|
*valp = *negp ? -(*lvalp*HZ) : (*lvalp*HZ);
|
|
|
|
} else {
|
|
|
|
int val = *valp;
|
|
|
|
unsigned long lval;
|
|
|
|
if (val < 0) {
|
2010-05-05 00:26:45 +00:00
|
|
|
*negp = true;
|
2015-09-09 15:39:06 -07:00
|
|
|
lval = -(unsigned long)val;
|
2005-04-16 15:20:36 -07:00
|
|
|
} else {
|
2010-05-05 00:26:45 +00:00
|
|
|
*negp = false;
|
2005-04-16 15:20:36 -07:00
|
|
|
lval = (unsigned long)val;
|
|
|
|
}
|
|
|
|
*lvalp = lval / HZ;
|
|
|
|
}
|
|
|
|
return 0;
|
|
|
|
}
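The write branch above rejects values whose product with HZ would not fit in a long by comparing against LONG_MAX / HZ before multiplying. A standalone sketch of that guard, with a hypothetical function name and the tick rate as a parameter:

```c
#include <limits.h>

/*
 * Hypothetical helper: convert seconds to ticks, returning 1 (as the
 * kernel conversion routine does) when the product would exceed
 * LONG_MAX, and 0 on success.
 */
static int sketch_seconds_to_ticks(unsigned long secs, unsigned long hz,
				   unsigned long *ticks)
{
	/* Divide first so the check itself cannot overflow. */
	if (secs > (unsigned long)LONG_MAX / hz)
		return 1;
	*ticks = secs * hz;
	return 0;
}
```

Checking with a division before the multiplication is the usual way to detect overflow without relying on wider integer types.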
|
|
|
|
|
2010-05-05 00:26:45 +00:00
|
|
|
static int do_proc_dointvec_userhz_jiffies_conv(bool *negp, unsigned long *lvalp,
|
2005-04-16 15:20:36 -07:00
|
|
|
int *valp,
|
|
|
|
int write, void *data)
|
|
|
|
{
|
|
|
|
if (write) {
|
2006-03-24 03:15:50 -08:00
|
|
|
if (USER_HZ < HZ && *lvalp > (LONG_MAX / HZ) * USER_HZ)
|
|
|
|
return 1;
|
2005-04-16 15:20:36 -07:00
|
|
|
*valp = clock_t_to_jiffies(*negp ? -*lvalp : *lvalp);
|
|
|
|
} else {
|
|
|
|
int val = *valp;
|
|
|
|
unsigned long lval;
|
|
|
|
if (val < 0) {
|
2010-05-05 00:26:45 +00:00
|
|
|
*negp = true;
|
2015-09-09 15:39:06 -07:00
|
|
|
lval = -(unsigned long)val;
|
2005-04-16 15:20:36 -07:00
|
|
|
} else {
|
2010-05-05 00:26:45 +00:00
|
|
|
*negp = false;
|
2005-04-16 15:20:36 -07:00
|
|
|
lval = (unsigned long)val;
|
|
|
|
}
|
|
|
|
*lvalp = jiffies_to_clock_t(lval);
|
|
|
|
}
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2010-05-05 00:26:45 +00:00
|
|
|
static int do_proc_dointvec_ms_jiffies_conv(bool *negp, unsigned long *lvalp,
|
2005-04-16 15:20:36 -07:00
|
|
|
int *valp,
|
|
|
|
int write, void *data)
|
|
|
|
{
|
|
|
|
if (write) {
|
2013-07-24 10:39:07 +02:00
|
|
|
unsigned long jif = msecs_to_jiffies(*negp ? -*lvalp : *lvalp);
|
|
|
|
|
|
|
|
if (jif > INT_MAX)
|
|
|
|
return 1;
|
|
|
|
*valp = (int)jif;
|
2005-04-16 15:20:36 -07:00
|
|
|
} else {
|
|
|
|
int val = *valp;
|
|
|
|
unsigned long lval;
|
|
|
|
if (val < 0) {
|
2010-05-05 00:26:45 +00:00
|
|
|
*negp = true;
|
2015-09-09 15:39:06 -07:00
|
|
|
lval = -(unsigned long)val;
|
2005-04-16 15:20:36 -07:00
|
|
|
} else {
|
2010-05-05 00:26:45 +00:00
|
|
|
*negp = false;
|
2005-04-16 15:20:36 -07:00
|
|
|
lval = (unsigned long)val;
|
|
|
|
}
|
|
|
|
*lvalp = jiffies_to_msecs(lval);
|
|
|
|
}
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* proc_dointvec_jiffies - read a vector of integers as seconds
|
|
|
|
* @table: the sysctl table
|
|
|
|
* @write: %TRUE if this is a write to the sysctl file
|
|
|
|
* @buffer: the user buffer
|
|
|
|
* @lenp: the size of the user buffer
|
|
|
|
* @ppos: file position
|
|
|
|
*
|
|
|
|
* Reads/writes up to table->maxlen/sizeof(unsigned int) integer
|
|
|
|
* values from/to the user buffer, treated as an ASCII string.
|
|
|
|
* The values read are assumed to be in seconds, and are converted into
|
|
|
|
* jiffies.
|
|
|
|
*
|
|
|
|
* Returns 0 on success.
|
|
|
|
*/
|
2009-09-23 15:57:19 -07:00
|
|
|
int proc_dointvec_jiffies(struct ctl_table *table, int write,
|
2005-04-16 15:20:36 -07:00
|
|
|
void __user *buffer, size_t *lenp, loff_t *ppos)
|
|
|
|
{
|
2009-09-23 15:57:19 -07:00
|
|
|
return do_proc_dointvec(table,write,buffer,lenp,ppos,
|
2005-04-16 15:20:36 -07:00
|
|
|
do_proc_dointvec_jiffies_conv,NULL);
|
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* proc_dointvec_userhz_jiffies - read a vector of integers as 1/USER_HZ seconds
|
|
|
|
* @table: the sysctl table
|
|
|
|
* @write: %TRUE if this is a write to the sysctl file
|
|
|
|
* @buffer: the user buffer
|
|
|
|
* @lenp: the size of the user buffer
|
2005-11-07 01:01:06 -08:00
|
|
|
* @ppos: pointer to the file position
|
2005-04-16 15:20:36 -07:00
|
|
|
*
|
|
|
|
* Reads/writes up to table->maxlen/sizeof(unsigned int) integer
|
|
|
|
* values from/to the user buffer, treated as an ASCII string.
|
|
|
|
* The values read are assumed to be in 1/USER_HZ seconds, and
|
|
|
|
* are converted into jiffies.
|
|
|
|
*
|
|
|
|
* Returns 0 on success.
|
|
|
|
*/
|
2009-09-23 15:57:19 -07:00
|
|
|
int proc_dointvec_userhz_jiffies(struct ctl_table *table, int write,
|
2005-04-16 15:20:36 -07:00
|
|
|
void __user *buffer, size_t *lenp, loff_t *ppos)
|
|
|
|
{
|
2009-09-23 15:57:19 -07:00
|
|
|
return do_proc_dointvec(table, write, buffer, lenp, ppos,
|
2005-04-16 15:20:36 -07:00
|
|
|
do_proc_dointvec_userhz_jiffies_conv, NULL);
|
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* proc_dointvec_ms_jiffies - read a vector of integers as milliseconds
|
|
|
|
* @table: the sysctl table
|
|
|
|
* @write: %TRUE if this is a write to the sysctl file
|
|
|
|
* @buffer: the user buffer
|
|
|
|
* @lenp: the size of the user buffer
|
2005-05-01 08:59:26 -07:00
|
|
|
* @ppos: the current position in the file
|
2005-04-16 15:20:36 -07:00
|
|
|
*
|
|
|
|
* Reads/writes up to table->maxlen/sizeof(unsigned int) integer
|
|
|
|
* values from/to the user buffer, treated as an ASCII string.
|
|
|
|
* The values read are assumed to be in 1/1000 seconds, and
|
|
|
|
* are converted into jiffies.
|
|
|
|
*
|
|
|
|
* Returns 0 on success.
|
|
|
|
*/
|
2009-09-23 15:57:19 -07:00
|
|
|
int proc_dointvec_ms_jiffies(struct ctl_table *table, int write,
|
2005-04-16 15:20:36 -07:00
|
|
|
void __user *buffer, size_t *lenp, loff_t *ppos)
|
|
|
|
{
|
2009-09-23 15:57:19 -07:00
|
|
|
return do_proc_dointvec(table, write, buffer, lenp, ppos,
|
2005-04-16 15:20:36 -07:00
|
|
|
do_proc_dointvec_ms_jiffies_conv, NULL);
|
|
|
|
}
|
|
|
|
|
2009-09-23 15:57:19 -07:00
|
|
|
/* Read or update cad_pid, the process that is signalled on Ctrl-Alt-Del. */
static int proc_do_cad_pid(struct ctl_table *table, int write,
|
2006-10-02 02:19:00 -07:00
|
|
|
void __user *buffer, size_t *lenp, loff_t *ppos)
|
|
|
|
{
|
|
|
|
struct pid *new_pid;
|
|
|
|
pid_t tmp;
|
|
|
|
int r;
|
|
|
|
|
2008-02-08 04:19:20 -08:00
|
|
|
tmp = pid_vnr(cad_pid);
|
2006-10-02 02:19:00 -07:00
|
|
|
|
2009-09-23 15:57:19 -07:00
|
|
|
r = __do_proc_dointvec(&tmp, table, write, buffer,
|
2006-10-02 02:19:00 -07:00
|
|
|
lenp, ppos, NULL, NULL);
|
|
|
|
if (r || !write)
|
|
|
|
return r;
|
|
|
|
|
|
|
|
new_pid = find_get_pid(tmp);
|
|
|
|
if (!new_pid)
|
|
|
|
return -ESRCH;
|
|
|
|
|
|
|
|
/* Install the new pid and drop the reference held on the old one. */
put_pid(xchg(&cad_pid, new_pid));
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2010-05-05 00:26:55 +00:00
|
|
|
/**
|
|
|
|
* proc_do_large_bitmap - read/write from/to a large bitmap
|
|
|
|
* @table: the sysctl table
|
|
|
|
* @write: %TRUE if this is a write to the sysctl file
|
|
|
|
* @buffer: the user buffer
|
|
|
|
* @lenp: the size of the user buffer
|
|
|
|
* @ppos: file position
|
|
|
|
*
|
|
|
|
* The bitmap is stored at table->data and the bitmap length (in bits)
|
|
|
|
* in table->maxlen.
|
|
|
|
*
|
|
|
|
* We use a comma-separated list of ranges (e.g. 1,3-4,10-10) so that
|
|
|
|
* large bitmaps may be represented in a compact manner. Writing into
|
|
|
|
* the file replaces the bitmap with the given input; a write at a
* non-zero file position ORs the input into the existing bitmap instead.
|
|
|
|
*
|
|
|
|
* Returns 0 on success.
|
|
|
|
*/
|
|
|
|
int proc_do_large_bitmap(struct ctl_table *table, int write,
|
|
|
|
void __user *buffer, size_t *lenp, loff_t *ppos)
|
|
|
|
{
|
|
|
|
int err = 0;
|
|
|
|
bool first = true;
|
|
|
|
size_t left = *lenp;
|
|
|
|
unsigned long bitmap_len = table->maxlen;
|
2014-05-12 16:04:53 -07:00
|
|
|
unsigned long *bitmap = *(unsigned long **) table->data;
|
2010-05-05 00:26:55 +00:00
|
|
|
unsigned long *tmp_bitmap = NULL;
|
|
|
|
char tr_a[] = { '-', ',', '\n' }, tr_b[] = { ',', '\n', 0 }, c;
|
|
|
|
|
2014-05-12 16:04:53 -07:00
|
|
|
if (!bitmap || !bitmap_len || !left || (*ppos && !write)) {
|
2010-05-05 00:26:55 +00:00
|
|
|
*lenp = 0;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (write) {
|
|
|
|
unsigned long page = 0;
|
|
|
|
char *kbuf;
|
|
|
|
|
|
|
|
if (left > PAGE_SIZE - 1)
|
|
|
|
left = PAGE_SIZE - 1;
|
|
|
|
|
|
|
|
page = __get_free_page(GFP_TEMPORARY);
|
|
|
|
kbuf = (char *) page;
|
|
|
|
if (!kbuf)
|
|
|
|
return -ENOMEM;
|
|
|
|
if (copy_from_user(kbuf, buffer, left)) {
|
|
|
|
free_page(page);
|
|
|
|
return -EFAULT;
|
|
|
|
}
|
|
|
|
kbuf[left] = 0;
|
|
|
|
|
|
|
|
tmp_bitmap = kcalloc(BITS_TO_LONGS(bitmap_len), sizeof(unsigned long),
|
|
|
|
GFP_KERNEL);
|
|
|
|
if (!tmp_bitmap) {
|
|
|
|
free_page(page);
|
|
|
|
return -ENOMEM;
|
|
|
|
}
|
|
|
|
proc_skip_char(&kbuf, &left, '\n');
|
|
|
|
while (!err && left) {
|
|
|
|
unsigned long val_a, val_b;
|
|
|
|
bool neg;
|
|
|
|
|
|
|
|
err = proc_get_long(&kbuf, &left, &val_a, &neg, tr_a,
|
|
|
|
sizeof(tr_a), &c);
|
|
|
|
if (err)
|
|
|
|
break;
|
|
|
|
if (val_a >= bitmap_len || neg) {
|
|
|
|
err = -EINVAL;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
val_b = val_a;
|
|
|
|
if (left) {
|
|
|
|
kbuf++;
|
|
|
|
left--;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (c == '-') {
|
|
|
|
err = proc_get_long(&kbuf, &left, &val_b,
|
|
|
|
&neg, tr_b, sizeof(tr_b),
|
|
|
|
&c);
|
|
|
|
if (err)
|
|
|
|
break;
|
|
|
|
if (val_b >= bitmap_len || neg ||
|
|
|
|
val_a > val_b) {
|
|
|
|
err = -EINVAL;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
if (left) {
|
|
|
|
kbuf++;
|
|
|
|
left--;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2012-03-28 14:42:50 -07:00
|
|
|
bitmap_set(tmp_bitmap, val_a, val_b - val_a + 1);
|
2010-05-05 00:26:55 +00:00
|
|
|
first = false;
|
|
|
|
proc_skip_char(&kbuf, &left, '\n');
|
|
|
|
}
|
|
|
|
free_page(page);
|
|
|
|
} else {
|
|
|
|
unsigned long bit_a, bit_b = 0;
|
|
|
|
|
|
|
|
while (left) {
|
|
|
|
bit_a = find_next_bit(bitmap, bitmap_len, bit_b);
|
|
|
|
if (bit_a >= bitmap_len)
|
|
|
|
break;
|
|
|
|
bit_b = find_next_zero_bit(bitmap, bitmap_len,
|
|
|
|
bit_a + 1) - 1;
|
|
|
|
|
|
|
|
if (!first) {
|
|
|
|
err = proc_put_char(&buffer, &left, ',');
|
|
|
|
if (err)
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
err = proc_put_long(&buffer, &left, bit_a, false);
|
|
|
|
if (err)
|
|
|
|
break;
|
|
|
|
if (bit_a != bit_b) {
|
|
|
|
err = proc_put_char(&buffer, &left, '-');
|
|
|
|
if (err)
|
|
|
|
break;
|
|
|
|
err = proc_put_long(&buffer, &left, bit_b, false);
|
|
|
|
if (err)
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
first = false;
bit_b++;
|
|
|
|
}
|
|
|
|
if (!err)
|
|
|
|
err = proc_put_char(&buffer, &left, '\n');
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!err) {
|
|
|
|
if (write) {
|
|
|
|
if (*ppos)
|
|
|
|
bitmap_or(bitmap, bitmap, tmp_bitmap, bitmap_len);
|
|
|
|
else
|
2012-03-28 14:42:50 -07:00
|
|
|
bitmap_copy(bitmap, tmp_bitmap, bitmap_len);
|
2010-05-05 00:26:55 +00:00
|
|
|
}
|
|
|
|
kfree(tmp_bitmap);
|
|
|
|
*lenp -= left;
|
|
|
|
*ppos += *lenp;
|
|
|
|
return 0;
|
|
|
|
} else {
|
|
|
|
kfree(tmp_bitmap);
|
|
|
|
return err;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2011-01-12 17:00:45 -08:00
|
|
|
#else /* CONFIG_PROC_SYSCTL */
|
2005-04-16 15:20:36 -07:00
|
|
|
|
2009-09-23 15:57:19 -07:00
|
|
|
int proc_dostring(struct ctl_table *table, int write,
|
2005-04-16 15:20:36 -07:00
|
|
|
void __user *buffer, size_t *lenp, loff_t *ppos)
|
|
|
|
{
|
|
|
|
return -ENOSYS;
|
|
|
|
}
|
|
|
|
|
2009-09-23 15:57:19 -07:00
|
|
|
int proc_dointvec(struct ctl_table *table, int write,
|
2005-04-16 15:20:36 -07:00
|
|
|
void __user *buffer, size_t *lenp, loff_t *ppos)
|
|
|
|
{
|
|
|
|
return -ENOSYS;
|
|
|
|
}
|
|
|
|
|
2009-09-23 15:57:19 -07:00
|
|
|
int proc_dointvec_minmax(struct ctl_table *table, int write,
|
2005-04-16 15:20:36 -07:00
|
|
|
void __user *buffer, size_t *lenp, loff_t *ppos)
|
|
|
|
{
|
|
|
|
return -ENOSYS;
|
|
|
|
}
|
|
|
|
|
2009-09-23 15:57:19 -07:00
|
|
|
int proc_dointvec_jiffies(struct ctl_table *table, int write,
|
2005-04-16 15:20:36 -07:00
|
|
|
void __user *buffer, size_t *lenp, loff_t *ppos)
|
|
|
|
{
|
|
|
|
return -ENOSYS;
|
|
|
|
}
|
|
|
|
|
2009-09-23 15:57:19 -07:00
|
|
|
int proc_dointvec_userhz_jiffies(struct ctl_table *table, int write,
|
2005-04-16 15:20:36 -07:00
|
|
|
void __user *buffer, size_t *lenp, loff_t *ppos)
|
|
|
|
{
|
|
|
|
return -ENOSYS;
|
|
|
|
}
|
|
|
|
|
2009-09-23 15:57:19 -07:00
|
|
|
int proc_dointvec_ms_jiffies(struct ctl_table *table, int write,
|
2005-04-16 15:20:36 -07:00
|
|
|
void __user *buffer, size_t *lenp, loff_t *ppos)
|
|
|
|
{
|
|
|
|
return -ENOSYS;
|
|
|
|
}
|
|
|
|
|
2009-09-23 15:57:19 -07:00
|
|
|
int proc_doulongvec_minmax(struct ctl_table *table, int write,
|
2005-04-16 15:20:36 -07:00
|
|
|
void __user *buffer, size_t *lenp, loff_t *ppos)
|
|
|
|
{
|
|
|
|
return -ENOSYS;
|
|
|
|
}
|
|
|
|
|
2007-10-18 03:05:22 -07:00
|
|
|
int proc_doulongvec_ms_jiffies_minmax(struct ctl_table *table, int write,
|
2005-04-16 15:20:36 -07:00
|
|
|
void __user *buffer,
|
|
|
|
size_t *lenp, loff_t *ppos)
|
|
|
|
{
|
|
|
|
return -ENOSYS;
|
|
|
|
}
|
|
|
|
|
|
|
|
|
2011-01-12 17:00:45 -08:00
|
|
|
#endif /* CONFIG_PROC_SYSCTL */
|
2005-04-16 15:20:36 -07:00
|
|
|
|
|
|
|
/*
|
|
|
|
* No sense putting this after each symbol definition, twice,
|
|
|
|
* exception granted :-)
|
|
|
|
*/
|
|
|
|
EXPORT_SYMBOL(proc_dointvec);
|
|
|
|
EXPORT_SYMBOL(proc_dointvec_jiffies);
|
|
|
|
EXPORT_SYMBOL(proc_dointvec_minmax);
|
|
|
|
EXPORT_SYMBOL(proc_dointvec_userhz_jiffies);
|
|
|
|
EXPORT_SYMBOL(proc_dointvec_ms_jiffies);
|
|
|
|
EXPORT_SYMBOL(proc_dostring);
|
|
|
|
EXPORT_SYMBOL(proc_doulongvec_minmax);
|
|
|
|
EXPORT_SYMBOL(proc_doulongvec_ms_jiffies_minmax);
|