Commit graph

538 commits

Author SHA1 Message Date
Greg Kroah-Hartman
cc3d2b7361 This is the 4.4.77 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAllp5y8ACgkQONu9yGCS
 aT6g8hAApzYi9TwiaF6wyYXsrp7YvOm4NyMaVBl4t7v/nFql6VsUL+qWaJKB5EL9
 o72ybYPUzbxGTVWCm/wiBO31VWea0ak0pBbyywBiowGgwAcgG/jpqZobale4Y2TE
 15jEpmpA5+3BmXpMkrv/dz4LHZ4jm65/ADhMbkPGRZqUJ3mHmyVoi50l67dpTE5+
 xWQIErycwlVMppJGnXPeFFgeD7Etch7OJ9CishQRNMb3F8H58WiQrMWWe1NfL0DO
 H2g18IBHMsxEYJqnRqxviTOMe8S96Km+lKGX0LOTRYt+2OQpfIF7buU6N+6C96rO
 7V2n2G02m2mOFVUFlDYF1RQ9IBrxHJf9iGkaZBwsaxX7XAK63ZjRxgjnEL7gMPU/
 TMCOWZ53BdZezz2eAmdhySsV+4Xt6MmJJE8rR47AgsM2Le3tgK421zmraunmA0fR
 eoJS99YHcftAHXCD3puGLafEwGVe0G4eQbY4L7mj1Y9VjaAbmmsWq9rlNOQMZRgH
 JTNyYik1C7yGPJX1iKi9hLAKldzBwPuM3GfZMZQIOjA4t2VtSon7in5iKrihRg3N
 BSKXr6+orNw32tsqcC4kpLPbFUFb6zx3EKELwSJwD9ICN7swJEk7gFw7w/F/SOxI
 C1W4Ulm6EcYTWHDePERQ4zHlllHAmyJup61d9HnwA6HhPOLaff4=
 =oeNk
 -----END PGP SIGNATURE-----

Merge 4.4.77 into android-4.4

Changes in 4.4.77
	fs: add a VALID_OPEN_FLAGS
	fs: completely ignore unknown open flags
	driver core: platform: fix race condition with driver_override
	bgmac: reset & enable Ethernet core before using it
	mm: fix classzone_idx underflow in shrink_zones()
	tracing/kprobes: Allow to create probe with a module name starting with a digit
	drm/virtio: don't leak bo on drm_gem_object_init failure
	usb: dwc3: replace %p with %pK
	USB: serial: cp210x: add ID for CEL EM3588 USB ZigBee stick
	Add USB quirk for HVR-950q to avoid intermittent device resets
	usb: usbip: set buffer pointers to NULL after free
	usb: Fix typo in the definition of Endpoint[out]Request
	mac80211_hwsim: Replace bogus hrtimer clockid
	sysctl: don't print negative flag for proc_douintvec
	sysctl: report EINVAL if value is larger than UINT_MAX for proc_douintvec
	pinctrl: sh-pfc: r8a7791: Fix SCIF2 pinmux data
	pinctrl: meson: meson8b: fix the NAND DQS pins
	pinctrl: sunxi: Fix SPDIF function name for A83T
	pinctrl: mxs: atomically switch mux and drive strength config
	pinctrl: sh-pfc: Update info pointer after SoC-specific init
	USB: serial: option: add two Longcheer device ids
	USB: serial: qcserial: new Sierra Wireless EM7305 device ID
	gfs2: Fix glock rhashtable rcu bug
	x86/tools: Fix gcc-7 warning in relocs.c
	x86/uaccess: Optimize copy_user_enhanced_fast_string() for short strings
	ath10k: override CE5 config for QCA9377
	KEYS: Fix an error code in request_master_key()
	RDMA/uverbs: Check port number supplied by user verbs cmds
	mqueue: fix a use-after-free in sys_mq_notify()
	tools include: Add a __fallthrough statement
	tools string: Use __fallthrough in perf_atoll()
	tools strfilter: Use __fallthrough
	perf top: Use __fallthrough
	perf intel-pt: Use __fallthrough
	perf thread_map: Correctly size buffer used with dirent->dt_name
	perf scripting perl: Fix compile error with some perl5 versions
	perf tests: Avoid possible truncation with dirent->d_name + snprintf
	perf bench numa: Avoid possible truncation when using snprintf()
	perf tools: Use readdir() instead of deprecated readdir_r()
	perf thread_map: Use readdir() instead of deprecated readdir_r()
	perf script: Use readdir() instead of deprecated readdir_r()
	perf tools: Remove duplicate const qualifier
	perf annotate browser: Fix behaviour of Shift-Tab with nothing focussed
	perf pmu: Fix misleadingly indented assignment (whitespace)
	perf dwarf: Guard !x86_64 definitions under #ifdef else clause
	perf trace: Do not process PERF_RECORD_LOST twice
	perf tests: Remove wrong semicolon in while loop in CQM test
	perf tools: Use readdir() instead of deprecated readdir_r() again
	md: fix incorrect use of lexx_to_cpu in does_sb_need_changing
	md: fix super_offset endianness in super_1_rdev_size_change
	tcp: fix tcp_mark_head_lost to check skb len before fragmenting
	staging: vt6556: vnt_start Fix missing call to vnt_key_init_table.
	staging: comedi: fix clean-up of comedi_class in comedi_init()
	ext4: check return value of kstrtoull correctly in reserved_clusters_store
	x86/mm/pat: Don't report PAT on CPUs that don't support it
	saa7134: fix warm Medion 7134 EEPROM read
	Linux 4.4.77

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2017-07-15 13:29:08 +02:00
Liping Zhang
a2148222e3 sysctl: report EINVAL if value is larger than UINT_MAX for proc_douintvec
commit 425fffd886bae3d127a08fa6a17f2e31e24ed7ff upstream.

Currently, inputting the following command will succeed but actually the
value will be truncated:

  # echo 0x12ffffffff > /proc/sys/net/ipv4/tcp_notsent_lowat

This is not friendly to the user, so instead, we should report error
when the value is larger than UINT_MAX.

Fixes: e7d316a02f68 ("sysctl: handle error writing UINT_MAX to u32 fields")
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Cc: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-15 11:57:46 +02:00
Liping Zhang
e8505e6432 sysctl: don't print negative flag for proc_douintvec
commit 5380e5644afbba9e3d229c36771134976f05c91e upstream.

I saw some very confusing sysctl output on my system:
  # cat /proc/sys/net/core/xfrm_aevent_rseqth
  -2
  # cat /proc/sys/net/core/xfrm_aevent_etime
  -10
  # cat /proc/sys/net/ipv4/tcp_notsent_lowat
  -4294967295

Because we forget to set the *negp flag in proc_douintvec, so it will
become a garbage value.

Since the value related to proc_douintvec is always an unsigned integer,
so we can set *negp to false explictily to fix this issue.

Fixes: e7d316a02f68 ("sysctl: handle error writing UINT_MAX to u32 fields")
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Cc: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-15 11:57:46 +02:00
Greg Kroah-Hartman
64a73ff728 This is the 4.4.76 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAllc3f0ACgkQONu9yGCS
 aT4fmA/+OHeYbhpaMRKqrUpsxB3NpROr2Z47ow6vaVjYZzd0irrODLlfIfDQ6EEo
 N3v28povu16VeYXk+4h8bsAP2K2j6/BlRaSi2hB6dmnY8GDMaXEfRojPYAlzVz50
 qnK/6152siDDarUx1h5Zc8GcmX/tEl6h3bOOxDcwLR+RvyIcWxenuR+uqRM/AV6o
 BPEiOuMu7P6LjID7KYgBTFNajVBMLrDXt4SCWdzOZmlNt0QXgKB9yw68vTcc+edC
 ZcXqa0M6nEWSDvwobbwBZhFL8H2dJjzweyjeFBgxnxgmOrRh6kvZG2wsz2c8O3/P
 g8TuMxU7siu+I3lFwKy+dgZ/1REz+6Q3oFBqXsuddrcPYu23rV6mz/GxqWy4cerb
 M4eTWz6L9vA2GoYpvBaWi0tKC9tkNM49g48Y24a6CW1O4dJWlz3RrpTiZmequbNF
 mo8EKomSXn4kYAm1xT03DGljQkK/i2JtyI5sk2hLEqqxKvZ/3q9xxLLKOVx8dPvs
 PIbfpapfYMXXMWgR6e+UKueNLgevfWE12X/OU4SgvSY4n/07/mH40XEd3zd82IsZ
 1Mw0qj3JnqCAFDBBMsDYa+OvABaGD1dHARuiv+aeqW8tqoBglFHxWqF+SQVNXLIE
 qTLiKz78vjQpH0zGpkA3HEOh/h4L7a0y3qRMECsk5SUxXsgu1gg=
 =bwNU
 -----END PGP SIGNATURE-----

Merge 4.4.76 into android-4.4

Changes in 4.4.76
	ipv6: release dst on error in ip6_dst_lookup_tail
	net: don't call strlen on non-terminated string in dev_set_alias()
	decnet: dn_rtmsg: Improve input length sanitization in dnrmg_receive_user_skb
	net: Zero ifla_vf_info in rtnl_fill_vfinfo()
	af_unix: Add sockaddr length checks before accessing sa_family in bind and connect handlers
	Fix an intermittent pr_emerg warning about lo becoming free.
	net: caif: Fix a sleep-in-atomic bug in cfpkt_create_pfx
	igmp: acquire pmc lock for ip_mc_clear_src()
	igmp: add a missing spin_lock_init()
	ipv6: fix calling in6_ifa_hold incorrectly for dad work
	net/mlx5: Wait for FW readiness before initializing command interface
	decnet: always not take dst->__refcnt when inserting dst into hash table
	net: 8021q: Fix one possible panic caused by BUG_ON in free_netdev
	sfc: provide dummy definitions of vswitch functions
	ipv6: Do not leak throw route references
	rtnetlink: add IFLA_GROUP to ifla_policy
	netfilter: xt_TCPMSS: add more sanity tests on tcph->doff
	netfilter: synproxy: fix conntrackd interaction
	NFSv4: fix a reference leak caused WARNING messages
	drm/ast: Handle configuration without P2A bridge
	mm, swap_cgroup: reschedule when neeed in swap_cgroup_swapoff()
	MIPS: Avoid accidental raw backtrace
	MIPS: pm-cps: Drop manual cache-line alignment of ready_count
	MIPS: Fix IRQ tracing & lockdep when rescheduling
	ALSA: hda - Fix endless loop of codec configure
	ALSA: hda - set input_path bitmap to zero after moving it to new place
	drm/vmwgfx: Free hash table allocated by cmdbuf managed res mgr
	usb: gadget: f_fs: Fix possibe deadlock
	sysctl: enable strict writes
	block: fix module reference leak on put_disk() call for cgroups throttle
	mm: numa: avoid waiting on freed migrated pages
	KVM: x86: fix fixing of hypercalls
	scsi: sd: Fix wrong DPOFUA disable in sd_read_cache_type
	scsi: lpfc: Set elsiocb contexts to NULL after freeing it
	qla2xxx: Fix erroneous invalid handle message
	ARM: dts: BCM5301X: Correct GIC_PPI interrupt flags
	net: mvneta: Fix for_each_present_cpu usage
	MIPS: ath79: fix regression in PCI window initialization
	net: korina: Fix NAPI versus resources freeing
	MIPS: ralink: MT7688 pinmux fixes
	MIPS: ralink: fix USB frequency scaling
	MIPS: ralink: Fix invalid assignment of SoC type
	MIPS: ralink: fix MT7628 pinmux typos
	MIPS: ralink: fix MT7628 wled_an pinmux gpio
	mtd: bcm47xxpart: limit scanned flash area on BCM47XX (MIPS) only
	bgmac: fix a missing check for build_skb
	mtd: bcm47xxpart: don't fail because of bit-flips
	bgmac: Fix reversed test of build_skb() return value.
	net: bgmac: Fix SOF bit checking
	net: bgmac: Start transmit queue in bgmac_open
	net: bgmac: Remove superflous netif_carrier_on()
	powerpc/eeh: Enable IO path on permanent error
	gianfar: Do not reuse pages from emergency reserve
	Btrfs: fix truncate down when no_holes feature is enabled
	virtio_console: fix a crash in config_work_handler
	swiotlb-xen: update dev_addr after swapping pages
	xen-netfront: Fix Rx stall during network stress and OOM
	scsi: virtio_scsi: Reject commands when virtqueue is broken
	platform/x86: ideapad-laptop: handle ACPI event 1
	amd-xgbe: Check xgbe_init() return code
	net: dsa: Check return value of phy_connect_direct()
	drm/amdgpu: check ring being ready before using
	vfio/spapr: fail tce_iommu_attach_group() when iommu_data is null
	virtio_net: fix PAGE_SIZE > 64k
	vxlan: do not age static remote mac entries
	ibmveth: Add a proper check for the availability of the checksum features
	kernel/panic.c: add missing \n
	HID: i2c-hid: Add sleep between POWER ON and RESET
	scsi: lpfc: avoid double free of resource identifiers
	spi: davinci: use dma_mapping_error()
	mac80211: initialize SMPS field in HT capabilities
	x86/mpx: Use compatible types in comparison to fix sparse error
	coredump: Ensure proper size of sparse core files
	swiotlb: ensure that page-sized mappings are page-aligned
	s390/ctl_reg: make __ctl_load a full memory barrier
	be2net: fix status check in be_cmd_pmac_add()
	perf probe: Fix to show correct locations for events on modules
	net/mlx4_core: Eliminate warning messages for SRQ_LIMIT under SRIOV
	sctp: check af before verify address in sctp_addr_id2transport
	ravb: Fix use-after-free on `ifconfig eth0 down`
	jump label: fix passing kbuild_cflags when checking for asm goto support
	xfrm: fix stack access out of bounds with CONFIG_XFRM_SUB_POLICY
	xfrm: NULL dereference on allocation failure
	xfrm: Oops on error in pfkey_msg2xfrm_state()
	watchdog: bcm281xx: Fix use of uninitialized spinlock.
	sched/loadavg: Avoid loadavg spikes caused by delayed NO_HZ accounting
	ARM64/ACPI: Fix BAD_MADT_GICC_ENTRY() macro implementation
	ARM: 8685/1: ensure memblock-limit is pmd-aligned
	x86/mpx: Correctly report do_mpx_bt_fault() failures to user-space
	x86/mm: Fix flush_tlb_page() on Xen
	ocfs2: o2hb: revert hb threshold to keep compatible
	iommu/vt-d: Don't over-free page table directories
	iommu: Handle default domain attach failure
	iommu/amd: Fix incorrect error handling in amd_iommu_bind_pasid()
	cpufreq: s3c2416: double free on driver init error path
	KVM: x86: fix emulation of RSM and IRET instructions
	KVM: x86/vPMU: fix undefined shift in intel_pmu_refresh()
	KVM: x86: zero base3 of unusable segments
	KVM: nVMX: Fix exception injection
	Linux 4.4.76

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2017-07-05 16:16:58 +02:00
Kees Cook
2449a71eb9 sysctl: enable strict writes
commit 41662f5cc55335807d39404371cfcbb1909304c4 upstream.

SYSCTL_WRITES_WARN was added in commit f4aacea2f5 ("sysctl: allow for
strict write position handling"), and released in v3.16 in August of
2014.  Since then I can find only 1 instance of non-zero offset
writing[1], and it was fixed immediately in CRIU[2].  As such, it
appears safe to flip this to the strict state now.

[1] https://www.google.com/search?q="when%20file%20position%20was%20not%200"
[2] http://lists.openvz.org/pipermail/criu/2015-April/019819.html

Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-05 14:37:16 +02:00
Dietmar Eggemann
242695407a sched: Remove sysctl_sched_is_big_little
With the new wakeup approach this sysctl is not necessary any more.

Change-Id: I52114b3c918791f6a4f9f30f50002919ccbc1a9c
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
(cherry picked from commit 885c0d503bcdf0ef4e9b46822496f16b20aa3bbd)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2017-06-02 08:01:53 -07:00
Greg Kroah-Hartman
e4528dd775 This is the 4.4.65 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlkFXvsACgkQONu9yGCS
 aT6kPg//QqrRCxSUBYahQ1Jp16AVLiEWjJ3umzBhGGSPn7FfsWF8951R1WBHGlFI
 lEUa3Pfi0U1sh0q4v6pTmQ/AYoa67DcKorxQegH9JoaRp0IvWpSaGMSfbmKP5pDl
 PQyRL6DmOFkf/6X0dvby5ybbt2Kp59zTm8RFeFLRo3LTUK30w/tBTVvouk+UW3KA
 KtjeL70OSOHgWoHXhNWDX1JTTBGFFTI2x0jlFeUtq10t2kRxAMDZpB/IY0VJ3ZTe
 iso6+hC8JyzsXUYP82ZfZ7BAv/hSWBV3ErHyrUmhqWfE/Px7PFEeo3OyG3Bqosu6
 aZW78jwFoqZcAhkVTQepWMHonUT+XLHUgCzc2MqFR4HW6JoQhKDdIqlt1Lqp6y1O
 XsYOrPU1WqHhyoO9E3YwmAIjlYBHxYSUiCnqI9WtvvExJUhXXk/wwzgXUFrZPD01
 berofViH2LJAxde0sqpidpNRg98m+MAK47M03I/tZUUykjGDi8NPTvM4FBbNCEty
 3qaVVCUm7o8YzZg54QF61O+ciceoQdnsQJVy94EV3n2pgdN/7pG0v1KikBRKfsPK
 1Wp+l0tdLkms56ElXyt/lHtF5Pre5i4sE6SdnZa3RHTUV168PFVYqJUCqWRwCD50
 QMs+yLvRHwCFst+ix29Xn+c7KYKcMyqPvCrI8oczfokV/tvMVd8=
 =1GiA
 -----END PGP SIGNATURE-----

Merge 4.4.65 into android-4.4

Changes in 4.4.65:
	tipc: make sure IPv6 header fits in skb headroom
	tipc: make dist queue pernet
	tipc: re-enable compensation for socket receive buffer double counting
	tipc: correct error in node fsm
	tty: nozomi: avoid a harmless gcc warning
	hostap: avoid uninitialized variable use in hfa384x_get_rid
	gfs2: avoid uninitialized variable warning
	tipc: fix random link resets while adding a second bearer
	tipc: fix socket timer deadlock
	mnt: Add a per mount namespace limit on the number of mounts
	xc2028: avoid use after free
	netfilter: nfnetlink: correctly validate length of batch messages
	tipc: check minimum bearer MTU
	vfio/pci: Fix integer overflows, bitmask check
	staging/android/ion : fix a race condition in the ion driver
	ping: implement proper locking
	perf/core: Fix concurrent sys_perf_event_open() vs. 'move_group' race
	Linux 4.4.65

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2017-04-30 07:30:52 +02:00
Eric W. Biederman
c50fd34e10 mnt: Add a per mount namespace limit on the number of mounts
commit d29216842a85c7970c536108e093963f02714498 upstream.

CAI Qian <caiqian@redhat.com> pointed out that the semantics
of shared subtrees make it possible to create an exponentially
increasing number of mounts in a mount namespace.

    mkdir /tmp/1 /tmp/2
    mount --make-rshared /
    for i in $(seq 1 20) ; do mount --bind /tmp/1 /tmp/2 ; done

Will create create 2^20 or 1048576 mounts, which is a practical problem
as some people have managed to hit this by accident.

As such CVE-2016-6213 was assigned.

Ian Kent <raven@themaw.net> described the situation for autofs users
as follows:

> The number of mounts for direct mount maps is usually not very large because of
> the way they are implemented, large direct mount maps can have performance
> problems. There can be anywhere from a few (likely case a few hundred) to less
> than 10000, plus mounts that have been triggered and not yet expired.
>
> Indirect mounts have one autofs mount at the root plus the number of mounts that
> have been triggered and not yet expired.
>
> The number of autofs indirect map entries can range from a few to the common
> case of several thousand and in rare cases up to between 30000 and 50000. I've
> not heard of people with maps larger than 50000 entries.
>
> The larger the number of map entries the greater the possibility for a large
> number of active mounts so it's not hard to expect cases of a 1000 or somewhat
> more active mounts.

So I am setting the default number of mounts allowed per mount
namespace at 100,000.  This is more than enough for any use case I
know of, but small enough to quickly stop an exponential increase
in mounts.  Which should be perfect to catch misconfigurations and
malfunctioning programs.

For anyone who needs a higher limit this can be changed by writing
to the new /proc/sys/fs/mount-max sysctl.

Tested-by: CAI Qian <caiqian@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
[bwh: Backported to 4.4: adjust context]
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-04-30 05:49:28 +02:00
Dmitry Shmidt
c8da41f0dc This is the 4.4.46 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAliRjsUACgkQONu9yGCS
 aT53lxAAzSDKqsc4eiGDuRW6A+hWUveObsOsAldQz8PLIhIEh/NiQPNsYisuWA+3
 K5nOwSU4E7LiFYemN/9wGGctq5aPtC7nWyDe43Xdeek0B3keu/KSqOhbOBywuAGr
 pdfEPcyIzgbzgIrygn5g8RUNCcgNkw5IYgmaiz2RCqNnjeQItAQQdm37svz/XDGt
 D6i0MaOHEkGfk3Z3ty4PJWSa/+Gd61M4OYTWiWWCY9gf1CCjQA59RlGhynm6bj+C
 Zq58tsnEzL8bOz9PDHoXKyrLc43mLhn7Q4Nhxs+rT00Yl7yxE/aFm9vd0EMIP3/V
 HUVobz4RmVyopvJPKqcPlD063BAYEm9BpxxXhc3tJNP3AaCjJodvKGkYKFKWPfG5
 6h0agCRCfeCS81y8JKeccw0tVBsD8ChuzO95tSHNulCWk/pN/7OD52b0jiuCMTFs
 lato8AKn2Ygtip/MWkK43yiuPd4qYjtFBSKx6x5we+Vj1HZz2DVjM8prkIqkfzM9
 FpgoyLOPAKXKbECzyURngLsXCckneS96ErxL/a/L/0/2u8Zv4RR9y9jcE38eAhZv
 IGhq8txZXnHD1Qyd6u0nbTeaLDswN06P1stsIUX8fR3qUa5QnlrhG9LzY9wihxlk
 oloLWuhIsUussgO07xgD/9NM8V7goN/DkQ+MCTShXxsNqb87WY0=
 =pPNg
 -----END PGP SIGNATURE-----

Merge tag 'v4.4.46' into android-4.4.y

This is the 4.4.46 stable release
2017-02-01 12:55:09 -08:00
Eric Dumazet
d1b232c2ce sysctl: fix proc_doulongvec_ms_jiffies_minmax()
commit ff9f8a7cf935468a94d9927c68b00daae701667e upstream.

We perform the conversion between kernel jiffies and ms only when
exporting kernel value to user space.

We need to do the opposite operation when value is written by user.

Only matters when HZ != 1000

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-02-01 08:30:52 +01:00
Dmitry Shmidt
14de94f03d This is the 4.4.24 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCAAGBQJX96H2AAoJEDjbvchgkmk+MqcQAJuhiwLmCsXRKXGujGByPi5P
 vk+mnkt8o2UpamvT4KVRnWQrJuN8EHDHg29esGXKHV9Ahdmw/UfnXvbz3P0auet9
 GvMi4rKZpL3vD/dcMxshQchRKF7SwUbNNMkyJv1WYCjLex7W1LU/NOQV5VLx21i6
 9E/R/ARrazGhrGqanfm4NhIZYOR9QAWCrsc8pbJiE2OSty1HbQCLapA8gWg2PSnz
 sTlH0BF9tJ2kKSnyYjXM1Xb1zHbZuj83qEELhSnXsGK71Sq/8jIH9a5SiUwSDtdt
 szGp+vLODPqIMYa01qyLtFA2tvkusvKDUps8vtZ5mp9t38u2R8TDA+CAbz6w19mb
 C0d9abvZ9l1pIe/96OdgZkdSmGG2DC5hxnk3eaxhsyHn6RkIXfB9igH5+Fk7r/nm
 Yq15xxOIu6DCcuoQesUcAHoIR2961kbo/ZnnUEy2hRsqvR3/21X7qW3oj68uTdnB
 QQtMc1jq32toaZFk21ojLDtxKAlVqVHuslQ0hsMMgKtADZAveWpqZj408aNlPOi8
 CFHoEAxYXCQItOhRCoQeC1mljahvhEBI9N+5Zbpf30q5imLKen9hQphytbKABEWN
 CVJ6h6YndrdnlN7cS/AQ62+SNDk4kLmeMomgXfB701WTJ1cvI6eW4q6WUSS+54DZ
 q+brnDATt0K3nUmsrpGM
 =JaSn
 -----END PGP SIGNATURE-----

Merge tag 'v4.4.24' into android-4.4.y

This is the 4.4.24 stable release
2016-10-14 13:34:43 -07:00
Subash Abhinov Kasiviswanathan
70cd763eb1 sysctl: handle error writing UINT_MAX to u32 fields
commit e7d316a02f683864a12389f8808570e37fb90aa3 upstream.

We have scripts which write to certain fields on 3.18 kernels but this
seems to be failing on 4.4 kernels.  An entry which we write to here is
xfrm_aevent_rseqth which is u32.

  echo 4294967295  > /proc/sys/net/core/xfrm_aevent_rseqth

Commit 230633d109 ("kernel/sysctl.c: detect overflows when converting
to int") prevented writing to sysctl entries when integer overflow
occurs.  However, this does not apply to unsigned integers.

Heinrich suggested that we introduce a new option to handle 64 bit
limits and set min as 0 and max as UINT_MAX.  This might not work as it
leads to issues similar to __do_proc_doulongvec_minmax.  Alternatively,
we would need to change the datatype of the entry to 64 bit.

  static int __do_proc_doulongvec_minmax(void *data, struct ctl_table
  {
      i = (unsigned long *) data;   //This cast is causing to read beyond the size of data (u32)
      vleft = table->maxlen / sizeof(unsigned long); //vleft is 0 because maxlen is sizeof(u32) which is lesser than sizeof(unsigned long) on x86_64.

Introduce a new proc handler proc_douintvec.  Individual proc entries
will need to be updated to use the new handler.

[akpm@linux-foundation.org: coding-style fixes]
Fixes: 230633d109 ("kernel/sysctl.c:detect overflows when converting to int")
Link: http://lkml.kernel.org/r/1471479806-5252-1-git-send-email-subashab@codeaurora.org
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-10-07 15:23:46 +02:00
Dmitry Shmidt
441e10ac4c Merge remote-tracking branch 'common/android-4.4' into android-4.4.y 2016-09-13 14:47:50 -07:00
Dmitry Shmidt
341e02d8bb Merge remote-tracking branch 'linaro-ext/EAS/v4.4-easv5.2+aosp-changes' into android-4.4
Change-Id: Ic24b43ee867bc4f70b31bedaad734717b64b86a1
2016-09-08 17:07:42 -07:00
Srinath Sridharan
519c62750e sched/walt: Accounting for number of irqs pending on each core
Schedules on a core whose irq count is less than a threshold.
Improves I/O performance of EAS.

Change-Id: I08ff7dd0d22502a0106fc636b1af2e6fe9e758b5
2016-08-11 14:26:43 -07:00
Srivatsa Vaddagiri
efb86bd08a sched: Introduce Window Assisted Load Tracking (WALT)
use a window based view of time in order to track task
demand and CPU utilization in the scheduler.

Window Assisted Load Tracking (WALT) implementation credits:
 Srivatsa Vaddagiri, Steve Muckle, Syed Rameez Mustafa, Joonwoo Park,
 Pavan Kumar Kondeti, Olav Haugan

2016-03-06: Integration with EAS/refactoring by Vikram Mulukutla
            and Todd Kjos

Change-Id: I21408236836625d4e7d7de1843d20ed5ff36c708

Includes fixes for issues:

eas/walt: Use walt_ktime_clock() instead of ktime_get_ns() to avoid a
race resulting in watchdog resets
BUG: 29353986
Change-Id: Ic1820e22a136f7c7ebd6f42e15f14d470f6bbbdb

Handle walt accounting anomoly during resume

During resume, there is a corner case where on wakeup, a task's
prev_runnable_sum can go negative. This is a workaround that
fixes the condition and warns (instead of crashing).

BUG: 29464099
Change-Id: I173e7874324b31a3584435530281708145773508

Signed-off-by: Todd Kjos <tkjos@google.com>
Signed-off-by: Srinath Sridharan <srinathsr@google.com>
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
[jstultz: fwdported to 4.4]
Signed-off-by: John Stultz <john.stultz@linaro.org>
2016-08-11 14:26:43 -07:00
Todd Kjos
c50cc2299c sched/fair: add tunable to set initial task load
The choice of initial task load upon fork has a large influence
on CPU and OPP selection when scheduler-driven DVFS is in use.
Make this tuneable by adding a new sysctl "sched_initial_task_util".

If the sched governor is not used, the default remains at SCHED_LOAD_SCALE
Otherwise, the value from the sysctl is used. This defaults to 0.

Signed-off-by: "Todd Kjos <tkjos@google.com>"
2016-08-10 15:15:55 -07:00
Juri Lelli
4a5e890ec6 sched/fair: add tunable to force selection at cpu granularity
EAS assumes that clusters with smaller capacity cores are more
energy-efficient. This may not be true on non-big-little devices,
so EAS can make incorrect cluster selections when finding a CPU
to wake. The "sched_is_big_little" hint can be used to cause a
cpu-based selection instead of cluster-based selection.

This change incorporates the addition of the sync hint enable patch

EAS did not honour synchronous wakeup hints, a new sysctl is
created to ask EAS to use this information when selecting a CPU.
The control is called "sched_sync_hint_enable".

Also contains:

EAS: sched/fair: for SMP bias toward idle core with capacity

For SMP devices, on wakeup bias towards idle cores that have capacity
vs busy devices that need a higher OPP

eas: favor idle cpus for boosted tasks

BUG: 29533997
BUG: 29512132
Change-Id: I0cc9a1b1b88fb52916f18bf2d25715bdc3634f9c
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Srinath Sridharan <srinathsr@google.com>

eas/sched/fair: Favoring busy cpus with low OPPs

BUG: 29533997
BUG: 29512132
Change-Id: I9305b3239698d64278db715a2e277ea0bb4ece79

Signed-off-by: Juri Lelli <juri.lelli@arm.com>
2016-08-10 15:15:46 -07:00
Srinath Sridharan
2e9abbc942 sched: EAS: take cstate into account when selecting idle core
Introduce a new sysctl for this option, 'sched_cstate_aware'.
When this is enabled, select_idle_sibling in CFS is modified to
choose the idle CPU in the sibling group which has the lowest
idle state index - idle state indexes are assumed to increase
as sleep depth and hence wakeup latency increase. In this way,
we attempt to minimise wakeup latency when an idle CPU is
required.

Signed-off-by: Srinath Sridharan <srinathsr@google.com>

Includes:
sched: EAS: fix select_idle_sibling

when sysctl_sched_cstate_aware is enabled, best_idle cpu will not be chosen
in the original flow because it will goto done directly

Bug: 30107557
Change-Id: Ie09c2e3960cafbb976f8d472747faefab3b4d6ac
Signed-off-by: martin_liu <martin_liu@htc.com>
2016-08-10 15:15:39 -07:00
Dmitry Shmidt
b558f17a13 This is the 4.4.16 stable release
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJXmOXmAAoJEDjbvchgkmk+QYIP/1S8oBZsvjfDzvH8t63HyLeH
 i43MFlYoFAqUIZc002XpluSvZ8uHoG+r7R8Hq3wmv48wxe3M6OBnMdBVTht6mPw+
 t5OLTZr40lWaJm2EIi4aekueMIrCgmL+Et+IFYv7ZVBuYLteVcfny+zdq4EqGmgj
 /a19+L/sTTr4SHtJIhHxWhiVJ9fVMgQk/N3VgQmIiNF2+lVbiFI7QQiDPLbFl0KK
 CM4ETO22HxHCYilGpzhpSMsHCxv12VqNaXNLAsPAepGGW7PqvUmrEWAqgwsbOfRc
 GxTLNk0dUgJqMrfEpQ8ZOMlgzvCAYG2jZuNSuT+nuzrWSUP+WOGRi9TTTxp1CYuZ
 PHlhNTH7ZnqosxJUUZS2d9N5ygpqD48Rhlfl824YzOWCy94VeUnedkVLb20uJwPF
 Y5aQ5WjktBC9why5e4OgGQERvx/U9KTk8E1zRfZZPc2oft9My0YxuemjjKAKZiYN
 ne4WhXbgOJTQkAoZwh2xqny3bWyEaoSrWpQ3R7bBJ9SIRLEOdCKzKpduDbAnbMP7
 QWgQOQC/6qA1mKqjrqF4KPA1Quo9PcUK2Ajh523ewMGCowgY90vyejAgh4Q8g0GC
 fKlx+jJDoKVDbQ8v4hc9PPHMsNNIKT9a1ptwVS3lE+bq1D5Ffm57A4/uOTMYHVab
 gKqu8h1CA0MCVBsH3nNA
 =nY8S
 -----END PGP SIGNATURE-----

Merge tag 'v4.4.16' into android-4.4.y

This is the 4.4.16 stable release

Change-Id: Ibaf7b7e03695e1acebc654a2ca1a4bfcc48fcea4
2016-08-01 15:57:55 -07:00
Willy Tarreau
fa6d0ba12a pipe: limit the per-user amount of pages allocated in pipes
commit 759c01142a5d0f364a462346168a56de28a80f52 upstream.

On no-so-small systems, it is possible for a single process to cause an
OOM condition by filling large pipes with data that are never read. A
typical process filling 4000 pipes with 1 MB of data will use 4 GB of
memory. On small systems it may be tricky to set the pipe max size to
prevent this from happening.

This patch makes it possible to enforce a per-user soft limit above
which new pipes will be limited to a single page, effectively limiting
them to 4 kB each, as well as a hard limit above which no new pipes may
be created for this user. This has the effect of protecting the system
against memory abuse without hurting other users, and still allowing
pipes to work correctly though with less data at once.

The limit are controlled by two new sysctls : pipe-user-pages-soft, and
pipe-user-pages-hard. Both may be disabled by setting them to zero. The
default soft limit allows the default number of FDs per process (1024)
to create pipes of the default size (64kB), thus reaching a limit of 64MB
before starting to create only smaller pipes. With 256 processes limited
to 1024 FDs each, this results in 1024*64kB + (256*1024 - 1024) * 4kB =
1084 MB of memory allocated for a user. The hard limit is disabled by
default to avoid breaking existing applications that make intensive use
of pipes (eg: for splicing).

Reported-by: socketpair@gmail.com
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Mitigates: CVE-2013-4312 (Linux 2.0+)
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Moritz Muehlenhoff <moritz@wikimedia.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-06-07 18:14:35 -07:00
Patrick Bellasi
13001f47c9 sched/tune: add initial support for CGroups based boosting
To support task performance boosting, the usage of a single knob has the
advantage to be a simple solution, both from the implementation and the
usability standpoint.  However, on a real system it can be difficult to
identify a single value for the knob which fits the needs of multiple
different tasks. For example, some kernel threads and/or user-space
background services should be better managed the "standard" way while we
still want to be able to boost the performance of specific workloads.

In order to improve the flexibility of the task boosting mechanism this
patch is the first of a small series which extends the previous
implementation to introduce a "per task group" support.
This first patch introduces just the basic CGroups support, a new
"schedtune" CGroups controller is added which allows to configure
different boost value for different groups of tasks.
To keep the implementation simple but still effective for a boosting
strategy, the new controller:
  1. allows only a two layer hierarchy
  2. supports only a limited number of boost groups

A two layer hierarchy allows to place each task either:
  a) in the root control group
     thus being subject to a system-wide boosting value
  b) in a child of the root group
     thus being subject to the specific boost value defined by that
     "boost group"

The limited number of "boost groups" supported is mainly motivated by
the observation that in a real system it could be useful to have only
few classes of tasks which deserve different treatment.
For example, background vs foreground or interactive vs low-priority.
As an additional benefit, a limited number of boost groups allows also
to have a simpler implementation especially for the code required to
compute the boost value for CPUs which have runnable tasks belonging to
different boost groups.

cc: Tejun Heo <tj@kernel.org>
cc: Li Zefan <lizefan@huawei.com>
cc: Johannes Weiner <hannes@cmpxchg.org>
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
2016-05-10 16:53:24 +08:00
Patrick Bellasi
344f4ec429 sched/tune: add sysctl interface to define a boost value
The current (CFS) scheduler implementation does not allow "to boost"
tasks performance by running them at a higher OPP compared to the
minimum required to meet their workload demands.

To support tasks performance boosting the scheduler should provide a
"knob" which allows to tune how much the system is going to be optimised
for energy efficiency vs performance.

This patch is the first of a series which provides a simple interface to
define a tuning knob. One system-wide "boost" tunable is exposed via:
  /proc/sys/kernel/sched_cfs_boost
which can be configured in the range [0..100], to define a percentage
where:
  - 0%   boost requires to operate in "standard" mode by scheduling
         tasks at the minimum capacities required by the workload demand
  - 100% boost requires to push at maximum the task performances,
         "regardless" of the incurred energy consumption

A boost value in between these two boundaries is used to bias the
power/performance trade-off, the higher the boost value the more the
scheduler is biased toward performance boosting instead of energy
efficiency.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
2016-05-10 16:53:24 +08:00
dcashman
d49d88766b FROMLIST: mm: mmap: Add new /proc tunable for mmap_base ASLR.
(cherry picked from commit https://lkml.org/lkml/2015/12/21/337)

ASLR  only uses as few as 8 bits to generate the random offset for the
mmap base address on 32 bit architectures. This value was chosen to
prevent a poorly chosen value from dividing the address space in such
a way as to prevent large allocations. This may not be an issue on all
platforms. Allow the specification of a minimum number of bits so that
platforms desiring greater ASLR protection may determine where to place
the trade-off.

Bug: 24047224
Signed-off-by: Daniel Cashman <dcashman@android.com>
Signed-off-by: Daniel Cashman <dcashman@google.com>
Change-Id: Ibf9ed3d4390e9686f5cc34f605d509a20d40e6c2
2016-02-16 13:54:14 -08:00
Rik van Riel
f8ade3666c add extra free kbytes tunable
Add a userspace visible knob to tell the VM to keep an extra amount
of memory free, by increasing the gap between each zone's min and
low watermarks.

This is useful for realtime applications that call system
calls and have a bound on the number of allocations that happen
in any short time period.  In this application, extra_free_kbytes
would be left at an amount equal to or larger than than the
maximum number of allocations that happen in any burst.

It may also be useful to reduce the memory use of virtual
machines (temporarily?), in a way that does not cause memory
fragmentation like ballooning does.

[ccross]
Revived for use on old kernels where no other solution exists.
The tunable will be removed on kernels that do better at avoiding
direct reclaim.

Change-Id: I765a42be8e964bfd3e2886d1ca85a29d60c3bb3e
Signed-off-by: Rik van Riel<riel@redhat.com>
Signed-off-by: Colin Cross <ccross@android.com>
2016-02-16 13:54:12 -08:00
Don Zickus
ac1f591249 kernel/watchdog.c: add sysctl knob hardlockup_panic
The only way to enable a hardlockup to panic the machine is to set
'nmi_watchdog=panic' on the kernel command line.

This makes it awkward for end users and folks who want to run automate
tests (like myself).

Mimic the softlockup_panic knob and create a /proc/sys/kernel/hardlockup_panic
knob.

Signed-off-by: Don Zickus <dzickus@redhat.com>
Cc: Ulrich Obergfell <uobergfe@redhat.com>
Acked-by: Jiri Kosina <jkosina@suse.cz>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-05 19:34:48 -08:00
Jiri Kosina
55537871ef kernel/watchdog.c: perform all-CPU backtrace in case of hard lockup
In many cases of hardlockup reports, it's actually not possible to know
why it triggered, because the CPU that got stuck is usually waiting on a
resource (with IRQs disabled) in posession of some other CPU is holding.

IOW, we are often looking at the stacktrace of the victim and not the
actual offender.

Introduce sysctl / cmdline parameter that makes it possible to have
hardlockup detector perform all-CPU backtrace.

Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Cc: Ulrich Obergfell <uobergfe@redhat.com>
Acked-by: Don Zickus <dzickus@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-05 19:34:48 -08:00
Alexei Starovoitov
1be7f75d16 bpf: enable non-root eBPF programs
In order to let unprivileged users load and execute eBPF programs
teach verifier to prevent pointer leaks.
Verifier will prevent
- any arithmetic on pointers
  (except R10+Imm which is used to compute stack addresses)
- comparison of pointers
  (except if (map_value_ptr == 0) ... )
- passing pointers to helper functions
- indirectly passing pointers in stack to helper functions
- returning pointer from bpf program
- storing pointers into ctx or maps

Spill/fill of pointers into stack is allowed, but mangling
of pointers stored in the stack or reading them byte by byte is not.

Within bpf programs the pointers do exist, since programs need to
be able to access maps, pass skb pointer to LD_ABS insns, etc
but programs cannot pass such pointer values to the outside
or obfuscate them.

Only allow BPF_PROG_TYPE_SOCKET_FILTER unprivileged programs,
so that socket filters (tcpdump), af_packet (quic acceleration)
and future kcm can use it.
tracing and tc cls/act program types still require root permissions,
since tracing actually needs to be able to see all kernel pointers
and tc is for root only.

For example, the following unprivileged socket filter program is allowed:
int bpf_prog1(struct __sk_buff *skb)
{
  u32 index = load_byte(skb, ETH_HLEN + offsetof(struct iphdr, protocol));
  u64 *value = bpf_map_lookup_elem(&my_map, &index);

  if (value)
	*value += skb->len;
  return 0;
}

but the following program is not:
int bpf_prog1(struct __sk_buff *skb)
{
  u32 index = load_byte(skb, ETH_HLEN + offsetof(struct iphdr, protocol));
  u64 *value = bpf_map_lookup_elem(&my_map, &index);

  if (value)
	*value += (u64) skb;
  return 0;
}
since it would leak the kernel address into the map.

Unprivileged socket filter bpf programs have access to the
following helper functions:
- map lookup/update/delete (but they cannot store kernel pointers into them)
- get_random (it's already exposed to unprivileged user space)
- get_smp_processor_id
- tail_call into another socket filter program
- ktime_get_ns

The feature is controlled by sysctl kernel.unprivileged_bpf_disabled.
This toggle defaults to off (0), but can be set true (1).  Once true,
bpf programs and maps cannot be accessed from unprivileged process,
and the toggle cannot be set back to false.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12 19:13:35 -07:00
Ilya Dryomov
9a5bc726d5 sysctl: fix int -> unsigned long assignments in INT_MIN case
The following

    if (val < 0)
        *lvalp = (unsigned long)-val;

is incorrect because the compiler is free to assume -val to be positive
and use a sign-extend instruction for extending the bit pattern.  This is
a problem if val == INT_MIN:

    # echo -2147483648 >/proc/sys/dev/scsi/logging_level
    # cat /proc/sys/dev/scsi/logging_level
    -18446744071562067968

Cast to unsigned long before negation - that way we first sign-extend and
then negate an unsigned, which is well defined.  With this:

    # cat /proc/sys/dev/scsi/logging_level
    -2147483648

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Cc: Mikulas Patocka <mikulas@twibright.com>
Cc: Robert Xiao <nneonneo@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-10 13:29:01 -07:00
Dave Young
2965faa5e0 kexec: split kexec_load syscall from kexec core code
There are two kexec load syscalls, kexec_load another and kexec_file_load.
 kexec_file_load has been splited as kernel/kexec_file.c.  In this patch I
split kexec_load syscall code to kernel/kexec.c.

And add a new kconfig option KEXEC_CORE, so we can disable kexec_load and
use kexec_file_load only, or vice verse.

The original requirement is from Ted Ts'o, he want kexec kernel signature
being checked with CONFIG_KEXEC_VERIFY_SIG enabled.  But kexec-tools use
kexec_load syscall can bypass the checking.

Vivek Goyal proposed to create a common kconfig option so user can compile
in only one syscall for loading kexec kernel.  KEXEC/KEXEC_FILE selects
KEXEC_CORE so that old config files still work.

Because there's general code need CONFIG_KEXEC_CORE, so I updated all the
architecture Kconfig with a new option KEXEC_CORE, and let KEXEC selects
KEXEC_CORE in arch Kconfig.  Also updated general kernel code with to
kexec_load syscall.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Dave Young <dyoung@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Petr Tesarik <ptesarik@suse.cz>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Josh Boyer <jwboyer@fedoraproject.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-10 13:29:01 -07:00
Linus Torvalds
0cbee99269 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull user namespace updates from Eric Biederman:
 "Long ago and far away when user namespaces where young it was realized
  that allowing fresh mounts of proc and sysfs with only user namespace
  permissions could violate the basic rule that only root gets to decide
  if proc or sysfs should be mounted at all.

  Some hacks were put in place to reduce the worst of the damage could
  be done, and the common sense rule was adopted that fresh mounts of
  proc and sysfs should allow no more than bind mounts of proc and
  sysfs.  Unfortunately that rule has not been fully enforced.

  There are two kinds of gaps in that enforcement.  Only filesystems
  mounted on empty directories of proc and sysfs should be ignored but
  the test for empty directories was insufficient.  So in my tree
  directories on proc, sysctl and sysfs that will always be empty are
  created specially.  Every other technique is imperfect as an ordinary
  directory can have entries added even after a readdir returns and
  shows that the directory is empty.  Special creation of directories
  for mount points makes the code in the kernel a smidge clearer about
  it's purpose.  I asked container developers from the various container
  projects to help test this and no holes were found in the set of mount
  points on proc and sysfs that are created specially.

  This set of changes also starts enforcing the mount flags of fresh
  mounts of proc and sysfs are consistent with the existing mount of
  proc and sysfs.  I expected this to be the boring part of the work but
  unfortunately unprivileged userspace winds up mounting fresh copies of
  proc and sysfs with noexec and nosuid clear when root set those flags
  on the previous mount of proc and sysfs.  So for now only the atime,
  read-only and nodev attributes which userspace happens to keep
  consistent are enforced.  Dealing with the noexec and nosuid
  attributes remains for another time.

  This set of changes also addresses an issue with how open file
  descriptors from /proc/<pid>/ns/* are displayed.  Recently readlink of
  /proc/<pid>/fd has been triggering a WARN_ON that has not been
  meaningful since it was added (as all of the code in the kernel was
  converted) and is not now actively wrong.

  There is also a short list of issues that have not been fixed yet that
  I will mention briefly.

  It is possible to rename a directory from below to above a bind mount.
  At which point any directory pointers below the renamed directory can
  be walked up to the root directory of the filesystem.  With user
  namespaces enabled a bind mount of the bind mount can be created
  allowing the user to pick a directory whose children they can rename
  to outside of the bind mount.  This is challenging to fix and doubly
  so because all obvious solutions must touch code that is in the
  performance part of pathname resolution.

  As mentioned above there is also a question of how to ensure that
  developers by accident or with purpose do not introduce exectuable
  files on sysfs and proc and in doing so introduce security regressions
  in the current userspace that will not be immediately obvious and as
  such are likely to require breaking userspace in painful ways once
  they are recognized"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  vfs: Remove incorrect debugging WARN in prepend_path
  mnt: Update fs_fully_visible to test for permanently empty directories
  sysfs: Create mountpoints with sysfs_create_mount_point
  sysfs: Add support for permanently empty directories to serve as mount points.
  kernfs: Add support for always empty directories.
  proc: Allow creating permanently empty directories that serve as mount points
  sysctl: Allow creating permanently empty directories that serve as mountpoints.
  fs: Add helper functions for permanently empty directories.
  vfs: Ignore unlocked mounts in fs_fully_visible
  mnt: Modify fs_fully_visible to deal with locked ro nodev and atime
  mnt: Refactor the logic for mounting sysfs and proc in a user namespace
2015-07-03 15:20:57 -07:00
Eric W. Biederman
f9bd6733d3 sysctl: Allow creating permanently empty directories that serve as mountpoints.
Add a magic sysctl table sysctl_mount_point that when used to
create a directory forces that directory to be permanently empty.

Update the code to use make_empty_dir_inode when accessing permanently
empty directories.

Update the code to not allow adding to permanently empty directories.

Update /proc/sys/fs/binfmt_misc to be a permanently empty directory.

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2015-07-01 10:36:39 -05:00
Chris Metcalf
fe4ba3c343 watchdog: add watchdog_cpumask sysctl to assist nohz
Change the default behavior of watchdog so it only runs on the
housekeeping cores when nohz_full is enabled at build and boot time.
Allow modifying the set of cores the watchdog is currently running on
with a new kernel.watchdog_cpumask sysctl.

In the current system, the watchdog subsystem runs a periodic timer that
schedules the watchdog kthread to run.  However, nohz_full cores are
designed to allow userspace application code running on those cores to
have 100% access to the CPU.  So the watchdog system prevents the
nohz_full application code from being able to run the way it wants to,
thus the motivation to suppress the watchdog on nohz_full cores, which
this patchset provides by default.

However, if we disable the watchdog globally, then the housekeeping
cores can't benefit from the watchdog functionality.  So we allow
disabling it only on some cores.  See Documentation/lockup-watchdogs.txt
for more information.

[jhubbard@nvidia.com: fix a watchdog crash in some configurations]
Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
Acked-by: Don Zickus <dzickus@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Ulrich Obergfell <uobergfe@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-24 17:49:40 -07:00
Thomas Gleixner
bc7a34b8b9 timer: Reduce timer migration overhead if disabled
Eric reported that the timer_migration sysctl is not really nice
performance wise as it needs to check at every timer insertion whether
the feature is enabled or not. Further the check does not live in the
timer code, so we have an extra function call which checks an extra
cache line to figure out that it is disabled.

We can do better and store that information in the per cpu (hr)timer
bases. I pondered to use a static key, but that's a nightmare to
update from the nohz code and the timer base cache line is hot anyway
when we select a timer base.

The old logic enabled the timer migration unconditionally if
CONFIG_NO_HZ was set even if nohz was disabled on the kernel command
line.

With this modification, we start off with migration disabled. The user
visible sysctl is still set to enabled. If the kernel switches to NOHZ
migration is enabled, if the user did not disable it via the sysctl
prior to the switch. If nohz=off is on the kernel command line,
migration stays disabled no matter what.

Before:
  47.76%  hog       [.] main
  14.84%  [kernel]  [k] _raw_spin_lock_irqsave
   9.55%  [kernel]  [k] _raw_spin_unlock_irqrestore
   6.71%  [kernel]  [k] mod_timer
   6.24%  [kernel]  [k] lock_timer_base.isra.38
   3.76%  [kernel]  [k] detach_if_pending
   3.71%  [kernel]  [k] del_timer
   2.50%  [kernel]  [k] internal_add_timer
   1.51%  [kernel]  [k] get_nohz_timer_target
   1.28%  [kernel]  [k] __internal_add_timer
   0.78%  [kernel]  [k] timerfn
   0.48%  [kernel]  [k] wake_up_nohz_cpu

After:
  48.10%  hog       [.] main
  15.25%  [kernel]  [k] _raw_spin_lock_irqsave
   9.76%  [kernel]  [k] _raw_spin_unlock_irqrestore
   6.50%  [kernel]  [k] mod_timer
   6.44%  [kernel]  [k] lock_timer_base.isra.38
   3.87%  [kernel]  [k] detach_if_pending
   3.80%  [kernel]  [k] del_timer
   2.67%  [kernel]  [k] internal_add_timer
   1.33%  [kernel]  [k] __internal_add_timer
   0.73%  [kernel]  [k] timerfn
   0.54%  [kernel]  [k] wake_up_nohz_cpu


Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Joonwoo Park <joonwoop@codeaurora.org>
Cc: Wenbo Wang <wenbo.wang@memblaze.com>
Link: http://lkml.kernel.org/r/20150526224512.127050787@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-06-19 15:18:28 +02:00
Heinrich Schuchardt
230633d109 kernel/sysctl.c: detect overflows when converting to int
When converting unsigned long to int overflows may occur.  These currently
are not detected when writing to the sysctl file system.

E.g. on a system where int has 32 bits and long has 64 bits
  echo 0x800001234 > /proc/sys/kernel/threads-max
has the same effect as
  echo 0x1234 > /proc/sys/kernel/threads-max

The patch adds the missing check in do_proc_dointvec_conv.

With the patch an overflow will result in an error EINVAL when writing to
the the sysctl file system.

Signed-off-by: Heinrich Schuchardt <xypron.glpk@gmx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-17 09:04:08 -04:00
Heinrich Schuchardt
16db3d3f11 kernel/sysctl.c: threads-max observe limits
Users can change the maximum number of threads by writing to
/proc/sys/kernel/threads-max.

With the patch the value entered is checked against the same limits that
apply when fork_init is called.

Signed-off-by: Heinrich Schuchardt <xypron.glpk@gmx.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-17 09:04:07 -04:00
Eric B Munson
5bbe3547aa mm: allow compaction of unevictable pages
Currently, pages which are marked as unevictable are protected from
compaction, but not from other types of migration.  The POSIX real time
extension explicitly states that mlock() will prevent a major page
fault, but the spirit of this is that mlock() should give a process the
ability to control sources of latency, including minor page faults.
However, the mlock manpage only explicitly says that a locked page will
not be written to swap and this can cause some confusion.  The
compaction code today does not give a developer who wants to avoid swap
but wants to have large contiguous areas available any method to achieve
this state.  This patch introduces a sysctl for controlling compaction
behavior with respect to the unevictable lru.  Users who demand no page
faults after a page is present can set compact_unevictable_allowed to 0
and users who need the large contiguous areas can enable compaction on
locked memory by leaving the default value of 1.

To illustrate this problem I wrote a quick test program that mmaps a
large number of 1MB files filled with random data.  These maps are
created locked and read only.  Then every other mmap is unmapped and I
attempt to allocate huge pages to the static huge page pool.  When the
compact_unevictable_allowed sysctl is 0, I cannot allocate hugepages
after fragmenting memory.  When the value is set to 1, allocations
succeed.

Signed-off-by: Eric B Munson <emunson@akamai.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-15 16:35:17 -07:00
Linus Torvalds
1dcf58d6e6 Merge branch 'akpm' (patches from Andrew)
Merge first patchbomb from Andrew Morton:

 - arch/sh updates

 - ocfs2 updates

 - kernel/watchdog feature

 - about half of mm/

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (122 commits)
  Documentation: update arch list in the 'memtest' entry
  Kconfig: memtest: update number of test patterns up to 17
  arm: add support for memtest
  arm64: add support for memtest
  memtest: use phys_addr_t for physical addresses
  mm: move memtest under mm
  mm, hugetlb: abort __get_user_pages if current has been oom killed
  mm, mempool: do not allow atomic resizing
  memcg: print cgroup information when system panics due to panic_on_oom
  mm: numa: remove migrate_ratelimited
  mm: fold arch_randomize_brk into ARCH_HAS_ELF_RANDOMIZE
  mm: split ET_DYN ASLR from mmap ASLR
  s390: redefine randomize_et_dyn for ELF_ET_DYN_BASE
  mm: expose arch_mmap_rnd when available
  s390: standardize mmap_rnd() usage
  powerpc: standardize mmap_rnd() usage
  mips: extract logic for mmap_rnd()
  arm64: standardize mmap_rnd() usage
  x86: standardize mmap_rnd() usage
  arm: factor out mmap ASLR into mmap_rnd
  ...
2015-04-14 16:49:17 -07:00
Ulrich Obergfell
195daf665a watchdog: enable the new user interface of the watchdog mechanism
With the current user interface of the watchdog mechanism it is only
possible to disable or enable both lockup detectors at the same time.
This series introduces new kernel parameters and changes the semantics of
some existing kernel parameters, so that the hard lockup detector and the
soft lockup detector can be disabled or enabled individually.  With this
series applied, the user interface is as follows.

- parameters in /proc/sys/kernel

  . soft_watchdog
    This is a new parameter to control and examine the run state of
    the soft lockup detector.

  . nmi_watchdog
    The semantics of this parameter have changed. It can now be used
    to control and examine the run state of the hard lockup detector.

  . watchdog
    This parameter is still available to control the run state of both
    lockup detectors at the same time. If this parameter is examined,
    it shows the logical OR of soft_watchdog and nmi_watchdog.

  . watchdog_thresh
    The semantics of this parameter are not affected by the patch.

- kernel command line parameters

  . nosoftlockup
    The semantics of this parameter have changed. It can now be used
    to disable the soft lockup detector at boot time.

  . nmi_watchdog=0 or nmi_watchdog=1
    Disable or enable the hard lockup detector at boot time. The patch
    introduces '=1' as a new option.

  . nowatchdog
    The semantics of this parameter are not affected by the patch. It
    is still available to disable both lockup detectors at boot time.

Also, remove the proc_dowatchdog() function which is no longer needed.

[dzickus@redhat.com: wrote changelog]
[dzickus@redhat.com: update documentation for kernel params and sysctl]
Signed-off-by: Ulrich Obergfell <uobergfe@redhat.com>
Signed-off-by: Don Zickus <dzickus@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:48:59 -07:00
Linus Torvalds
ca2ec32658 Merge branch 'for-linus-1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs update from Al Viro:
 "Part one:

   - struct filename-related cleanups

   - saner iov_iter_init() replacements (and switching the syscalls to
     use of those)

   - ntfs switch to ->write_iter() (Anton)

   - aio cleanups and splitting iocb into common and async parts
     (Christoph)

   - assorted fixes (me, bfields, Andrew Elble)

  There's a lot more, including the completion of switchover to
  ->{read,write}_iter(), d_inode/d_backing_inode annotations, f_flags
  race fixes, etc, but that goes after #for-davem merge.  David has
  pulled it, and once it's in I'll send the next vfs pull request"

* 'for-linus-1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (35 commits)
  sg_start_req(): use import_iovec()
  sg_start_req(): make sure that there's not too many elements in iovec
  blk_rq_map_user(): use import_single_range()
  sg_io(): use import_iovec()
  process_vm_access: switch to {compat_,}import_iovec()
  switch keyctl_instantiate_key_common() to iov_iter
  switch {compat_,}do_readv_writev() to {compat_,}import_iovec()
  aio_setup_vectored_rw(): switch to {compat_,}import_iovec()
  vmsplice_to_user(): switch to import_iovec()
  kill aio_setup_single_vector()
  aio: simplify arguments of aio_setup_..._rw()
  aio: lift iov_iter_init() into aio_setup_..._rw()
  lift iov_iter into {compat_,}do_readv_writev()
  NFS: fix BUG() crash in notify_change() with patch to chown_common()
  dcache: return -ESTALE not -EBUSY on distributed fs race
  NTFS: Version 2.1.32 - Update file write from aio_write to write_iter.
  VFS: Add iov_iter_fault_in_multipages_readable()
  drop bogus check in file_open_root()
  switch security_inode_getattr() to struct path *
  constify tomoyo_realpath_from_path()
  ...
2015-04-14 15:31:03 -07:00
Christoph Hellwig
e2e40f2c1e fs: move struct kiocb to fs.h
struct kiocb now is a generic I/O container, so move it to fs.h.
Also do a #include diet for aio.h while we're at it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-03-25 20:28:11 -04:00
Theodore Ts'o
1efff914af fs: add dirtytime_expire_seconds sysctl
Add a tuning knob so we can adjust the dirtytime expiration timeout,
which is very useful for testing lazytime.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2015-03-17 12:23:32 -04:00
Andrey Ryabinin
3cd7645de6 mm, hugetlb: remove unnecessary lower bound on sysctl handlers"?
Commit ed4d4902eb ("mm, hugetlb: remove hugetlb_zero and
hugetlb_infinity") replaced 'unsigned long hugetlb_zero' with 'int zero'
leading to out-of-bounds access in proc_doulongvec_minmax().  Use
'.extra1 = NULL' instead of '.extra1 = &zero'.  Passing NULL is
equivalent to passing minimal value, which is 0 for unsigned types.

Fixes: ed4d4902eb ("mm, hugetlb: remove hugetlb_zero and hugetlb_infinity")
Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Suggested-by: Manfred Spraul <manfred@colorfullife.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-10 14:30:34 -08:00
Linus Torvalds
a7c180aa7e As the merge window is still open, and this code was not as complex
as I thought it might be. I'm pushing this in now.
 
 This will allow Thomas to debug his irq work for 3.20.
 
 This adds two new features:
 
 1) Allow traceopoints to be enabled right after mm_init(). By passing
 in the trace_event= kernel command line parameter, tracepoints can be
 enabled at boot up. For debugging things like the initialization of
 interrupts, it is needed to have tracepoints enabled very early. People
 have asked about this before and this has been on my todo list. As it
 can be helpful for Thomas to debug his upcoming 3.20 IRQ work, I'm
 pushing this now. This way he can add tracepoints into the IRQ set up
 and have users enable them when things go wrong.
 
 2) Have the tracepoints printed via printk() (the console) when they
 are triggered. If the irq code locks up or reboots the box, having the
 tracepoint output go into the kernel ring buffer is useless for
 debugging. But being able to add the tp_printk kernel command line
 option along with the trace_event= option will have these tracepoints
 printed as they occur, and that can be really useful for debugging
 early lock up or reboot problems.
 
 This code is not that intrusive and it passed all my tests. Thomas tried
 them out too and it works for his needs.
 
  Link: http://lkml.kernel.org/r/20141214201609.126831471@goodmis.org
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJUjv3kAAoJEEjnJuOKh9ldLNsIANAe5EmDCBw0WjR72n+G3qOH
 NC8calXfkjqHU0bv8Q3dRv20KH4MHOy6l4+EiV9/ovt71LOF3NEyUJ3HuShf9a8b
 sWcUhYbX3D1hViQe5sOzv9AWhBCFlKQGoNmQnydX9xa8ivRsBaTGJIGktWlHcwBE
 jF1i3fj3l3vRQSS8qZFXp3bzreunlGyPoSHcT6eWQeos+utj4sKwQWTLXTLQeM+6
 oQtFKRx7E5yX04qO1qFczS8qIEC6JH2C2jIRYEKUGepaELlnGkb8O7jQV/RaLF4/
 6P8VhZFG9YLS7fn7vWu0SnAN+Zwz5LzgjXAZt0FhGtIhLc18Oj8ouHH1UORsdQM=
 =Z4Un
 -----END PGP SIGNATURE-----

Merge tag 'trace-3.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing updates from Steven Rostedt:
 "As the merge window is still open, and this code was not as complex as
  I thought it might be.  I'm pushing this in now.

  This will allow Thomas to debug his irq work for 3.20.

  This adds two new features:

  1) Allow traceopoints to be enabled right after mm_init().

     By passing in the trace_event= kernel command line parameter,
     tracepoints can be enabled at boot up.  For debugging things like
     the initialization of interrupts, it is needed to have tracepoints
     enabled very early.  People have asked about this before and this
     has been on my todo list.  As it can be helpful for Thomas to debug
     his upcoming 3.20 IRQ work, I'm pushing this now.  This way he can
     add tracepoints into the IRQ set up and have users enable them when
     things go wrong.

  2) Have the tracepoints printed via printk() (the console) when they
     are triggered.

     If the irq code locks up or reboots the box, having the tracepoint
     output go into the kernel ring buffer is useless for debugging.
     But being able to add the tp_printk kernel command line option
     along with the trace_event= option will have these tracepoints
     printed as they occur, and that can be really useful for debugging
     early lock up or reboot problems.

  This code is not that intrusive and it passed all my tests.  Thomas
  tried them out too and it works for his needs.

   Link: http://lkml.kernel.org/r/20141214201609.126831471@goodmis.org"

* tag 'trace-3.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Add tp_printk cmdline to have tracepoints go to printk()
  tracing: Move enabling tracepoints to just after rcu_init()
2014-12-16 12:53:59 -08:00
Steven Rostedt (Red Hat)
0daa230296 tracing: Add tp_printk cmdline to have tracepoints go to printk()
Add the kernel command line tp_printk option that will have tracepoints
that are active sent to printk() as well as to the trace buffer.

Passing "tp_printk" will activate this. To turn it off, the sysctl
/proc/sys/kernel/tracepoint_printk can have '0' echoed into it. Note,
this only works if the cmdline option is used. Echoing 1 into the sysctl
file without the cmdline option will have no affect.

Note, this is a dangerous option. Having high frequency tracepoints send
their data to printk() can possibly cause a live lock. This is another
reason why this is only active if the command line option is used.

Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1412121539300.16494@nanos

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-12-15 10:17:38 -05:00
Prarit Bhargava
9e3961a097 kernel: add panic_on_warn
There have been several times where I have had to rebuild a kernel to
cause a panic when hitting a WARN() in the code in order to get a crash
dump from a system.  Sometimes this is easy to do, other times (such as
in the case of a remote admin) it is not trivial to send new images to
the user.

A much easier method would be a switch to change the WARN() over to a
panic.  This makes debugging easier in that I can now test the actual
image the WARN() was seen on and I do not have to engage in remote
debugging.

This patch adds a panic_on_warn kernel parameter and
/proc/sys/kernel/panic_on_warn calls panic() in the
warn_slowpath_common() path.  The function will still print out the
location of the warning.

An example of the panic_on_warn output:

The first line below is from the WARN_ON() to output the WARN_ON()'s
location.  After that the panic() output is displayed.

    WARNING: CPU: 30 PID: 11698 at /home/prarit/dummy_module/dummy-module.c:25 init_dummy+0x1f/0x30 [dummy_module]()
    Kernel panic - not syncing: panic_on_warn set ...

    CPU: 30 PID: 11698 Comm: insmod Tainted: G        W  OE  3.17.0+ #57
    Hardware name: Intel Corporation S2600CP/S2600CP, BIOS RMLSDP.86I.00.29.D696.1311111329 11/11/2013
     0000000000000000 000000008e3f87df ffff88080f093c38 ffffffff81665190
     0000000000000000 ffffffff818aea3d ffff88080f093cb8 ffffffff8165e2ec
     ffffffff00000008 ffff88080f093cc8 ffff88080f093c68 000000008e3f87df
    Call Trace:
     [<ffffffff81665190>] dump_stack+0x46/0x58
     [<ffffffff8165e2ec>] panic+0xd0/0x204
     [<ffffffffa038e05f>] ? init_dummy+0x1f/0x30 [dummy_module]
     [<ffffffff81076b90>] warn_slowpath_common+0xd0/0xd0
     [<ffffffffa038e040>] ? dummy_greetings+0x40/0x40 [dummy_module]
     [<ffffffff81076c8a>] warn_slowpath_null+0x1a/0x20
     [<ffffffffa038e05f>] init_dummy+0x1f/0x30 [dummy_module]
     [<ffffffff81002144>] do_one_initcall+0xd4/0x210
     [<ffffffff811b52c2>] ? __vunmap+0xc2/0x110
     [<ffffffff810f8889>] load_module+0x16a9/0x1b30
     [<ffffffff810f3d30>] ? store_uevent+0x70/0x70
     [<ffffffff810f49b9>] ? copy_module_from_fd.isra.44+0x129/0x180
     [<ffffffff810f8ec6>] SyS_finit_module+0xa6/0xd0
     [<ffffffff8166cf29>] system_call_fastpath+0x12/0x17

Successfully tested by me.

hpa said: There is another very valid use for this: many operators would
rather a machine shuts down than being potentially compromised either
functionally or security-wise.

Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Acked-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:10 -08:00
Kirill Tkhai
6419265899 sched/fair: Fix division by zero sysctl_numa_balancing_scan_size
File /proc/sys/kernel/numa_balancing_scan_size_mb allows writing of zero.

This bash command reproduces problem:

$ while :; do echo 0 > /proc/sys/kernel/numa_balancing_scan_size_mb; \
	   echo 256 > /proc/sys/kernel/numa_balancing_scan_size_mb; done

	divide error: 0000 [#1] SMP
	Modules linked in:
	CPU: 0 PID: 24112 Comm: bash Not tainted 3.17.0+ #8
	Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
	task: ffff88013c852600 ti: ffff880037a68000 task.ti: ffff880037a68000
	RIP: 0010:[<ffffffff81074191>]  [<ffffffff81074191>] task_scan_min+0x21/0x50
	RSP: 0000:ffff880037a6bce0  EFLAGS: 00010246
	RAX: 0000000000000a00 RBX: 00000000000003e8 RCX: 0000000000000000
	RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88013c852600
	RBP: ffff880037a6bcf0 R08: 0000000000000001 R09: 0000000000015c90
	R10: ffff880239bf6c00 R11: 0000000000000016 R12: 0000000000003fff
	R13: ffff88013c852600 R14: ffffea0008d1b000 R15: 0000000000000003
	FS:  00007f12bb048700(0000) GS:ffff88007da00000(0000) knlGS:0000000000000000
	CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
	CR2: 0000000001505678 CR3: 0000000234770000 CR4: 00000000000006f0
	Stack:
	 ffff88013c852600 0000000000003fff ffff880037a6bd18 ffffffff810741d1
	 ffff88013c852600 0000000000003fff 000000000002bfff ffff880037a6bda8
	 ffffffff81077ef7 ffffea0008a56d40 0000000000000001 0000000000000001
	Call Trace:
	 [<ffffffff810741d1>] task_scan_max+0x11/0x40
	 [<ffffffff81077ef7>] task_numa_fault+0x1f7/0xae0
	 [<ffffffff8115a896>] ? migrate_misplaced_page+0x276/0x300
	 [<ffffffff81134a4d>] handle_mm_fault+0x62d/0xba0
	 [<ffffffff8103e2f1>] __do_page_fault+0x191/0x510
	 [<ffffffff81030122>] ? native_smp_send_reschedule+0x42/0x60
	 [<ffffffff8106dc00>] ? check_preempt_curr+0x80/0xa0
	 [<ffffffff8107092c>] ? wake_up_new_task+0x11c/0x1a0
	 [<ffffffff8104887d>] ? do_fork+0x14d/0x340
	 [<ffffffff811799bb>] ? get_unused_fd_flags+0x2b/0x30
	 [<ffffffff811799df>] ? __fd_install+0x1f/0x60
	 [<ffffffff8103e67c>] do_page_fault+0xc/0x10
	 [<ffffffff8150d322>] page_fault+0x22/0x30
	RIP  [<ffffffff81074191>] task_scan_min+0x21/0x50
	RSP <ffff880037a6bce0>
	---[ end trace 9a826d16936c04de ]---

Also fix race in task_scan_min (it depends on compiler behaviour).

Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dario Faggioli <raistlin@linux.it>
Cc: David Rientjes <rientjes@google.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Link: http://lkml.kernel.org/r/1413455977.24793.78.camel@tkhai
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-10-28 10:46:04 +01:00
Linus Torvalds
d6dd50e07c Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU updates from Ingo Molnar:
 "The main changes in this cycle were:

   - changes related to No-CBs CPUs and NO_HZ_FULL

   - RCU-tasks implementation

   - torture-test updates

   - miscellaneous fixes

   - locktorture updates

   - RCU documentation updates"

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (81 commits)
  workqueue: Use cond_resched_rcu_qs macro
  workqueue: Add quiescent state between work items
  locktorture: Cleanup header usage
  locktorture: Cannot hold read and write lock
  locktorture: Fix __acquire annotation for spinlock irq
  locktorture: Support rwlocks
  rcu: Eliminate deadlock between CPU hotplug and expedited grace periods
  locktorture: Document boot/module parameters
  rcutorture: Rename rcutorture_runnable parameter
  locktorture: Add test scenario for rwsem_lock
  locktorture: Add test scenario for mutex_lock
  locktorture: Make torture scripting account for new _runnable name
  locktorture: Introduce torture context
  locktorture: Support rwsems
  locktorture: Add infrastructure for torturing read locks
  torture: Address race in module cleanup
  locktorture: Make statistics generic
  locktorture: Teach about lock debugging
  locktorture: Support mutexes
  locktorture: Add documentation
  ...
2014-10-13 15:44:12 +02:00
Johannes Weiner
1f13ae399c mm: remove noisy remainder of the scan_unevictable interface
The deprecation warnings for the scan_unevictable interface triggers by
scripts doing `sysctl -a | grep something else'.  This is annoying and not
helpful.

The interface has been defunct since 264e56d824 ("mm: disable user
interface to manually rescue unevictable pages"), which was in 2011, and
there haven't been any reports of usecases for it, only reports that the
deprecation warnings are annying.  It's unlikely that anybody is using
this interface specifically at this point, so remove it.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09 22:25:55 -04:00
Paul E. McKenney
59da22a020 rcutorture: Rename rcutorture_runnable parameter
This commit changes rcutorture_runnable to torture_runnable, which is
consistent with the names of the other parameters and is a bit shorter
as well.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-16 13:41:44 -07:00