-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAltYMlwACgkQONu9yGCS
aT5ZmxAAjAWUndXt7fTUyHgxkoG61sEkdX4jcsp6NFwQMudU0UHx4/kcZE+HdMjL
VU8BZtdUg+jMLXM4erVBpQRKY9YHIPi8nWMTm1UjduMCxVD6dVL1HU6/RXl1cYIx
rf/opYOimqT9lYCeffmd9ai2zEEJKSt7/avddcJY4qHiqLan27gbUdAq2H26aM/5
LUzAaSBzhq3VYo9Q5zv03b1+tORAxh2BIffZjGEFe8SQQl1o63WqwV4RxEhV/Bjt
hBgl/6B/+EHtQnYnbnoOT/an9Ma15ik4/z3vVv6yRLNK+hS5T31OKcYCsUrjp6O+
TQVaVLWWmn/VpIHAMkrhBs9Xxg5GmRziF77AkzyC506tK268M2+IoY77ursVl1YK
STaOwUcLUlKLbl5OADqMpYtNU9ybkP+MmgDZsIEXz9UiCZM721fL5Au2PHuzaYOD
2nE2EQb04It4k9GN8FStv2KPIiKUCEXi9MlNsHGPs6Mc+fliIigoKPhpU5JG+sxR
eJgPMNv4OWhwXWTd1wf0Gy5X+i0lQlwlGgIHFfSB8vzArJ0Y/yuPj2a6xhQshOza
Ivq7JudHvxYxhDSWYoCKgtTgzMdSBbJ3xjOoUUHy4ryamYeyaMvgFjsaCTMr0dsw
76BkgNTbpsip+I77a9h4Ozlk5QE7h61EsqjmZBkGVqLYjrUQ/IU=
=X4tZ
-----END PGP SIGNATURE-----
Merge 4.4.144 into android-4.4
Changes in 4.4.144
KVM/Eventfd: Avoid crash when assign and deassign specific eventfd in parallel.
x86/MCE: Remove min interval polling limitation
fat: fix memory allocation failure handling of match_strdup()
ALSA: rawmidi: Change resized buffers atomically
ARC: Fix CONFIG_SWAP
ARC: mm: allow mprotect to make stack mappings executable
mm: memcg: fix use after free in mem_cgroup_iter()
ipv4: Return EINVAL when ping_group_range sysctl doesn't map to user ns
ipv6: fix useless rol32 call on hash
lib/rhashtable: consider param->min_size when setting initial table size
net/ipv4: Set oif in fib_compute_spec_dst
net: phy: fix flag masking in __set_phy_supported
ptp: fix missing break in switch
tg3: Add higher cpu clock for 5762.
net: Don't copy pfmemalloc flag in __copy_skb_header()
skbuff: Unconditionally copy pfmemalloc in __skb_clone()
xhci: Fix perceived dead host due to runtime suspend race with event handler
x86/paravirt: Make native_save_fl() extern inline
x86/cpufeatures: Add CPUID_7_EDX CPUID leaf
x86/cpufeatures: Add Intel feature bits for Speculation Control
x86/cpufeatures: Add AMD feature bits for Speculation Control
x86/msr: Add definitions for new speculation control MSRs
x86/pti: Do not enable PTI on CPUs which are not vulnerable to Meltdown
x86/cpufeature: Blacklist SPEC_CTRL/PRED_CMD on early Spectre v2 microcodes
x86/speculation: Add basic IBPB (Indirect Branch Prediction Barrier) support
x86/cpufeatures: Clean up Spectre v2 related CPUID flags
x86/cpuid: Fix up "virtual" IBRS/IBPB/STIBP feature bits on Intel
x86/pti: Mark constant arrays as __initconst
x86/asm/entry/32: Simplify pushes of zeroed pt_regs->REGs
x86/entry/64/compat: Clear registers for compat syscalls, to reduce speculation attack surface
x86/speculation: Update Speculation Control microcode blacklist
x86/speculation: Correct Speculation Control microcode blacklist again
x86/speculation: Clean up various Spectre related details
x86/speculation: Fix up array_index_nospec_mask() asm constraint
x86/speculation: Add <asm/msr-index.h> dependency
x86/xen: Zero MSR_IA32_SPEC_CTRL before suspend
x86/mm: Factor out LDT init from context init
x86/mm: Give each mm TLB flush generation a unique ID
x86/speculation: Use Indirect Branch Prediction Barrier in context switch
x86/spectre_v2: Don't check microcode versions when running under hypervisors
x86/speculation: Use IBRS if available before calling into firmware
x86/speculation: Move firmware_restrict_branch_speculation_*() from C to CPP
x86/speculation: Remove Skylake C2 from Speculation Control microcode blacklist
selftest/seccomp: Fix the flag name SECCOMP_FILTER_FLAG_TSYNC
selftest/seccomp: Fix the seccomp(2) signature
xen: set cpu capabilities from xen_start_kernel()
x86/amd: don't set X86_BUG_SYSRET_SS_ATTRS when running under Xen
x86/nospec: Simplify alternative_msr_write()
x86/bugs: Concentrate bug detection into a separate function
x86/bugs: Concentrate bug reporting into a separate function
x86/bugs: Read SPEC_CTRL MSR during boot and re-use reserved bits
x86/bugs, KVM: Support the combination of guest and host IBRS
x86/cpu: Rename Merrifield2 to Moorefield
x86/cpu/intel: Add Knights Mill to Intel family
x86/bugs: Expose /sys/../spec_store_bypass
x86/cpufeatures: Add X86_FEATURE_RDS
x86/bugs: Provide boot parameters for the spec_store_bypass_disable mitigation
x86/bugs/intel: Set proper CPU features and setup RDS
x86/bugs: Whitelist allowed SPEC_CTRL MSR values
x86/bugs/AMD: Add support to disable RDS on Fam[15, 16, 17]h if requested
x86/speculation: Create spec-ctrl.h to avoid include hell
prctl: Add speculation control prctls
x86/process: Optimize TIF checks in __switch_to_xtra()
x86/process: Correct and optimize TIF_BLOCKSTEP switch
x86/process: Optimize TIF_NOTSC switch
x86/process: Allow runtime control of Speculative Store Bypass
x86/speculation: Add prctl for Speculative Store Bypass mitigation
nospec: Allow getting/setting on non-current task
proc: Provide details on speculation flaw mitigations
seccomp: Enable speculation flaw mitigations
prctl: Add force disable speculation
seccomp: Use PR_SPEC_FORCE_DISABLE
seccomp: Add filter flag to opt-out of SSB mitigation
seccomp: Move speculation migitation control to arch code
x86/speculation: Make "seccomp" the default mode for Speculative Store Bypass
x86/bugs: Rename _RDS to _SSBD
proc: Use underscores for SSBD in 'status'
Documentation/spec_ctrl: Do some minor cleanups
x86/bugs: Fix __ssb_select_mitigation() return type
x86/bugs: Make cpu_show_common() static
x86/bugs: Fix the parameters alignment and missing void
x86/cpu: Make alternative_msr_write work for 32-bit code
x86/speculation: Use synthetic bits for IBRS/IBPB/STIBP
x86/cpufeatures: Disentangle MSR_SPEC_CTRL enumeration from IBRS
x86/cpufeatures: Disentangle SSBD enumeration
x86/cpu/AMD: Fix erratum 1076 (CPB bit)
x86/cpufeatures: Add FEATURE_ZEN
x86/speculation: Handle HT correctly on AMD
x86/bugs, KVM: Extend speculation control for VIRT_SPEC_CTRL
x86/speculation: Add virtualized speculative store bypass disable support
x86/speculation: Rework speculative_store_bypass_update()
x86/bugs: Unify x86_spec_ctrl_{set_guest, restore_host}
x86/bugs: Expose x86_spec_ctrl_base directly
x86/bugs: Remove x86_spec_ctrl_set()
x86/bugs: Rework spec_ctrl base and mask logic
x86/speculation, KVM: Implement support for VIRT_SPEC_CTRL/LS_CFG
x86/bugs: Rename SSBD_NO to SSB_NO
x86/xen: Add call of speculative_store_bypass_ht_init() to PV paths
x86/cpu: Re-apply forced caps every time CPU caps are re-read
block: do not use interruptible wait anywhere
clk: tegra: Fix PLL_U post divider and initial rate on Tegra30
ubi: Introduce vol_ignored()
ubi: Rework Fastmap attach base code
ubi: Be more paranoid while seaching for the most recent Fastmap
ubi: Fix races around ubi_refill_pools()
ubi: Fix Fastmap's update_vol()
ubi: fastmap: Erase outdated anchor PEBs during attach
Linux 4.4.144
Change-Id: Ia3e9b2b7bc653cba68b76878d34f8fcbbc007a13
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit 7bbf1373e228840bb0295a2ca26d548ef37f448e upstream
Adjust arch_prctl_get/set_spec_ctrl() to operate on tasks other than
current.
This is needed both for /proc/$pid/status queries and for seccomp (since
thread-syncing can trigger seccomp in non-current threads).
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Srivatsa S. Bhat <srivatsa@csail.mit.edu>
Reviewed-by: Matt Helsley (VMware) <matt.helsley@gmail.com>
Reviewed-by: Alexey Makhalov <amakhalov@vmware.com>
Reviewed-by: Bo Gan <ganb@vmware.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit b617cfc858161140d69cc0b5cc211996b557a1c7 upstream
Add two new prctls to control aspects of speculation related vulnerabilites
and their mitigations to provide finer grained control over performance
impacting mitigations.
PR_GET_SPECULATION_CTRL returns the state of the speculation misfeature
which is selected with arg2 of prctl(2). The return value uses bit 0-2 with
the following meaning:
Bit Define Description
0 PR_SPEC_PRCTL Mitigation can be controlled per task by
PR_SET_SPECULATION_CTRL
1 PR_SPEC_ENABLE The speculation feature is enabled, mitigation is
disabled
2 PR_SPEC_DISABLE The speculation feature is disabled, mitigation is
enabled
If all bits are 0 the CPU is not affected by the speculation misfeature.
If PR_SPEC_PRCTL is set, then the per task control of the mitigation is
available. If not set, prctl(PR_SET_SPECULATION_CTRL) for the speculation
misfeature will fail.
PR_SET_SPECULATION_CTRL allows to control the speculation misfeature, which
is selected by arg2 of prctl(2) per task. arg3 is used to hand in the
control value, i.e. either PR_SPEC_ENABLE or PR_SPEC_DISABLE.
The common return values are:
EINVAL prctl is not implemented by the architecture or the unused prctl()
arguments are not 0
ENODEV arg2 is selecting a not supported speculation misfeature
PR_SET_SPECULATION_CTRL has these additional return values:
ERANGE arg3 is incorrect, i.e. it's not either PR_SPEC_ENABLE or PR_SPEC_DISABLE
ENXIO prctl control of the selected speculation misfeature is disabled
The first supported controlable speculation misfeature is
PR_SPEC_STORE_BYPASS. Add the define so this can be shared between
architectures.
Based on an initial patch from Tim Chen and mostly rewritten.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Srivatsa S. Bhat <srivatsa@csail.mit.edu>
Reviewed-by: Matt Helsley (VMware) <matt.helsley@gmail.com>
Reviewed-by: Alexey Makhalov <amakhalov@vmware.com>
Reviewed-by: Bo Gan <ganb@vmware.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlsOO14ACgkQONu9yGCS
aT4ulhAAhMVYSRa/cOFm0BHxSL/59WmJTa3Na8TJqkTrJy+LRluBiKCywyiMZknp
4rIffv4jcxcFNCpqYTjNTSStGLWCCkBLNSzxuzFv5M89Jdx4Gz1Ww1hzMESP3gxK
puHUewSJQm7qtVOiC2l4YcW3Q6nFK0kqbCWpSkHoGVfZoX9JS2P1V8n+KFZpUH1a
UyhVW48ainUpXfhSKJZ5xABiWYM2hcSq52RW1edNZvwuKwulZ+2EME26HgGCK7ff
WHzGHECE6Lem+iunR26J/QtbTo8LKEyU0F039X21E7FIxf33S0xyPx+MGjJfWBOo
Q6A23mAEWwEhlMomNKzdd/iUzSVlWSzKe8LJa7GI5G6BxftN8Z0TGTnKzIDkw++M
T6RfK03CP6c9rQ756d0fTPxdZh6ae9EN8WSot/Sbbc9SvGSfy6o4I8Y/uJygShmF
j13JfMweC+t7/6fyUqc5dcgY0Xy7LUFiWqfPxQj6axDiT82Mx2AvQaczrPUAKr1K
KQsetmyhHC+Cpy7ILrhUGYjEWlvQm11ZiFoX8BkocFLFWk736QA63iB7mOUpCOQR
SKLK00dF163GJdQC6nb4wCtyBxnCg4pSoP/72Z1foPtaSd3ccJ4CLsIE6GY5sP/I
sDlPnIlnzEDfDPIxtVfKC8e1JINP6awXwtoJJo6MnuCuP3LDb58=
=ogZQ
-----END PGP SIGNATURE-----
Merge 4.4.134 into android-4.4
Changes in 4.4.134
MIPS: ptrace: Expose FIR register through FP regset
MIPS: Fix ptrace(2) PTRACE_PEEKUSR and PTRACE_POKEUSR accesses to o32 FGRs
KVM: Fix spelling mistake: "cop_unsuable" -> "cop_unusable"
affs_lookup(): close a race with affs_remove_link()
aio: fix io_destroy(2) vs. lookup_ioctx() race
ALSA: timer: Fix pause event notification
mmc: sdhci-iproc: fix 32bit writes for TRANSFER_MODE register
libata: Blacklist some Sandisk SSDs for NCQ
libata: blacklist Micron 500IT SSD with MU01 firmware
xen-swiotlb: fix the check condition for xen_swiotlb_free_coherent
Revert "ipc/shm: Fix shmat mmap nil-page protection"
ipc/shm: fix shmat() nil address after round-down when remapping
kasan: fix memory hotplug during boot
kernel/sys.c: fix potential Spectre v1 issue
kernel/signal.c: avoid undefined behaviour in kill_something_info
xfs: remove racy hasattr check from attr ops
do d_instantiate/unlock_new_inode combinations safely
firewire-ohci: work around oversized DMA reads on JMicron controllers
NFSv4: always set NFS_LOCK_LOST when a lock is lost.
ALSA: hda - Use IS_REACHABLE() for dependency on input
ASoC: au1x: Fix timeout tests in au1xac97c_ac97_read()
kvm: x86: fix KVM_XEN_HVM_CONFIG ioctl
tracing/hrtimer: Fix tracing bugs by taking all clock bases and modes into account
PCI: Add function 1 DMA alias quirk for Marvell 9128
tools lib traceevent: Simplify pointer print logic and fix %pF
perf callchain: Fix attr.sample_max_stack setting
tools lib traceevent: Fix get_field_str() for dynamic strings
dm thin: fix documentation relative to low water mark threshold
nfs: Do not convert nfs_idmap_cache_timeout to jiffies
watchdog: sp5100_tco: Fix watchdog disable bit
kconfig: Don't leak main menus during parsing
kconfig: Fix automatic menu creation mem leak
kconfig: Fix expr_free() E_NOT leak
mac80211_hwsim: fix possible memory leak in hwsim_new_radio_nl()
ipmi/powernv: Fix error return code in ipmi_powernv_probe()
Btrfs: set plug for fsync
btrfs: Fix out of bounds access in btrfs_search_slot
Btrfs: fix scrub to repair raid6 corruption
scsi: fas216: fix sense buffer initialization
HID: roccat: prevent an out of bounds read in kovaplus_profile_activated()
jffs2: Fix use-after-free bug in jffs2_iget()'s error handling path
powerpc/numa: Use ibm,max-associativity-domains to discover possible nodes
powerpc/numa: Ensure nodes initialized for hotplug
RDMA/mlx5: Avoid memory leak in case of XRCD dealloc failure
ntb_transport: Fix bug with max_mw_size parameter
ocfs2: return -EROFS to mount.ocfs2 if inode block is invalid
ocfs2/acl: use 'ip_xattr_sem' to protect getting extended attribute
ocfs2: return error when we attempt to access a dirty bh in jbd2
mm/mempolicy: fix the check of nodemask from user
mm/mempolicy: add nodes_empty check in SYSC_migrate_pages
asm-generic: provide generic_pmdp_establish()
mm: pin address_space before dereferencing it while isolating an LRU page
IB/ipoib: Fix for potential no-carrier state
x86/power: Fix swsusp_arch_resume prototype
firmware: dmi_scan: Fix handling of empty DMI strings
ACPI: processor_perflib: Do not send _PPC change notification if not ready
bpf: fix selftests/bpf test_kmod.sh failure when CONFIG_BPF_JIT_ALWAYS_ON=y
MIPS: TXx9: use IS_BUILTIN() for CONFIG_LEDS_CLASS
xen-netfront: Fix race between device setup and open
xen/grant-table: Use put_page instead of free_page
RDS: IB: Fix null pointer issue
arm64: spinlock: Fix theoretical trylock() A-B-A with LSE atomics
proc: fix /proc/*/map_files lookup
cifs: silence compiler warnings showing up with gcc-8.0.0
bcache: properly set task state in bch_writeback_thread()
bcache: fix for allocator and register thread race
bcache: fix for data collapse after re-attaching an attached device
bcache: return attach error when no cache set exist
tools/libbpf: handle issues with bpf ELF objects containing .eh_frames
locking/qspinlock: Ensure node->count is updated before initialising node
irqchip/gic-v3: Change pr_debug message to pr_devel
scsi: ufs: Enable quirk to ignore sending WRITE_SAME command
scsi: bnx2fc: Fix check in SCSI completion handler for timed out request
scsi: sym53c8xx_2: iterator underflow in sym_getsync()
scsi: mptfusion: Add bounds check in mptctl_hp_targetinfo()
scsi: qla2xxx: Avoid triggering undefined behavior in qla2x00_mbx_completion()
ARC: Fix malformed ARC_EMUL_UNALIGNED default
usb: gadget: f_uac2: fix bFirstInterface in composite gadget
usb: gadget: fsl_udc_core: fix ep valid checks
usb: dwc2: Fix dwc2_hsotg_core_init_disconnected()
selftests: memfd: add config fragment for fuse
scsi: storvsc: Increase cmd_per_lun for higher speed devices
scsi: aacraid: fix shutdown crash when init fails
scsi: qla4xxx: skip error recovery in case of register disconnect.
ARM: OMAP2+: timer: fix a kmemleak caused in omap_get_timer_dt
ARM: OMAP3: Fix prm wake interrupt for resume
ARM: OMAP1: clock: Fix debugfs_create_*() usage
NFC: llcp: Limit size of SDP URI
mac80211: round IEEE80211_TX_STATUS_HEADROOM up to multiple of 4
md raid10: fix NULL deference in handle_write_completed()
drm/exynos: fix comparison to bitshift when dealing with a mask
usb: musb: fix enumeration after resume
locking/xchg/alpha: Add unconditional memory barrier to cmpxchg()
md: raid5: avoid string overflow warning
kernel/relay.c: limit kmalloc size to KMALLOC_MAX_SIZE
powerpc/bpf/jit: Fix 32-bit JIT for seccomp_data access
s390/cio: fix return code after missing interrupt
s390/cio: clear timer when terminating driver I/O
ARM: OMAP: Fix dmtimer init for omap1
smsc75xx: fix smsc75xx_set_features()
regulatory: add NUL to request alpha2
locking/xchg/alpha: Fix xchg() and cmpxchg() memory ordering bugs
x86/topology: Update the 'cpu cores' field in /proc/cpuinfo correctly across CPU hotplug operations
media: dmxdev: fix error code for invalid ioctls
md/raid1: fix NULL pointer dereference
batman-adv: fix packet checksum in receive path
batman-adv: invalidate checksum on fragment reassembly
netfilter: ebtables: convert BUG_ONs to WARN_ONs
nvme-pci: Fix nvme queue cleanup if IRQ setup fails
clocksource/drivers/fsl_ftm_timer: Fix error return checking
r8152: fix tx packets accounting
virtio-gpu: fix ioctl and expose the fixed status to userspace.
dmaengine: rcar-dmac: fix max_chunk_size for R-Car Gen3
bcache: fix kcrashes with fio in RAID5 backend dev
sit: fix IFLA_MTU ignored on NEWLINK
gianfar: Fix Rx byte accounting for ndev stats
net/tcp/illinois: replace broken algorithm reference link
xen/pirq: fix error path cleanup when binding MSIs
Btrfs: send, fix issuing write op when processing hole in no data mode
selftests/powerpc: Skip the subpage_prot tests if the syscall is unavailable
KVM: PPC: Book3S HV: Fix VRMA initialization with 2MB or 1GB memory backing
watchdog: f71808e_wdt: Fix magic close handling
e1000e: Fix check_for_link return value with autoneg off
e1000e: allocate ring descriptors with dma_zalloc_coherent
usb: musb: call pm_runtime_{get,put}_sync before reading vbus registers
scsi: mpt3sas: Do not mark fw_event workqueue as WQ_MEM_RECLAIM
scsi: sd: Keep disk read-only when re-reading partition
fbdev: Fixing arbitrary kernel leak in case FBIOGETCMAP_SPARC in sbusfb_ioctl_helper().
xen: xenbus: use put_device() instead of kfree()
USB: OHCI: Fix NULL dereference in HCDs using HCD_LOCAL_MEM
netfilter: ebtables: fix erroneous reject of last rule
bnxt_en: Check valid VNIC ID in bnxt_hwrm_vnic_set_tpa().
workqueue: use put_device() instead of kfree()
ipv4: lock mtu in fnhe when received PMTU < net.ipv4.route.min_pmtu
sunvnet: does not support GSO for sctp
net: Fix vlan untag for bridge and vlan_dev with reorder_hdr off
batman-adv: fix header size check in batadv_dbg_arp()
vti4: Don't count header length twice on tunnel setup
vti4: Don't override MTU passed on link creation via IFLA_MTU
perf/cgroup: Fix child event counting bug
RDMA/ucma: Correct option size check using optlen
mm/mempolicy.c: avoid use uninitialized preferred_node
selftests: ftrace: Add probe event argument syntax testcase
selftests: ftrace: Add a testcase for string type with kprobe_event
selftests: ftrace: Add a testcase for probepoint
batman-adv: fix multicast-via-unicast transmission with AP isolation
batman-adv: fix packet loss for broadcasted DHCP packets to a server
ARM: 8748/1: mm: Define vdso_start, vdso_end as array
net: qmi_wwan: add BroadMobi BM806U 2020:2033
net/usb/qmi_wwan.c: Add USB id for lt4120 modem
net-usb: add qmi_wwan if on lte modem wistron neweb d18q1
llc: properly handle dev_queue_xmit() return value
mm/kmemleak.c: wait for scan completion before disabling free
net: Fix untag for vlan packets without ethernet header
net: mvneta: fix enable of all initialized RXQs
sh: fix debug trap failure to process signals before return to user
x86/pgtable: Don't set huge PUD/PMD on non-leaf entries
fs/proc/proc_sysctl.c: fix potential page fault while unregistering sysctl table
swap: divide-by-zero when zero length swap file on ssd
sr: get/drop reference to device in revalidate and check_events
Force log to disk before reading the AGF during a fstrim
cpufreq: CPPC: Initialize shared perf capabilities of CPUs
scsi: aacraid: Insure command thread is not recursively stopped
dp83640: Ensure against premature access to PHY registers after reset
mm/ksm: fix interaction with THP
mm: fix races between address_space dereference and free in page_evicatable
Btrfs: bail out on error during replay_dir_deletes
Btrfs: fix NULL pointer dereference in log_dir_items
btrfs: Fix possible softlock on single core machines
ocfs2/dlm: don't handle migrate lockres if already in shutdown
sched/rt: Fix rq->clock_update_flags < RQCF_ACT_SKIP warning
KVM: VMX: raise internal error for exception during invalid protected mode state
fscache: Fix hanging wait on page discarded by writeback
sparc64: Make atomic_xchg() an inline function rather than a macro.
rtc: snvs: Fix usage of snvs_rtc_enable
net: bgmac: Fix endian access in bgmac_dma_tx_ring_free()
Bluetooth: btusb: Add USB ID 7392:a611 for Edimax EW-7611ULB
btrfs: tests/qgroup: Fix wrong tree backref level
Btrfs: fix copy_items() return value when logging an inode
btrfs: fix lockdep splat in btrfs_alloc_subvolume_writers
xen/acpi: off by one in read_acpi_id()
ACPI: acpi_pad: Fix memory leak in power saving threads
powerpc/mpic: Check if cpu_possible() in mpic_physmask()
m68k: set dma and coherent masks for platform FEC ethernets
parisc/pci: Switch LBA PCI bus from Hard Fail to Soft Fail mode
hwmon: (nct6775) Fix writing pwmX_mode
rtc: hctosys: Ensure system time doesn't overflow time_t
powerpc/perf: Prevent kernel address leak to userspace via BHRB buffer
powerpc/perf: Fix kernel address leak via sampling registers
tools/thermal: tmon: fix for segfault
selftests: Print the test we're running to /dev/kmsg
net/mlx5: Protect from command bit overflow
ath10k: Fix kernel panic while using worker (ath10k_sta_rc_update_wk)
ima: Fix Kconfig to select TPM 2.0 CRB interface
ima: Fallback to the builtin hash algorithm
virtio-net: Fix operstate for virtio when no VIRTIO_NET_F_STATUS
arm: dts: socfpga: fix GIC PPI warning
usb: dwc3: Update DWC_usb31 GTXFIFOSIZ reg fields
cpufreq: cppc_cpufreq: Fix cppc_cpufreq_init() failure path
clk: Don't show the incorrect clock phase
zorro: Set up z->dev.dma_mask for the DMA API
bcache: quit dc->writeback_thread when BCACHE_DEV_DETACHING is set
ACPICA: Events: add a return on failure from acpi_hw_register_read
ACPICA: acpi: acpica: fix acpi operand cache leak in nseval.c
i2c: mv64xxx: Apply errata delay only in standard mode
KVM: lapic: stop advertising DIRECTED_EOI when in-kernel IOAPIC is in use
xhci: zero usb device slot_id member when disabling and freeing a xhci slot
MIPS: ath79: Fix AR724X_PLL_REG_PCIE_CONFIG offset
PCI: Restore config space on runtime resume despite being unbound
ipmi_ssif: Fix kernel panic at msg_done_handler
usb: dwc2: Fix interval type issue
usb: gadget: ffs: Let setup() return USB_GADGET_DELAYED_STATUS
usb: gadget: ffs: Execute copy_to_user() with USER_DS set
powerpc: Add missing prototype for arch_irq_work_raise()
ASoC: topology: create TLV data for dapm widgets
perf/core: Fix perf_output_read_group()
hwmon: (pmbus/max8688) Accept negative page register values
hwmon: (pmbus/adm1275) Accept negative page register values
cdrom: do not call check_disk_change() inside cdrom_open()
gfs2: Fix fallocate chunk size
usb: gadget: udc: change comparison to bitshift when dealing with a mask
usb: gadget: composite: fix incorrect handling of OS desc requests
x86/devicetree: Initialize device tree before using it
x86/devicetree: Fix device IRQ settings in DT
ALSA: vmaster: Propagate slave error
media: cx23885: Override 888 ImpactVCBe crystal frequency
media: cx23885: Set subdev host data to clk_freq pointer
media: s3c-camif: fix out-of-bounds array access
dmaengine: pl330: fix a race condition in case of threaded irqs
media: em28xx: USB bulk packet size fix
clk: rockchip: Prevent calculating mmc phase if clock rate is zero
enic: enable rq before updating rq descriptors
hwrng: stm32 - add reset during probe
staging: rtl8192u: return -ENOMEM on failed allocation of priv->oldaddr
rtc: tx4939: avoid unintended sign extension on a 24 bit shift
serial: xuartps: Fix out-of-bounds access through DT alias
serial: samsung: Fix out-of-bounds access through serial port index
serial: mxs-auart: Fix out-of-bounds access through serial port index
serial: imx: Fix out-of-bounds access through serial port index
serial: fsl_lpuart: Fix out-of-bounds access through DT alias
serial: arc_uart: Fix out-of-bounds access through DT alias
PCI: Add function 1 DMA alias quirk for Marvell 88SE9220
udf: Provide saner default for invalid uid / gid
media: cx25821: prevent out-of-bounds read on array card
clk: samsung: s3c2410: Fix PLL rates
clk: samsung: exynos5260: Fix PLL rates
clk: samsung: exynos5433: Fix PLL rates
clk: samsung: exynos5250: Fix PLL rates
clk: samsung: exynos3250: Fix PLL rates
crypto: sunxi-ss - Add MODULE_ALIAS to sun4i-ss
audit: return on memory error to avoid null pointer dereference
MIPS: Octeon: Fix logging messages with spurious periods after newlines
drm/rockchip: Respect page offset for PRIME mmap calls
x86/apic: Set up through-local-APIC mode on the boot CPU if 'noapic' specified
perf tests: Use arch__compare_symbol_names to compare symbols
perf report: Fix memory corruption in --branch-history mode --branch-history
selftests/net: fixes psock_fanout eBPF test case
netlabel: If PF_INET6, check sk_buff ip header version
scsi: lpfc: Fix issue_lip if link is disabled
scsi: lpfc: Fix soft lockup in lpfc worker thread during LIP testing
scsi: lpfc: Fix frequency of Release WQE CQEs
regulator: of: Add a missing 'of_node_put()' in an error handling path of 'of_regulator_match()'
ASoC: samsung: i2s: Ensure the RCLK rate is properly determined
Bluetooth: btusb: Add device ID for RTL8822BE
kdb: make "mdr" command repeat
s390/ftrace: use expoline for indirect branches
Linux 4.4.134
Change-Id: Iababaf9b89bc8d0437b95e1368d8b0a9126a178c
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit 23d6aef74da86a33fa6bb75f79565e0a16ee97c2 upstream.
`resource' can be controlled by user-space, hence leading to a potential
exploitation of the Spectre variant 1 vulnerability.
This issue was detected with the help of Smatch:
kernel/sys.c:1474 __do_compat_sys_old_getrlimit() warn: potential spectre issue 'get_current()->signal->rlim' (local cap)
kernel/sys.c:1455 __do_sys_old_getrlimit() warn: potential spectre issue 'get_current()->signal->rlim' (local cap)
Fix this by sanitizing *resource* before using it to index
current->signal->rlim
Notice that given that speculation windows are large, the policy is to
kill the speculation on the first load and not worry if it can be
completed with a dependent load/store [1].
[1] https://marc.info/?l=linux-kernel&m=152449131114778&w=2
Link: http://lkml.kernel.org/r/20180515030038.GA11822@embeddedor.com
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This backports da8b44d5a9f8bf26da637b7336508ca534d6b319 from upstream.
This patchset introduces a /proc/<pid>/timerslack_ns interface which
would allow controlling processes to be able to set the timerslack value
on other processes in order to save power by avoiding wakeups (Something
Android currently does via out-of-tree patches).
The first patch tries to fix the internal timer_slack_ns usage which was
defined as a long, which limits the slack range to ~4 seconds on 32bit
systems. It converts it to a u64, which provides the same basically
unlimited slack (500 years) on both 32bit and 64bit machines.
The second patch introduces the /proc/<pid>/timerslack_ns interface
which allows the full 64bit slack range for a task to be read or set on
both 32bit and 64bit machines.
With these two patches, on a 32bit machine, after setting the slack on
bash to 10 seconds:
$ time sleep 1
real 0m10.747s
user 0m0.001s
sys 0m0.005s
The first patch is a little ugly, since I had to chase the slack delta
arguments through a number of functions converting them to u64s. Let me
know if it makes sense to break that up more or not.
Other than that things are fairly straightforward.
This patch (of 2):
The timer_slack_ns value in the task struct is currently a unsigned
long. This means that on 32bit applications, the maximum slack is just
over 4 seconds. However, on 64bit machines, its much much larger (~500
years).
This disparity could make application development a little (as well as
the default_slack) to a u64. This means both 32bit and 64bit systems
have the same effective internal slack range.
Now the existing ABI via PR_GET_TIMERSLACK and PR_SET_TIMERSLACK specify
the interface as a unsigned long, so we preserve that limitation on
32bit systems, where SET_TIMERSLACK can only set the slack to a unsigned
long value, and GET_TIMERSLACK will return ULONG_MAX if the slack is
actually larger then what can be stored by an unsigned long.
This patch also modifies hrtimer functions which specified the slack
delta as a unsigned long.
Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Oren Laadan <orenl@cellrox.com>
Cc: Ruchi Kandoi <kandoiruchi@google.com>
Cc: Rom Lemarchand <romlem@android.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Android Kernel Team <kernel-team@android.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
commit ddf1d398e517e660207e2c807f76a90df543a217 upstream.
An unprivileged user can trigger an oops on a kernel with
CONFIG_CHECKPOINT_RESTORE.
proc_pid_cmdline_read takes mmap_sem for reading and obtains args + env
start/end values. These get sanity checked as follows:
BUG_ON(arg_start > arg_end);
BUG_ON(env_start > env_end);
These can be changed by prctl_set_mm. Turns out also takes the semaphore for
reading, effectively rendering it useless. This results in:
kernel BUG at fs/proc/base.c:240!
invalid opcode: 0000 [#1] SMP
Modules linked in: virtio_net
CPU: 0 PID: 925 Comm: a.out Not tainted 4.4.0-rc8-next-20160105dupa+ #71
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
task: ffff880077a68000 ti: ffff8800784d0000 task.ti: ffff8800784d0000
RIP: proc_pid_cmdline_read+0x520/0x530
RSP: 0018:ffff8800784d3db8 EFLAGS: 00010206
RAX: ffff880077c5b6b0 RBX: ffff8800784d3f18 RCX: 0000000000000000
RDX: 0000000000000002 RSI: 00007f78e8857000 RDI: 0000000000000246
RBP: ffff8800784d3e40 R08: 0000000000000008 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000050
R13: 00007f78e8857800 R14: ffff88006fcef000 R15: ffff880077c5b600
FS: 00007f78e884a740(0000) GS:ffff88007b200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f78e8361770 CR3: 00000000790a5000 CR4: 00000000000006f0
Call Trace:
__vfs_read+0x37/0x100
vfs_read+0x82/0x130
SyS_read+0x58/0xd0
entry_SYSCALL_64_fastpath+0x12/0x76
Code: 4c 8b 7d a8 eb e9 48 8b 9d 78 ff ff ff 4c 8b 7d 90 48 8b 03 48 39 45 a8 0f 87 f0 fe ff ff e9 d1 fe ff ff 4c 8b 7d 90 eb c6 0f 0b <0f> 0b 0f 0b 66 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00
RIP proc_pid_cmdline_read+0x520/0x530
---[ end trace 97882617ae9c6818 ]---
Turns out there are instances where the code just reads aformentioned
values without locking whatsoever - namely environ_read and get_cmdline.
Interestingly these functions look quite resilient against bogus values,
but I don't believe this should be relied upon.
The first patch gets rid of the oops bug by grabbing mmap_sem for
writing.
The second patch is optional and puts locking around aformentioned
consumers for safety. Consumers of other fields don't seem to benefit
from similar treatment and are left untouched.
This patch (of 2):
The code was taking the semaphore for reading, which does not protect
against readers nor concurrent modifications.
The problem could cause a sanity checks to fail in procfs's cmdline
reader, resulting in an OOPS.
Note that some functions perform an unlocked read of various mm fields,
but they seem to be fine despite possible modificaton.
Signed-off-by: Mateusz Guzik <mguzik@redhat.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Jarod Wilson <jarod@redhat.com>
Cc: Jan Stancek <jstancek@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Anshuman Khandual <anshuman.linux@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Update vma_merge() call in private anonymous memory prctl,
introduced in AOSP commit ee8c5f78f09a
"mm: add a field to store names for private anonymous memory",
so as to align with changes from upstream commit 19a809afe2
"userfaultfd: teach vma_merge to merge across vma->vm_userfaultfd_ctx".
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
Userspace processes often have multiple allocators that each do
anonymous mmaps to get memory. When examining memory usage of
individual processes or systems as a whole, it is useful to be
able to break down the various heaps that were allocated by
each layer and examine their size, RSS, and physical memory
usage.
This patch adds a user pointer to the shared union in
vm_area_struct that points to a null terminated string inside
the user process containing a name for the vma. vmas that
point to the same address will be merged, but vmas that
point to equivalent strings at different addresses will
not be merged.
Userspace can set the name for a region of memory by calling
prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned long)name);
Setting the name to NULL clears it.
The names of named anonymous vmas are shown in /proc/pid/maps
as [anon:<name>] and in /proc/pid/smaps in a new "Name" field
that is only present for named vmas. If the userspace pointer
is no longer valid all or part of the name will be replaced
with "<fault>".
The idea to store a userspace pointer to reduce the complexity
within mm (at the expense of the complexity of reading
/proc/pid/mem) came from Dave Hansen. This results in no
runtime overhead in the mm subsystem other than comparing
the anon_name pointers when considering vma merging. The pointer
is stored in a union with fieds that are only used on file-backed
mappings, so it does not increase memory usage.
Includes fix from Jed Davis <jld@mozilla.com> for typo in
prctl_set_vma_anon_name, which could attempt to set the name
across two vmas at the same time due to a typo, which might
corrupt the vma list. Fix it to use tmp instead of end to limit
the name setting to a single vma at a time.
Change-Id: I9aa7b6b5ef536cd780599ba4e2fba8ceebe8b59f
Signed-off-by: Dmitry Shmidt <dimitrysh@google.com>
Make PR_SET_TIMERSLACK_PID consider pid namespace and resolve the
target pid in the caller's namespace. Otherwise, calls from pid
namespace other than init would fail or affect the wrong task.
Change-Id: I1da15196abc4096536713ce03714e99d2e63820a
Signed-off-by: Micha Kalfon <micha@cellrox.com>
Acked-by: Oren Laadan <orenl@cellrox.com>
The case clause for the PR_SET_TIMERSLACK_PID option was placed inside
the an internal switch statement for PR_MCE_KILL (see commits 37a591d4
and 8ae872f1) . This commit moves it to the right place.
Change-Id: I63251669d7e2f2aa843d1b0900e7df61518c3dea
Signed-off-by: Micha Kalfon <micha@cellrox.com>
Acked-by: Oren Laadan <orenl@cellrox.com>
Adds a capable() check to make sure that arbitary apps do not change the
timer slack for other apps.
Bug: 15000427
Change-Id: I558a2551a0e3579c7f7e7aae54b28aa9d982b209
Signed-off-by: Ruchi Kandoi <kandoiruchi@google.com>
Second argument is similar to PR_SET_TIMERSLACK, if non-zero then the
slack is set to that value otherwise sets it to the default for the thread.
Takes PID of the thread as the third argument.
This allows power/performance management software to set timer slack for
other threads according to its policy for the thread (such as when the
thread is designated foreground vs. background activity)
Change-Id: I744d451ff4e60dae69f38f53948ff36c51c14a3f
Signed-off-by: Ruchi Kandoi <kandoiruchi@google.com>
setpriority(PRIO_USER, 0, x) will change the priority of tasks outside of
the current pid namespace. This is in contrast to both the other modes of
setpriority and the example of kill(-1). Fix this. getpriority and
ioprio have the same failure mode, fix them too.
Eric said:
: After some more thinking about it this patch sounds justifiable.
:
: My goal with namespaces is not to build perfect isolation mechanisms
: as that can get into ill defined territory, but to build well defined
: mechanisms. And to handle the corner cases so you can use only
: a single namespace with well defined results.
:
: In this case you have found the two interfaces I am aware of that
: identify processes by uid instead of by pid. Which quite frankly is
: weird. Unfortunately the weird unexpected cases are hard to handle
: in the usual way.
:
: I was hoping for a little more information. Changes like this one we
: have to be careful of because someone might be depending on the current
: behavior. I don't think they are and I do think this make sense as part
: of the pid namespace.
Signed-off-by: Ben Segall <bsegall@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ambrose Feinstein <ambrose@google.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Today proc and sysfs do not contain any executable files. Several
applications today mount proc or sysfs without noexec and nosuid and
then depend on there being no exectuables files on proc or sysfs.
Having any executable files show on proc or sysfs would cause
a user space visible regression, and most likely security problems.
Therefore commit to never allowing executables on proc and sysfs by
adding a new flag to mark them as filesystems without executables and
enforce that flag.
Test the flag where MNT_NOEXEC is tested today, so that the only user
visible effect will be that exectuables will be treated as if the
execute bit is cleared.
The filesystems proc and sysfs do not currently incoporate any
executable files so this does not result in any user visible effects.
This makes it unnecessary to vet changes to proc and sysfs tightly for
adding exectuable files or changes to chattr that would modify
existing files, as no matter what the individual file say they will
not be treated as exectuable files by the vfs.
Not having to vet changes to closely is important as without this we
are only one proc_create call (or another goof up in the
implementation of notify_change) from having problematic executables
on proc. Those mistakes are all too easy to make and would create
a situation where there are security issues or the assumptions of
some program having to be broken (and cause userspace regressions).
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Individual prctl(PR_SET_MM_*) calls do some checking to maintain a
consistent view of mm->arg_start et al fields, but not enough. In
particular PR_SET_MM_ARG_START/PR_SET_MM_ARG_END/ R_SET_MM_ENV_START/
PR_SET_MM_ENV_END only check that the address lies in an existing VMA,
but don't check that the start address is lower than the end address _at
all_.
Consolidate all consistency checks, so there will be no difference in
the future between PR_SET_MM_MAP and individual PR_SET_MM_* calls.
The program below makes both ARGV and ENVP areas be reversed. It makes
/proc/$PID/cmdline show garbage (it doesn't oops by luck).
#include <sys/mman.h>
#include <sys/prctl.h>
#include <unistd.h>
enum {PAGE_SIZE=4096};
int main(void)
{
void *p;
p = mmap(NULL, PAGE_SIZE, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
#define PR_SET_MM 35
#define PR_SET_MM_ARG_START 8
#define PR_SET_MM_ARG_END 9
#define PR_SET_MM_ENV_START 10
#define PR_SET_MM_ENV_END 11
prctl(PR_SET_MM, PR_SET_MM_ARG_START, (unsigned long)p + PAGE_SIZE - 1, 0, 0);
prctl(PR_SET_MM, PR_SET_MM_ARG_END, (unsigned long)p, 0, 0);
prctl(PR_SET_MM, PR_SET_MM_ENV_START, (unsigned long)p + PAGE_SIZE - 1, 0, 0);
prctl(PR_SET_MM, PR_SET_MM_ENV_END, (unsigned long)p, 0, 0);
pause();
return 0;
}
[akpm@linux-foundation.org: tidy code, tweak comment]
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Jarod Wilson <jarod@redhat.com>
Cc: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The MPX code can only work on the current task. You can not,
for instance, enable MPX management in another process or
thread. You can also not handle a fault for another process or
thread.
Despite this, we pass a task_struct around prolifically. This
patch removes all of the task struct passing for code paths
where the code can not deal with another task (which turns out
to be all of them).
This has no functional changes. It's just a cleanup.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave@sr71.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: bp@alien8.de
Link: http://lkml.kernel.org/r/20150607183702.6A81DA2C@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Oleg cleverly suggested using xchg() to set the new mm->exe_file instead
of calling set_mm_exe_file() which requires some form of serialization --
mmap_sem in this case. For archs that do not have atomic rmw instructions
we still fallback to a spinlock alternative, so this should always be
safe. As such, we only need the mmap_sem for looking up the backing
vm_file, which can be done sharing the lock. Naturally, this means we
need to manually deal with both the new and old file reference counting,
and we need not worry about the MMF_EXE_FILE_CHANGED bits, which can
probably be deleted in the future anyway.
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are a lot of embedded systems that run most or all of their
functionality in init, running as root:root. For these systems,
supporting multiple users is not necessary.
This patch adds a new symbol, CONFIG_MULTIUSER, that makes support for
non-root users, non-root groups, and capabilities optional. It is enabled
under CONFIG_EXPERT menu.
When this symbol is not defined, UID and GID are zero in any possible case
and processes always have all capabilities.
The following syscalls are compiled out: setuid, setregid, setgid,
setreuid, setresuid, getresuid, setresgid, getresgid, setgroups,
getgroups, setfsuid, setfsgid, capget, capset.
Also, groups.c is compiled out completely.
In kernel/capability.c, capable function was moved in order to avoid
adding two ifdef blocks.
This change saves about 25 KB on a defconfig build. The most minimal
kernels have total text sizes in the high hundreds of kB rather than
low MB. (The 25k goes down a bit with allnoconfig, but not that much.
The kernel was booted in Qemu. All the common functionalities work.
Adding users/groups is not possible, failing with -ENOSYS.
Bloat-o-meter output:
add/remove: 7/87 grow/shrink: 19/397 up/down: 1675/-26325 (-24650)
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Iulia Manda <iulia.manda21@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There's a uname workaround for broken userspace which can't handle kernel
versions of 3.x. Update it for 4.x.
Signed-off-by: Jon DeVree <nuxi@vault24.org>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull MIPS updates from Ralf Baechle:
"This is the main pull request for MIPS:
- a number of fixes that didn't make the 3.19 release.
- a number of cleanups.
- preliminary support for Cavium's Octeon 3 SOCs which feature up to
48 MIPS64 R3 cores with FPU and hardware virtualization.
- support for MIPS R6 processors.
Revision 6 of the MIPS architecture is a major revision of the MIPS
architecture which does away with many of original sins of the
architecture such as branch delay slots. This and other changes in
R6 require major changes throughout the entire MIPS core
architecture code and make up for the lion share of this pull
request.
- finally some preparatory work for eXtendend Physical Address
support, which allows support of up to 40 bit of physical address
space on 32 bit processors"
[ Ahh, MIPS can't leave the PAE brain damage alone. It's like
every CPU architect has to make that mistake, but pee in the snow
by changing the TLA. But whether it's called PAE, LPAE or XPA,
it's horrid crud - Linus ]
* 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus: (114 commits)
MIPS: sead3: Corrected get_c0_perfcount_int
MIPS: mm: Remove dead macro definitions
MIPS: OCTEON: irq: add CIB and other fixes
MIPS: OCTEON: Don't do acknowledge operations for level triggered irqs.
MIPS: OCTEON: More OCTEONIII support
MIPS: OCTEON: Remove setting of processor specific CVMCTL icache bits.
MIPS: OCTEON: Core-15169 Workaround and general CVMSEG cleanup.
MIPS: OCTEON: Update octeon-model.h code for new SoCs.
MIPS: OCTEON: Implement DCache errata workaround for all CN6XXX
MIPS: OCTEON: Add little-endian support to asm/octeon/octeon.h
MIPS: OCTEON: Implement the core-16057 workaround
MIPS: OCTEON: Delete unused COP2 saving code
MIPS: OCTEON: Use correct instruction to read 64-bit COP0 register
MIPS: OCTEON: Save and restore CP2 SHA3 state
MIPS: OCTEON: Fix FP context save.
MIPS: OCTEON: Save/Restore wider multiply registers in OCTEON III CPUs
MIPS: boot: Provide more uImage options
MIPS: Remove unneeded #ifdef __KERNEL__ from asm/processor.h
MIPS: ip22-gio: Remove legacy suspend/resume support
mips: pci: Add ifdef around pci_proc_domain
...
Userland code may be built using an ABI which permits linking to objects
that have more restrictive floating point requirements. For example,
userland code may be built to target the O32 FPXX ABI. Such code may be
linked with other FPXX code, or code built for either one of the more
restrictive FP32 or FP64. When linking with more restrictive code, the
overall requirement of the process becomes that of the more restrictive
code. The kernel has no way to know in advance which mode the process
will need to be executed in, and indeed it may need to change during
execution. The dynamic loader is the only code which will know the
overall required mode, and so it needs to have a means to instruct the
kernel to switch the FP mode of the process.
This patch introduces 2 new options to the prctl syscall which provide
such a capability. The FP mode of the process is represented as a
simple bitmask combining a number of mode bits mirroring those present
in the hardware. Userland can either retrieve the current FP mode of
the process:
mode = prctl(PR_GET_FP_MODE);
or modify the current FP mode of the process:
err = prctl(PR_SET_FP_MODE, new_mode);
Signed-off-by: Paul Burton <paul.burton@imgtec.com>
Cc: Matthew Fortune <matthew.fortune@imgtec.com>
Cc: Markos Chandras <markos.chandras@imgtec.com>
Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/8899/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Description from Michael Kerrisk. He suggested an identical patch
to one I had already coded up and tested.
commit fe3d197f84 "x86, mpx: On-demand kernel allocation of bounds
tables" added two new prctl() operations, PR_MPX_ENABLE_MANAGEMENT and
PR_MPX_DISABLE_MANAGEMENT. However, no checks were included to ensure
that unused arguments are zero, as is done in many existing prctl()s
and as should be done for all new prctl()s. This patch adds the
required checks.
Suggested-by: Andy Lutomirski <luto@amacapital.net>
Suggested-by: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Hansen <dave@sr71.net>
Link: http://lkml.kernel.org/r/20150108223022.7F56FD13@viggo.jf.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This is really the meat of the MPX patch set. If there is one patch to
review in the entire series, this is the one. There is a new ABI here
and this kernel code also interacts with userspace memory in a
relatively unusual manner. (small FAQ below).
Long Description:
This patch adds two prctl() commands to provide enable or disable the
management of bounds tables in kernel, including on-demand kernel
allocation (See the patch "on-demand kernel allocation of bounds tables")
and cleanup (See the patch "cleanup unused bound tables"). Applications
do not strictly need the kernel to manage bounds tables and we expect
some applications to use MPX without taking advantage of this kernel
support. This means the kernel can not simply infer whether an application
needs bounds table management from the MPX registers. The prctl() is an
explicit signal from userspace.
PR_MPX_ENABLE_MANAGEMENT is meant to be a signal from userspace to
require kernel's help in managing bounds tables.
PR_MPX_DISABLE_MANAGEMENT is the opposite, meaning that userspace don't
want kernel's help any more. With PR_MPX_DISABLE_MANAGEMENT, the kernel
won't allocate and free bounds tables even if the CPU supports MPX.
PR_MPX_ENABLE_MANAGEMENT will fetch the base address of the bounds
directory out of a userspace register (bndcfgu) and then cache it into
a new field (->bd_addr) in the 'mm_struct'. PR_MPX_DISABLE_MANAGEMENT
will set "bd_addr" to an invalid address. Using this scheme, we can
use "bd_addr" to determine whether the management of bounds tables in
kernel is enabled.
Also, the only way to access that bndcfgu register is via an xsaves,
which can be expensive. Caching "bd_addr" like this also helps reduce
the cost of those xsaves when doing table cleanup at munmap() time.
Unfortunately, we can not apply this optimization to #BR fault time
because we need an xsave to get the value of BNDSTATUS.
==== Why does the hardware even have these Bounds Tables? ====
MPX only has 4 hardware registers for storing bounds information.
If MPX-enabled code needs more than these 4 registers, it needs to
spill them somewhere. It has two special instructions for this
which allow the bounds to be moved between the bounds registers
and some new "bounds tables".
They are similar conceptually to a page fault and will be raised by
the MPX hardware during both bounds violations or when the tables
are not present. This patch handles those #BR exceptions for
not-present tables by carving the space out of the normal processes
address space (essentially calling the new mmap() interface indroduced
earlier in this patch set.) and then pointing the bounds-directory
over to it.
The tables *need* to be accessed and controlled by userspace because
the instructions for moving bounds in and out of them are extremely
frequent. They potentially happen every time a register pointing to
memory is dereferenced. Any direct kernel involvement (like a syscall)
to access the tables would obviously destroy performance.
==== Why not do this in userspace? ====
This patch is obviously doing this allocation in the kernel.
However, MPX does not strictly *require* anything in the kernel.
It can theoretically be done completely from userspace. Here are
a few ways this *could* be done. I don't think any of them are
practical in the real-world, but here they are.
Q: Can virtual space simply be reserved for the bounds tables so
that we never have to allocate them?
A: As noted earlier, these tables are *HUGE*. An X-GB virtual
area needs 4*X GB of virtual space, plus 2GB for the bounds
directory. If we were to preallocate them for the 128TB of
user virtual address space, we would need to reserve 512TB+2GB,
which is larger than the entire virtual address space today.
This means they can not be reserved ahead of time. Also, a
single process's pre-popualated bounds directory consumes 2GB
of virtual *AND* physical memory. IOW, it's completely
infeasible to prepopulate bounds directories.
Q: Can we preallocate bounds table space at the same time memory
is allocated which might contain pointers that might eventually
need bounds tables?
A: This would work if we could hook the site of each and every
memory allocation syscall. This can be done for small,
constrained applications. But, it isn't practical at a larger
scale since a given app has no way of controlling how all the
parts of the app might allocate memory (think libraries). The
kernel is really the only place to intercept these calls.
Q: Could a bounds fault be handed to userspace and the tables
allocated there in a signal handler instead of in the kernel?
A: (thanks to tglx) mmap() is not on the list of safe async
handler functions and even if mmap() would work it still
requires locking or nasty tricks to keep track of the
allocation state there.
Having ruled out all of the userspace-only approaches for managing
bounds tables that we could think of, we create them on demand in
the kernel.
Based-on-patch-by: Qiaowei Ren <qiaowei.ren@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: linux-mm@kvack.org
Cc: linux-mips@linux-mips.org
Cc: Dave Hansen <dave@sr71.net>
Link: http://lkml.kernel.org/r/20141114151829.AD4310DE@viggo.jf.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Pull scheduler updates from Ingo Molnar:
"The main changes in this cycle were:
- Optimized support for Intel "Cluster-on-Die" (CoD) topologies (Dave
Hansen)
- Various sched/idle refinements for better idle handling (Nicolas
Pitre, Daniel Lezcano, Chuansheng Liu, Vincent Guittot)
- sched/numa updates and optimizations (Rik van Riel)
- sysbench speedup (Vincent Guittot)
- capacity calculation cleanups/refactoring (Vincent Guittot)
- Various cleanups to thread group iteration (Oleg Nesterov)
- Double-rq-lock removal optimization and various refactorings
(Kirill Tkhai)
- various sched/deadline fixes
... and lots of other changes"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (72 commits)
sched/dl: Use dl_bw_of() under rcu_read_lock_sched()
sched/fair: Delete resched_cpu() from idle_balance()
sched, time: Fix build error with 64 bit cputime_t on 32 bit systems
sched: Improve sysbench performance by fixing spurious active migration
sched/x86: Fix up typo in topology detection
x86, sched: Add new topology for multi-NUMA-node CPUs
sched/rt: Use resched_curr() in task_tick_rt()
sched: Use rq->rd in sched_setaffinity() under RCU read lock
sched: cleanup: Rename 'out_unlock' to 'out_free_new_mask'
sched: Use dl_bw_of() under RCU read lock
sched/fair: Remove duplicate code from can_migrate_task()
sched, mips, ia64: Remove __ARCH_WANT_UNLOCKED_CTXSW
sched: print_rq(): Don't use tasklist_lock
sched: normalize_rt_tasks(): Don't use _irqsave for tasklist_lock, use task_rq_lock()
sched: Fix the task-group check in tg_has_rt_tasks()
sched/fair: Leverage the idle state info when choosing the "idlest" cpu
sched: Let the scheduler see CPU idle states
sched/deadline: Fix inter- exclusive cpusets migrations
sched/deadline: Clear dl_entity params when setscheduling to different class
sched/numa: Kill the wrong/dead TASK_DEAD check in task_numa_fault()
...
Fix undefined behavior and compiler warning by replacing right shift 32
with upper_32_bits macro
Signed-off-by: Scotty Bauer <sbauer@eng.utah.edu>
Cc: Clemens Ladisch <clemens@ladisch.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix minor errors and warning messages in kernel/sys.c. These errors were
reported by checkpatch while working with some modifications in sys.c
file. Fixing this first will help me to improve my further patches.
ERROR: trailing whitespace - 9
ERROR: do not use assignment in if condition - 4
ERROR: spaces required around that '?' (ctx:VxO) - 10
ERROR: switch and case should be at the same indent - 3
total 26 errors & 3 warnings fixed.
Signed-off-by: vishnu.ps <vishnu.ps@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dump the contents of the relevant struct_mm when we hit the bug condition.
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
During development of c/r we've noticed that in case if we need to support
user namespaces we face a problem with capabilities in prctl(PR_SET_MM,
...) call, in particular once new user namespace is created
capable(CAP_SYS_RESOURCE) no longer passes.
A approach is to eliminate CAP_SYS_RESOURCE check but pass all new values
in one bundle, which would allow the kernel to make more intensive test
for sanity of values and same time allow us to support checkpoint/restore
of user namespaces.
Thus a new command PR_SET_MM_MAP introduced. It takes a pointer of
prctl_mm_map structure which carries all the members to be updated.
prctl(PR_SET_MM, PR_SET_MM_MAP, struct prctl_mm_map *, size)
struct prctl_mm_map {
__u64 start_code;
__u64 end_code;
__u64 start_data;
__u64 end_data;
__u64 start_brk;
__u64 brk;
__u64 start_stack;
__u64 arg_start;
__u64 arg_end;
__u64 env_start;
__u64 env_end;
__u64 *auxv;
__u32 auxv_size;
__u32 exe_fd;
};
All members except @exe_fd correspond ones of struct mm_struct. To figure
out which available values these members may take here are meanings of the
members.
- start_code, end_code: represent bounds of executable code area
- start_data, end_data: represent bounds of data area
- start_brk, brk: used to calculate bounds for brk() syscall
- start_stack: used when accounting space needed for command
line arguments, environment and shmat() syscall
- arg_start, arg_end, env_start, env_end: represent memory area
supplied for command line arguments and environment variables
- auxv, auxv_size: carries auxiliary vector, Elf format specifics
- exe_fd: file descriptor number for executable link (/proc/self/exe)
Thus we apply the following requirements to the values
1) Any member except @auxv, @auxv_size, @exe_fd is rather an address
in user space thus it must be laying inside [mmap_min_addr, mmap_max_addr)
interval.
2) While @[start|end]_code and @[start|end]_data may point to an nonexisting
VMAs (say a program maps own new .text and .data segments during execution)
the rest of members should belong to VMA which must exist.
3) Addresses must be ordered, ie @start_ member must not be greater or
equal to appropriate @end_ member.
4) As in regular Elf loading procedure we require that @start_brk and
@brk be greater than @end_data.
5) If RLIMIT_DATA rlimit is set to non-infinity new values should not
exceed existing limit. Same applies to RLIMIT_STACK.
6) Auxiliary vector size must not exceed existing one (which is
predefined as AT_VECTOR_SIZE and depends on architecture).
7) File descriptor passed in @exe_file should be pointing
to executable file (because we use existing prctl_set_mm_exe_file_locked
helper it ensures that the file we are going to use as exe link has all
required permission granted).
Now about where these members are involved inside kernel code:
- @start_code and @end_code are used in /proc/$pid/[stat|statm] output;
- @start_data and @end_data are used in /proc/$pid/[stat|statm] output,
also they are considered if there enough space for brk() syscall
result if RLIMIT_DATA is set;
- @start_brk shown in /proc/$pid/stat output and accounted in brk()
syscall if RLIMIT_DATA is set; also this member is tested to
find a symbolic name of mmap event for perf system (we choose
if event is generated for "heap" area); one more aplication is
selinux -- we test if a process has PROCESS__EXECHEAP permission
if trying to make heap area being executable with mprotect() syscall;
- @brk is a current value for brk() syscall which lays inside heap
area, it's shown in /proc/$pid/stat. When syscall brk() succesfully
provides new memory area to a user space upon brk() completion the
mm::brk is updated to carry new value;
Both @start_brk and @brk are actively used in /proc/$pid/maps
and /proc/$pid/smaps output to find a symbolic name "heap" for
VMA being scanned;
- @start_stack is printed out in /proc/$pid/stat and used to
find a symbolic name "stack" for task and threads in
/proc/$pid/maps and /proc/$pid/smaps output, and as the same
as with @start_brk -- perf system uses it for event naming.
Also kernel treat this member as a start address of where
to map vDSO pages and to check if there is enough space
for shmat() syscall;
- @arg_start, @arg_end, @env_start and @env_end are printed out
in /proc/$pid/stat. Another access to the data these members
represent is to read /proc/$pid/environ or /proc/$pid/cmdline.
Any attempt to read these areas kernel tests with access_process_vm
helper so a user must have enough rights for this action;
- @auxv and @auxv_size may be read from /proc/$pid/auxv. Strictly
speaking kernel doesn't care much about which exactly data is
sitting there because it is solely for userspace;
- @exe_fd is referred from /proc/$pid/exe and when generating
coredump. We uses prctl_set_mm_exe_file_locked helper to update
this member, so exe-file link modification remains one-shot
action.
Still note that updating exe-file link now doesn't require sys-resource
capability anymore, after all there is no much profit in preventing setup
own file link (there are a number of ways to execute own code -- ptrace,
ld-preload, so that the only reliable way to find which exactly code is
executed is to inspect running program memory). Still we require the
caller to be at least user-namespace root user.
I believe the old interface should be deprecated and ripped off in a
couple of kernel releases if no one against.
To test if new interface is implemented in the kernel one can pass
PR_SET_MM_MAP_SIZE opcode and the kernel returns the size of currently
supported struct prctl_mm_map.
[akpm@linux-foundation.org: fix 80-col wordwrap in macro definitions]
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: Andrew Vagin <avagin@openvz.org>
Tested-by: Andrew Vagin <avagin@openvz.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Vasiliy Kulikov <segoon@openwall.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Julien Tinnes <jln@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Instead of taking mm->mmap_sem inside prctl_set_mm_exe_file() move it out
and rename the helper to prctl_set_mm_exe_file_locked(). This will allow
to reuse this function in a next patch.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andrew Vagin <avagin@openvz.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Vasiliy Kulikov <segoon@openwall.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Julien Tinnes <jln@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Both times() and clock_gettime(CLOCK_PROCESS_CPUTIME_ID) have scalability
issues on large systems, due to both functions being serialized with a
lock.
The lock protects against reporting a wrong value, due to a thread in the
task group exiting, its statistics reporting up to the signal struct, and
that exited task's statistics being counted twice (or not at all).
Protecting that with a lock results in times() and clock_gettime() being
completely serialized on large systems.
This can be fixed by using a seqlock around the events that gather and
propagate statistics. As an additional benefit, the protection code can
be moved into thread_group_cputime(), slightly simplifying the calling
functions.
In the case of posix_cpu_clock_get_task() things can be simplified a
lot, because the calling function already ensures that the task sticks
around, and the rest is now taken care of in thread_group_cputime().
This way the statistics reporting code can run lockless.
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alex Thorlton <athorlton@sgi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Daeseok Youn <daeseok.youn@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Guillaume Morin <guillaume@morinfr.org>
Cc: Ionut Alexa <ionut.m.alexa@gmail.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Michal Schmidt <mschmidt@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: umgwanakikbuti@gmail.com
Cc: fweisbec@gmail.com
Cc: srao@redhat.com
Cc: lwoodman@redhat.com
Cc: atheurer@redhat.com
Link: http://lkml.kernel.org/r/20140816134010.26a9b572@annuminas.surriel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Since seccomp transitions between threads requires updates to the
no_new_privs flag to be atomic, the flag must be part of an atomic flag
set. This moves the nnp flag into a separate task field, and introduces
accessors.
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Andy Lutomirski <luto@amacapital.net>
Add VM_INIT_DEF_MASK, to allow us to set the default flags for VMs. It
also adds a prctl control which allows us to set the THP disable bit in
mm->def_flags so that VMs will pick up the setting as they are created.
Signed-off-by: Alex Thorlton <athorlton@sgi.com>
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We can kill either task->did_exec or PF_FORKNOEXEC, they are mutually
exclusive. The patch kills ->did_exec because it has a single user.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 15d94b8256 ("reboot: move shutdown/reboot related functions to
kernel/reboot.c") moved all kexec-related functionality to
kernel/reboot.c, so kernel/sys.c no longer needs to include
<linux/kexec.h>.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Robin Holt <holt@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
nsown_capable is a special case of ns_capable essentially for just CAP_SETUID and
CAP_SETGID. For the existing users it doesn't noticably simplify things and
from the suggested patches I have seen it encourages people to do the wrong
thing. So remove nsown_capable.
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
This patch is preparatory. It moves reboot related syscall, etc
functions from kernel/sys.c to kernel/reboot.c.
Signed-off-by: Robin Holt <holt@sgi.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Russ Anderson <rja@sgi.com>
Cc: Robin Holt <holt@sgi.com>
Cc: Russell King <rmk+kernel@arm.linux.org.uk>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove the prior patch's #define for easier backporting to the stable
releases.
Signed-off-by: Robin Holt <holt@sgi.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Russ Anderson <rja@sgi.com>
Cc: Robin Holt <holt@sgi.com>
Cc: Russell King <rmk+kernel@arm.linux.org.uk>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move __set_special_pids() from exit.c to sys.c close to its single caller
and make it static.
And rename it to set_special_pids(), another helper with this name has
gone away.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change do_sysinfo() to use get_monotonic_boottime() instead of
do_posix_clock_monotonic_gettime() + monotonic_to_bootbased().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: John Stultz <johnstul@us.ibm.com>
Cc: Tomas Janousek <tjanouse@redhat.com>
Cc: Tomas Smetana <tsmetana@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If LINUX_REBOOT_CMD_HALT for reboot failed, the message "cannot halt" will
stay on the same line with the next message, so append a '\n'.
Signed-off-by: liguang <lig.fnst@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We recently noticed that reboot of a 1024 cpu machine takes approx 16
minutes of just stopping the cpus. The slowdown was tracked to commit
f96972f2dc ("kernel/sys.c: call disable_nonboot_cpus() in
kernel_restart()").
The current implementation does all the work of hot removing the cpus
before halting the system. We are switching to just migrating to the
boot cpu and then continuing with shutdown/reboot.
This also has the effect of not breaking x86's command line parameter
for specifying the reboot cpu. Note, this code was shamelessly copied
from arch/x86/kernel/reboot.c with bits removed pertaining to the
reboot_cpu command line parameter.
Signed-off-by: Robin Holt <holt@sgi.com>
Tested-by: Shawn Guo <shawn.guo@linaro.org>
Cc: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Russ Anderson <rja@sgi.com>
Cc: Robin Holt <holt@sgi.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull compat cleanup from Al Viro:
"Mostly about syscall wrappers this time; there will be another pile
with patches in the same general area from various people, but I'd
rather push those after both that and vfs.git pile are in."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
syscalls.h: slightly reduce the jungles of macros
get rid of union semop in sys_semctl(2) arguments
make do_mremap() static
sparc: no need to sign-extend in sync_file_range() wrapper
ppc compat wrappers for add_key(2) and request_key(2) are pointless
x86: trim sys_ia32.h
x86: sys32_kill and sys32_mprotect are pointless
get rid of compat_sys_semctl() and friends in case of ARCH_WANT_OLD_COMPAT_IPC
merge compat sys_ipc instances
consolidate compat lookup_dcookie()
convert vmsplice to COMPAT_SYSCALL_DEFINE
switch getrusage() to COMPAT_SYSCALL_DEFINE
switch epoll_pwait to COMPAT_SYSCALL_DEFINE
convert sendfile{,64} to COMPAT_SYSCALL_DEFINE
switch signalfd{,4}() to COMPAT_SYSCALL_DEFINE
make SYSCALL_DEFINE<n>-generated wrappers do asmlinkage_protect
make HAVE_SYSCALL_WRAPPERS unconditional
consolidate cond_syscall and SYSCALL_ALIAS declarations
teach SYSCALL_DEFINE<n> how to deal with long long/unsigned long long
get rid of duplicate logics in __SC_....[1-6] definitions
The purpose of this patch is to allow privileged processes to set
their own per-memory memory-region fields:
start_code, end_code, start_data, end_data, start_brk, brk,
start_stack, arg_start, arg_end, env_start, env_end.
This functionality is needed by any application or package that needs to
reconstruct Linux processes, that is, to start them in any way other than
by means of an "execve()" from an executable file. This includes:
1. Restoring processes from a checkpoint-file (by all potential
user-level checkpointing packages, not only CRIU's).
2. Restarting processes on another node after process migration.
3. Starting duplicated copies of a running process (for reliability
and high-availablity).
4. Starting a process from an executable format that is not supported
by Linux, thus requiring a "manual execve" by a user-level utility.
5. Similarly, starting a process from a networked and/or crypted
executable that, for confidentiality, licensing or other reasons,
may not be written to the local file-systems.
The code that does that was already included in the Linux kernel by the
CRIU group, in the form of "prctl(PR_SET_MM)", but prior to this was
enclosed within their private "#ifdef CONFIG_CHECKPOINT_RESTORE", which is
normally disabled. The patch removes those ifdefs.
Signed-off-by: Amnon Shiloh <u3557@miso.sublimeip.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>