Pull scheduler changes from Ingo Molnar:
"Main changes:
- scheduler side full-dynticks (user-space execution is undisturbed
and receives no timer IRQs) preparation changes that convert the
cputime accounting code to be full-dynticks ready, from Frederic
Weisbecker.
- Initial sched.h split-up changes, by Clark Williams
- select_idle_sibling() performance improvement by Mike Galbraith:
" 1 tbench pair (worst case) in a 10 core + SMT package:
pre 15.22 MB/sec 1 procs
post 252.01 MB/sec 1 procs "
- sched_rr_get_interval() ABI fix/change. We think this detail is not
used by apps (so it's not an ABI in practice), but lets keep it
under observation.
- misc RT scheduling cleanups, optimizations"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
sched/rt: Add <linux/sched/rt.h> header to <linux/init_task.h>
cputime: Remove irqsave from seqlock readers
sched, powerpc: Fix sched.h split-up build failure
cputime: Restore CPU_ACCOUNTING config defaults for PPC64
sched/rt: Move rt specific bits into new header file
sched/rt: Add a tuning knob to allow changing SCHED_RR timeslice
sched: Move sched.h sysctl bits into separate header
sched: Fix signedness bug in yield_to()
sched: Fix select_idle_sibling() bouncing cow syndrome
sched/rt: Further simplify pick_rt_task()
sched/rt: Do not account zero delta_exec in update_curr_rt()
cputime: Safely read cputime of full dynticks CPUs
kvm: Prepare to add generic guest entry/exit callbacks
cputime: Use accessors to read task cputime stats
cputime: Allow dynamic switch between tick/virtual based cputime accounting
cputime: Generic on-demand virtual cputime accounting
cputime: Move default nsecs_to_cputime() to jiffies based cputime file
cputime: Librarize per nsecs resolution cputime definitions
cputime: Avoid multiplication overflow on utime scaling
context_tracking: Export context state for generic vtime
...
Fix up conflict in kernel/context_tracking.c due to comment additions.
Pull perf changes from Ingo Molnar:
"There are lots of improvements, the biggest changes are:
Main kernel side changes:
- Improve uprobes performance by adding 'pre-filtering' support, by
Oleg Nesterov.
- Make some POWER7 events available in sysfs, equivalent to what was
done on x86, from Sukadev Bhattiprolu.
- tracing updates by Steve Rostedt - mostly misc fixes and smaller
improvements.
- Use perf/event tracing to report PCI Express advanced errors, by
Tony Luck.
- Enable northbridge performance counters on AMD family 15h, by Jacob
Shin.
- This tracing commit:
tracing: Remove the extra 4 bytes of padding in events
changes the ABI. All involved parties (PowerTop in particular)
seem to agree that it's safe to do now with the introduction of
libtraceevent, but the devil is in the details ...
Main tooling side changes:
- Add 'event group view', from Namyung Kim:
To use it, 'perf record' should group events when recording. And
then perf report parses the saved group relation from file header
and prints them together if --group option is provided. You can
use the 'perf evlist' command to see event group information:
$ perf record -e '{ref-cycles,cycles}' noploop 1
[ perf record: Woken up 2 times to write data ]
[ perf record: Captured and wrote 0.385 MB perf.data (~16807 samples) ]
$ perf evlist --group
{ref-cycles,cycles}
With this example, default perf report will show you each event
separately.
You can use --group option to enable event group view:
$ perf report --group
...
# group: {ref-cycles,cycles}
# ========
# Samples: 7K of event 'anon group { ref-cycles, cycles }'
# Event count (approx.): 6876107743
#
# Overhead Command Shared Object Symbol
# ................ ....... ................. ..........................
99.84% 99.76% noploop noploop [.] main
0.07% 0.00% noploop ld-2.15.so [.] strcmp
0.03% 0.00% noploop [kernel.kallsyms] [k] timerqueue_del
0.03% 0.03% noploop [kernel.kallsyms] [k] sched_clock_cpu
0.02% 0.00% noploop [kernel.kallsyms] [k] account_user_time
0.01% 0.00% noploop [kernel.kallsyms] [k] __alloc_pages_nodemask
0.00% 0.00% noploop [kernel.kallsyms] [k] native_write_msr_safe
0.00% 0.11% noploop [kernel.kallsyms] [k] _raw_spin_lock
0.00% 0.06% noploop [kernel.kallsyms] [k] find_get_page
0.00% 0.02% noploop [kernel.kallsyms] [k] rcu_check_callbacks
0.00% 0.02% noploop [kernel.kallsyms] [k] __current_kernel_time
As you can see the Overhead column now contains both of ref-cycles
and cycles and header line shows group information also - 'anon
group { ref-cycles, cycles }'. The output is sorted by period of
group leader first.
- Initial GTK+ annotate browser, from Namhyung Kim.
- Add option for runtime switching perf data file in perf report,
just press 's' and a menu with the valid files found in the current
directory will be presented, from Feng Tang.
- Add support to display whole group data for raw columns, from Jiri
Olsa.
- Add per processor socket count aggregation in perf stat, from
Stephane Eranian.
- Add interval printing in 'perf stat', from Stephane Eranian.
- 'perf test' improvements
- Add support for wildcards in tracepoint system name, from Jiri
Olsa.
- Add anonymous huge page recognition, from Joshua Zhu.
- perf build-id cache now can show DSOs present in a perf.data file
that are not in the cache, to integrate with build-id servers being
put in place by organizations such as Fedora.
- perf top now shares more of the evsel config/creation routines with
'record', paving the way for further integration like 'top'
snapshots, etc.
- perf top now supports DWARF callchains.
- Fix mmap limitations on 32-bit, fix from David Miller.
- 'perf bench numa mem' NUMA performance measurement suite
- ... and lots of fixes, performance improvements, cleanups and other
improvements I failed to list - see the shortlog and git log for
details."
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (270 commits)
perf/x86/amd: Enable northbridge performance counters on AMD family 15h
perf/hwbp: Fix cleanup in case of kzalloc failure
perf tools: Fix build with bison 2.3 and older.
perf tools: Limit unwind support to x86 archs
perf annotate: Make it to be able to skip unannotatable symbols
perf gtk/annotate: Fail early if it can't annotate
perf gtk/annotate: Show source lines with gray color
perf gtk/annotate: Support multiple event annotation
perf ui/gtk: Implement basic GTK2 annotation browser
perf annotate: Fix warning message on a missing vmlinux
perf buildid-cache: Add --update option
uprobes/perf: Avoid uprobe_apply() whenever possible
uprobes/perf: Teach trace_uprobe/perf code to use UPROBE_HANDLER_REMOVE
uprobes/perf: Teach trace_uprobe/perf code to pre-filter
uprobes/perf: Teach trace_uprobe/perf code to track the active perf_event's
uprobes: Introduce uprobe_apply()
perf: Introduce hw_perf_event->tp_target and ->tp_list
uprobes/perf: Always increment trace_uprobe->nhit
uprobes/tracing: Kill uprobe_trace_consumer, embed uprobe_consumer into trace_uprobe
uprobes/tracing: Introduce is_trace_uprobe_enabled()
...
On AMD family 15h processors, there are 4 new performance
counters (in addition to 6 core performance counters) that can
be used for counting northbridge events (i.e. DRAM accesses).
Their bit fields are almost identical to the core performance
counters. However, unlike the core performance counters, these
MSRs are shared between multiple cores (that share the same
northbridge).
We will reuse the same code path as existing family 10h
northbridge event constraints handler logic to enforce
this sharing.
Signed-off-by: Jacob Shin <jacob.shin@amd.com>
Acked-by: Stephane Eranian <eranian@google.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Jacob Shin <jacob.shin@amd.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1360171589-6381-7-git-send-email-jacob.shin@amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When a HP ProLiant DL980 G7 Server boots a regular kernel,
there will be intermittent lost interrupts which could
result in a hang or (in extreme cases) data loss.
The reason is that this system only supports x2apic physical
mode, while the kernel boots with a logical-cluster default
setting.
This bug can be worked around by specifying the "x2apic_phys" or
"nox2apic" boot option, but we want to handle this system
without requiring manual workarounds.
The BIOS sets ACPI_FADT_APIC_PHYSICAL in FADT table.
As all apicids are smaller than 255, BIOS need to pass the
control to the OS with xapic mode, according to x2apic-spec,
chapter 2.9.
Current code handle x2apic when BIOS pass with xapic mode
enabled:
When user specifies x2apic_phys, or FADT indicates PHYSICAL:
1. During madt oem check, apic driver is set with xapic logical
or xapic phys driver at first.
2. enable_IR_x2apic() will enable x2apic_mode.
3. if user specifies x2apic_phys on the boot line, x2apic_phys_probe()
will install the correct x2apic phys driver and use x2apic phys mode.
Otherwise it will skip the driver will let x2apic_cluster_probe to
take over to install x2apic cluster driver (wrong one) even though FADT
indicates PHYSICAL, because x2apic_phys_probe does not check
FADT PHYSICAL.
Add checking x2apic_fadt_phys in x2apic_phys_probe() to fix the
problem.
Signed-off-by: Stoney Wang <song-bo.wang@hp.com>
[ updated the changelog and simplified the code ]
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: stable@kernel.org
Link: http://lkml.kernel.org/r/1360263182-16226-1-git-send-email-yinghai@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Change handle_swbp() to set regs->ip = bp_vaddr in advance, this is
what consumer->handler() needs but uprobe_get_swbp_addr() is not
exported.
This also simplifies the code and makes it more consistent across
the supported architectures. handle_swbp() becomes the only caller
of uprobe_get_swbp_addr().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
__skip_sstep() doesn't update regs->ip. Currently this is correct
but only "by accident" and it doesn't skip the whole insn. Change
it to advance ->ip by the length of the detected 0x66*0x90 sequence.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Similar to config_base and event_base, allow architecture
specific RDPMC ECX values.
Signed-off-by: Jacob Shin <jacob.shin@amd.com>
Acked-by: Stephane Eranian <eranian@google.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1360171589-6381-6-git-send-email-jacob.shin@amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Move counter index to MSR address offset calculation to
architecture specific files. This prepares the way for
perf_event_amd to enable counter addresses that are not
contiguous -- for example AMD Family 15h processors have 6 core
performance counters starting at 0xc0010200 and 4 northbridge
performance counters starting at 0xc0010240.
Signed-off-by: Jacob Shin <jacob.shin@amd.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Cc: Stephane Eranian <eranian@google.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1360171589-6381-5-git-send-email-jacob.shin@amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Update these AMD bit field names to be consistent with naming
convention followed by the rest of the file.
Signed-off-by: Jacob Shin <jacob.shin@amd.com>
Acked-by: Stephane Eranian <eranian@google.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1360171589-6381-4-git-send-email-jacob.shin@amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Generalize northbridge constraints code for family 10h so that
later we can reuse the same code path with other AMD processor
families that have the same northbridge event constraints.
Signed-off-by: Robert Richter <rric@kernel.org>
Signed-off-by: Jacob Shin <jacob.shin@amd.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Cc: Stephane Eranian <eranian@google.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1360171589-6381-3-git-send-email-jacob.shin@amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Typical cputime stats infrastructure relies on the timer tick and
its periodic polling on the CPU to account the amount of time
spent by the CPUs and the tasks per high level domains such as
userspace, kernelspace, guest, ...
Now we are preparing to implement full dynticks capability on
Linux for Real Time and HPC users who want full CPU isolation.
This feature requires a cputime accounting that doesn't depend
on the timer tick.
To implement it, this new cputime infrastructure plugs into
kernel/user/guest boundaries to take snapshots of cputime and
flush these to the stats when needed. This performs pretty
much like CONFIG_VIRT_CPU_ACCOUNTING except that context location
and cputime snaphots are synchronized between write and read
side such that the latter can safely retrieve the pending tickless
cputime of a task and add it to its latest cputime snapshot to
return the correct result to the user.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIcBAABAgAGBQJRBsKnAAoJEIUkVEdQjox3lMgP/2R6DU2f8PyGIao3hne4M3Pu
L3q+mAG53b24Dy014KeW7gd8yv45fE7wp/rs8CGLte9VzbLkRCDSFQPgBuXVagRj
tV5nfAuqD0wHTnA+HhBE3l3C2RKAPGIu79rBpnIR/QIPPl8Z3Dby8YgmxEQKDf8G
j7MEBu2LthSuqEi2ZXemnO5r0oEnQAzAp4TTi/M38k0Fmt59nOGyjLnI+xHYCBMa
1pnz7j3jjR9NJExGu8iVvbo+jupuQngP8qmkLXHvYnj/TEJNwzO1hHVoSwOpjYpS
9ycl+T8IKQLbAkBywLtq3Mzde43xt/t8wYyGZ0oAV+Z7MIpz/9YIfDJwqQeqoNbD
dAdbNjKMbsxCgmrnyqSagfMQg/r3CPZ4vf40TMCaN4gNUJC4Ie+E4kPRKRh59+PB
Ukthmqujn0f40LAa+HXTUuzafd3b0s/ewH+8FuQ6LAG9b5+WnoN8JTJ5u6+ydokO
ZleeOowuRZZEg+abQ8Sm2GRm/BzN29gi/npb//I+ZDXWv/+3yccgsiPjCRzCAAaO
g1RmYryFSRUwHQbGNNypVWVuOLWvrBQ4jqbGO7BBuBByZMSHryKxR6mb+inH3qLE
xIDM9SdSJisc292OzoFKwVZki4MaXaadJXJduVvqYlZQvXXs7eAa4wo3euhtVITD
NLQO5OZXE4oIQmDFb0FV
=1Tzp
-----END PGP SIGNATURE-----
Merge tag 'full-dynticks-cputime-for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks into sched/core
Pull full-dynticks (user-space execution is undisturbed and
receives no timer IRQs) preparation changes that convert the
cputime accounting code to be full-dynticks ready,
from Frederic Weisbecker:
"This implements the cputime accounting on full dynticks CPUs.
Typical cputime stats infrastructure relies on the timer tick and
its periodic polling on the CPU to account the amount of time
spent by the CPUs and the tasks per high level domains such as
userspace, kernelspace, guest, ...
Now we are preparing to implement full dynticks capability on
Linux for Real Time and HPC users who want full CPU isolation.
This feature requires a cputime accounting that doesn't depend
on the timer tick.
To implement it, this new cputime infrastructure plugs into
kernel/user/guest boundaries to take snapshots of cputime and
flush these to the stats when needed. This performs pretty
much like CONFIG_VIRT_CPU_ACCOUNTING except that context location
and cputime snaphots are synchronized between write and read
side such that the latter can safely retrieve the pending tickless
cputime of a task and add it to its latest cputime snapshot to
return the correct result to the user."
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull x86 fixes from Ingo Molnar:
"Three small fixlets"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/intel/cacheinfo: Shut up annoying warning
x86, doc: Boot protocol 2.12 is in 3.8
x86-64: Replace left over sti/cli in ia32 audit exit code
Pull perf fixes from Ingo Molnar:
"Three fixlets and two small (and low risk) hw-enablement changes"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf: Fix event group context move
x86/perf: Add IvyBridge EP support
perf/x86: Fix P6 driver section warning
arch/x86/tools/insn_sanity.c: Identify source of messages
perf/x86: Enable Intel Lincroft/Penwell/Cloverview Atom support
I've been getting the following warning when doing randbuilds
since forever. Now it finally pissed me off just the perfect
amount so that I can fix it.
arch/x86/kernel/cpu/intel_cacheinfo.c:489:27: warning: ‘cache_disable_0’ defined but not used [-Wunused-variable]
arch/x86/kernel/cpu/intel_cacheinfo.c:491:27: warning: ‘cache_disable_1’ defined but not used [-Wunused-variable] arch/x86/kernel/cpu/intel_cacheinfo.c:524:27: warning: ‘subcaches’ defined but not used [-Wunused-variable]
It happens because in randconfigs where CONFIG_SYSFS is not set,
the whole sysfs-interface to L3 cache index disabling is
remaining unused and gcc correctly warns about it. Make it
optional, depending on CONFIG_SYSFS too, as is the case with
other sysfs-related machinery in this file.
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andreas Herrmann <andreas.herrmann3@amd.com>
Link: http://lkml.kernel.org/r/1359969195-27362-1-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Rename EVENT_ATTR() to PMU_EVENT_ATTR() and make it global so it is
available to all architectures.
Further to allow architectures flexibility, have PMU_EVENT_ATTR() pass
in the variable name as a parameter.
Changelog[v2]
- [Jiri Olsa] No need to define PMU_EVENT_PTR()
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Anton Blanchard <anton@au1.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: linuxppc-dev@ozlabs.org
Link: http://lkml.kernel.org/r/20130123062422.GC13720@us.ibm.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Pull x86 fixes from Peter Anvin:
"This is a collection of miscellaneous fixes, the most important one is
the fix for the Samsung laptop bricking issue (auto-blacklisting the
samsung-laptop driver); the efi_enabled() changes you see below are
prerequisites for that fix.
The other issues fixed are booting on OLPC XO-1.5, an UV fix, NMI
debugging, and requiring CAP_SYS_RAWIO for MSR references, just as
with I/O port references."
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
samsung-laptop: Disable on EFI hardware
efi: Make 'efi_enabled' a function to query EFI facilities
smp: Fix SMP function call empty cpu mask race
x86/msr: Add capabilities check
x86/dma-debug: Bump PREALLOC_DMA_DEBUG_ENTRIES
x86/olpc: Fix olpc-xo1-sci.c build errors
arch/x86/platform/uv: Fix incorrect tlb flush all issue
x86-64: Fix unwind annotations in recent NMI changes
x86-32: Start out cr0 clean, disable paging before modifying cr3/4
Originally 'efi_enabled' indicated whether a kernel was booted from
EFI firmware. Over time its semantics have changed, and it now
indicates whether or not we are booted on an EFI machine with
bit-native firmware, e.g. 64-bit kernel with 64-bit firmware.
The immediate motivation for this patch is the bug report at,
https://bugs.launchpad.net/ubuntu-cdimage/+bug/1040557
which details how running a platform driver on an EFI machine that is
designed to run under BIOS can cause the machine to become
bricked. Also, the following report,
https://bugzilla.kernel.org/show_bug.cgi?id=47121
details how running said driver can also cause Machine Check
Exceptions. Drivers need a new means of detecting whether they're
running on an EFI machine, as sadly the expression,
if (!efi_enabled)
hasn't been a sufficient condition for quite some time.
Users actually want to query 'efi_enabled' for different reasons -
what they really want access to is the list of available EFI
facilities.
For instance, the x86 reboot code needs to know whether it can invoke
the ResetSystem() function provided by the EFI runtime services, while
the ACPI OSL code wants to know whether the EFI config tables were
mapped successfully. There are also checks in some of the platform
driver code to simply see if they're running on an EFI machine (which
would make it a bad idea to do BIOS-y things).
This patch is a prereq for the samsung-laptop fix patch.
Cc: David Airlie <airlied@linux.ie>
Cc: Corentin Chary <corentincj@iksaif.net>
Cc: Matthew Garrett <mjg59@srcf.ucam.org>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Olof Johansson <olof@lixom.net>
Cc: Peter Jones <pjones@redhat.com>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: Steve Langasek <steve.langasek@canonical.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Konrad Rzeszutek Wilk <konrad@kernel.org>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: <stable@vger.kernel.org>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
This is in preparation for the full dynticks feature. While
remotely reading the cputime of a task running in a full
dynticks CPU, we'll need to do some extra-computation. This
way we can account the time it spent tickless in userspace
since its last cputime snapshot.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
At the moment the MSR driver only relies upon file system
checks. This means that anything as root with any capability set
can write to MSRs. Historically that wasn't very interesting but
on modern processors the MSRs are such that writing to them
provides several ways to execute arbitary code in kernel space.
Sample code and documentation on doing this is circulating and
MSR attacks are used on Windows 64bit rootkits already.
In the Linux case you still need to be able to open the device
file so the impact is fairly limited and reduces the security of
some capability and security model based systems down towards
that of a generic "root owns the box" setup.
Therefore they should require CAP_SYS_RAWIO to prevent an
elevation of capabilities. The impact of this is fairly minimal
on most setups because they don't have heavy use of
capabilities. Those using SELinux, SMACK or AppArmor rules might
want to consider if their rulesets on the MSR driver could be
tighter.
Signed-off-by: Alan Cox <alan@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Horses <stable@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
I ran out of free entries when I had CONFIG_DMA_API_DEBUG
enabled. Some other archs seem to default to 65536, so increase
this limit for x86 too.
Signed-off-by: Maarten Lankhorst <maarten.lankhorst@canonical.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Link: http://lkml.kernel.org/r/50A612AA.7040206@canonical.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
----
Running the perf utility on a Ivybridge EP server we encounter
"not supported" events:
<not supported> L1-dcache-loads
<not supported> L1-dcache-load-misses
<not supported> L1-dcache-stores
<not supported> L1-dcache-store-misses
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
This patch adds support for this processor.
Signed-off-by: Youquan Song <youquan.song@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Youquan Song <youquan.song@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lkml.kernel.org/r/1355851223-27705-1-git-send-email-youquan.song@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull tracing updates from Steve Rostedt.
This commit:
tracing: Remove the extra 4 bytes of padding in events
changes the ABI. All involved parties seem to agree that it's safe to
do now, but the devil is in the details ...
Signed-off-by: Ingo Molnar <mingo@kernel.org>
While in one case a plain annotation is necessary, in the other
case the stack adjustment can simply be folded into the
immediately preceding RESTORE_ALL, thus getting the correct
annotation for free.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Alexander van Heukelum <heukelum@mailshack.com>
Link: http://lkml.kernel.org/r/51010C9302000078000B9045@nat28.tlf.novell.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
. revert 20b279 - require exclude_guest to use PEBS - kernel side,
now older binaries will continue working for things like cycles:pp
without needing to pass extra modifiers, from David Ahern.
. Fix building from 'make perf-*-src-pkg' tarballs, broken by UAPI, from
Sebastian Andrzej Siewior
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.14 (GNU/Linux)
iQIcBAABAgAGBQJQ7yQhAAoJENZQFvNTUqpAjj4P/RR+8WeXpV02yzndhry+Yjav
L/WQH0CkRRmyUA5akTcpgFrohJCEOi+auHl7ivmDX4XWFavcAkX3H1Yz1FytCOkb
Cbb9Lv1rGdlTno8fTVUn8mNyTG64AlWrAw3ICixWw6q6I/k6SO7EkKigCxPhmY+2
BE2EvkZmOY/PEUXgM6HtUdifORatX48p1toS7p3CDQ31cxBN5OVNZUXa1FakJpyH
7R+1imKLsjuyi/G7Bt061LyWQkOh7L/ITWN+5Rx4RsUwRRT3vm1H9nlqUBsPS0PW
qfkktkCmn/cFpKbfBipuglnt16jHPMfI/pghKvzx8n2uJMNEGXbfFDJefpzcdih9
wIRgB6a5bvA8VF6Xpcn0I5JhqLAcnWTer07JgjZevjqYCdZStpbJjvE5131JjTLw
Dnm7UshE+VFBcA3iXNX64p/X7WDJSk+SIDsJDuNe57dktFVLw76Ibb55XG18Ex7e
c9QcIEhD1P19VzOniDZQZNEJhqnu5Vjle/eG+JRVRCm/BgoQFyJuD3EooKPN7hHR
Op4oqf5RhDf7XNH0+Y4rOdRMZRiumdfcEl6kdcQGPJPycxpD7xCJNzBLWK/BvQgT
Kl0AEkRC0KE2c5LyFttW+g1Byu1rctlMz2TVJDTskTm0XGOQ9mTzsQBP5rgBVt+b
hQsMMWNVSI+jfm8bTJwx
=LTjZ
-----END PGP SIGNATURE-----
Merge tag 'perf-urgent-for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux
Pull perf/urgent fixes from Arnaldo Carvalho de Melo:
. revert 20b279 - require exclude_guest to use PEBS - kernel side, now
older binaries will continue working for things like cycles:pp
without needing to pass extra modifiers, from David Ahern.
. Fix building from 'make perf-*-src-pkg' tarballs, broken by UAPI,
from Sebastian Andrzej Siewior
[ Pulling directly, Ingo would normally pull but has been unresponsive ]
* tag 'perf-urgent-for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux:
perf tools: Fix building from 'make perf-*-src-pkg' tarballs
perf x86: revert 20b279 - require exclude_guest to use PEBS - kernel side
putreg() assumes that the tracee is not running and pt_regs_access() can
safely play with its stack. However a killed tracee can return from
ptrace_stop() to the low-level asm code and do RESTORE_REST, this means
that debugger can actually read/modify the kernel stack until the tracee
does SAVE_REST again.
set_task_blockstep() can race with SIGKILL too and in some sense this
race is even worse, the very fact the tracee can be woken up breaks the
logic.
As Linus suggested we can clear TASK_WAKEKILL around the arch_ptrace()
call, this ensures that nobody can ever wakeup the tracee while the
debugger looks at it. Not only this fixes the mentioned problems, we
can do some cleanups/simplifications in arch_ptrace() paths.
Probably ptrace_unfreeze_traced() needs more callers, for example it
makes sense to make the tracee killable for oom-killer before
access_process_vm().
While at it, add the comment into may_ptrace_stop() to explain why
ptrace_stop() still can't rely on SIGKILL and signal_pending_state().
Reported-by: Salman Qazi <sqazi@google.com>
Reported-by: Suleiman Souhlal <suleiman@google.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch
5a5a51db78 x86-32: Start out eflags and cr4 clean
... made x86-32 match x86-64 in that we initialize %eflags and %cr4
from scratch. This broke OLPC XO-1.5, because the XO enters the
kernel with paging enabled, which the kernel doesn't expect.
Since we no longer support 386 (the source of most of the variability
in %cr0 configuration), we can simply match further x86-64 and
initialize %cr0 to a fixed value -- the one variable part remaining in
%cr0 is for FPU control, but all that is handled later on in
initialization; in particular, configuring %cr0 as if the FPU is
present until proven otherwise is correct and necessary for the probe
to work.
To deal with the XO case sanely, explicitly disable paging in %cr0
before we muck with %cr3, %cr4 or EFER -- those operations are
inherently unsafe with paging enabled.
NOTE: There is still a lot of 386-related junk in head_32.S which we
can and should get rid of, however, this is intended as a minimal fix
whereas the cleanup can be deferred to the next merge window.
Reported-by: Andres Salomon <dilinger@queued.net>
Tested-by: Daniel Drake <dsd@laptop.org>
Link: http://lkml.kernel.org/r/50FA0661.2060400@linux.intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
- CVE-2013-0190/XSA-40 (or stack corruption for 32-bit PV kernels)
- Fix racy vma access spotted by Al Viro
- Fix mmap batch ioctl potentially resulting in large O(n) page allcations.
- Fix vcpu online/offline BUG:scheduling while atomic..
- Fix unbound buffer scanning for more than 32 vCPUs.
- Fix grant table being incorrectly initialized
- Fix incorrect check in pciback
- Allow privcmd in backend domains.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.13 (GNU/Linux)
iQEcBAABAgAGBQJQ+L7qAAoJEFjIrFwIi8fJLNIH/jUsneraEggWeh0L4GGWZvWL
cNCf0zjQt/pi1Q5drbleW2/6Wv6s6N1QA9pGRsJ+rrliC73HVTqIWFh0TjpwmCVy
hZal7jDXOuFVIR7GbGEPn004T6mkEnYDb/O2fyojwMVg0NQYwtMYJfTBkKdjKnmV
z6sWpQPVqO3/nZ17k2DipYRldbeiqS6LLOiUWd72b2W8bV4ySY5iVPVsqFusSEr6
PNyW33RPs5H0jEPR1uJlLD+l/uIbENykpEPeAS2uHGlch129+xHH5h79dwYJTbw6
x5nAOveO9VNJscUoqhpE7YbySzJmrUwxnBerZ6YTW6WCknYXrx4uiVAlfWem7uY=
=26Sq
-----END PGP SIGNATURE-----
Merge tag 'stable/for-linus-3.8-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen
Pull Xen fixes from Konrad Rzeszutek Wilk:
- CVE-2013-0190/XSA-40 (or stack corruption for 32-bit PV kernels)
- Fix racy vma access spotted by Al Viro
- Fix mmap batch ioctl potentially resulting in large O(n) page allcations.
- Fix vcpu online/offline BUG:scheduling while atomic..
- Fix unbound buffer scanning for more than 32 vCPUs.
- Fix grant table being incorrectly initialized
- Fix incorrect check in pciback
- Allow privcmd in backend domains.
Fix up whitespace conflict due to ugly merge resolution in Xen tree in
arch/arm/xen/enlighten.c
* tag 'stable/for-linus-3.8-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
xen: Fix stack corruption in xen_failsafe_callback for 32bit PVOPS guests.
Revert "xen/smp: Fix CPU online/offline bug triggering a BUG: scheduling while atomic."
xen/gntdev: remove erronous use of copy_to_user
xen/gntdev: correctly unmap unlinked maps in mmu notifier
xen/gntdev: fix unsafe vma access
xen/privcmd: Fix mmap batch ioctl.
Xen: properly bound buffer access when parsing cpu/*/availability
xen/grant-table: correctly initialize grant table version 1
x86/xen : Fix the wrong check in pciback
xen/privcmd: Relax access control in privcmd_ioctl_mmap
This fixes CVE-2013-0190 / XSA-40
There has been an error on the xen_failsafe_callback path for failed
iret, which causes the stack pointer to be wrong when entering the
iret_exc error path. This can result in the kernel crashing.
In the classic kernel case, the relevant code looked a little like:
popl %eax # Error code from hypervisor
jz 5f
addl $16,%esp
jmp iret_exc # Hypervisor said iret fault
5: addl $16,%esp
# Hypervisor said segment selector fault
Here, there are two identical addls on either option of a branch which
appears to have been optimised by hoisting it above the jz, and
converting it to an lea, which leaves the flags register unaffected.
In the PVOPS case, the code looks like:
popl_cfi %eax # Error from the hypervisor
lea 16(%esp),%esp # Add $16 before choosing fault path
CFI_ADJUST_CFA_OFFSET -16
jz 5f
addl $16,%esp # Incorrectly adjust %esp again
jmp iret_exc
It is possible unprivileged userspace applications to cause this
behaviour, for example by loading an LDT code selector, then changing
the code selector to be not-present. At this point, there is a race
condition where it is possible for the hypervisor to return back to
userspace from an interrupt, fault on its own iret, and inject a
failsafe_callback into the kernel.
This bug has been present since the introduction of Xen PVOPS support
in commit 5ead97c84 (xen: Core Xen implementation), in 2.6.23.
Signed-off-by: Frediano Ziglio <frediano.ziglio@citrix.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: stable@vger.kernel.org
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Pull x86 fixes from Peter Anvin:
"This is mainly a workaround for a bug in Sandy Bridge graphics which
causes corruption of certain memory pages."
* 'x86/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/Sandy Bridge: Sandy Bridge workaround depends on CONFIG_PCI
x86/Sandy Bridge: mark arrays in __init functions as __initconst
x86/Sandy Bridge: reserve pages when integrated graphics is present
x86, efi: correct precedence of operators in setup_efi_pci
early_pci_allowed() and read_pci_config_16() are only available if
CONFIG_PCI is defined.
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
SNB graphics devices have a bug that prevent them from accessing certain
memory ranges, namely anything below 1M and in the pages listed in the
table. So reserve those at boot if set detect a SNB gfx device on the
CPU to avoid GPU hangs.
Stephane Marchesin had a similar patch to the page allocator awhile
back, but rather than reserving pages up front, it leaked them at
allocation time.
[ hpa: made a number of stylistic changes, marked arrays as static
const, and made less verbose; use "memblock=debug" for full
verbosity. ]
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Pull KVM bugfixes from Marcelo Tosatti.
* git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: x86: use dynamic percpu allocations for shared msrs area
KVM: PPC: Book3S HV: Fix compilation without CONFIG_PPC_POWERNV
powerpc: Corrected include header path in kvm_para.h
Add rcu user eqs exception hooks for async page fault
This patch is brought to you by the letter 'H'.
Commit 20b279 breaks compatiblity with older perf binaries when run with
precise modifier (:p or :pp) by requiring the exclude_guest attribute to be
set. Older binaries default exclude_guest to 0 (ie., wanting guest-based
samples) unless host only profiling is requested (:H modifier). The workaround
for older binaries is to add H to the modifier list (e.g., -e cycles:ppH -
toggles exclude_guest to 1). This was deemed unacceptable by Linus:
https://lkml.org/lkml/2012/12/12/570
Between family in town and the fresh snow in Breckenridge there is no time left
to be working on the proper fix for this over the holidays. In the New Year I
have more pressing problems to resolve -- like some memory leaks in perf which
are proving to be elusive -- although the aforementioned snow is probably why
they are proving to be elusive. Either way I do not have any spare time to work
on this and from the time I have managed to spend on it the solution is more
difficult than just moving to a new exclude_guest flag (does not work) or
flipping the logic to include_guest (which is not as trivial as one would
think).
So, two options: silently force exclude_guest on as suggested by Gleb which
means no impact to older perf binaries or revert the original patch which
caused the breakage.
This patch does the latter -- reverts the original patch that introduced the
regression. The problem can be revisited in the future as time allows.
Signed-off-by: David Ahern <dsahern@gmail.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Robert Richter <robert.richter@amd.com>
Link: http://lkml.kernel.org/r/1356749767-17322-1-git-send-email-dsahern@gmail.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
CONFIG_HOTPLUG is going away as an option. As a result, the __dev*
markings need to be removed.
This change removes the use of __devinit, __devexit_p, __devinitconst,
and __devexit from these drivers.
Based on patches originally written by Bill Pemberton, but redone by me
in order to handle some of the coding style issues better, by hand.
Cc: Bill Pemberton <wfp5p@virginia.edu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Daniel Drake <dsd@laptop.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Pull signal handling cleanups from Al Viro:
"sigaltstack infrastructure + conversion for x86, alpha and um,
COMPAT_SYSCALL_DEFINE infrastructure.
Note that there are several conflicts between "unify
SS_ONSTACK/SS_DISABLE definitions" and UAPI patches in mainline;
resolution is trivial - just remove definitions of SS_ONSTACK and
SS_DISABLED from arch/*/uapi/asm/signal.h; they are all identical and
include/uapi/linux/signal.h contains the unified variant."
Fixed up conflicts as per Al.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
alpha: switch to generic sigaltstack
new helpers: __save_altstack/__compat_save_altstack, switch x86 and um to those
generic compat_sys_sigaltstack()
introduce generic sys_sigaltstack(), switch x86 and um to it
new helper: compat_user_stack_pointer()
new helper: restore_altstack()
unify SS_ONSTACK/SS_DISABLE definitions
new helper: current_user_stack_pointer()
missing user_stack_pointer() instances
Bury the conditionals from kernel_thread/kernel_execve series
COMPAT_SYSCALL_DEFINE: infrastructure
Conditional on CONFIG_GENERIC_SIGALTSTACK; architectures that do not
select it are completely unaffected
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull one final 386 removal patch from Peter Anvin.
IRQ 13 FPU error handling is gone. That was not one of the proudest
moments in PC history.
* 'x86/nuke386' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, 386 removal: Remove support for IRQ 13 FPU error reporting
This patch adds user eqs exception hooks for async page fault page not
present code path, to exit the user eqs and re-enter it as necessary.
Async page fault is different from other exceptions that it may be
triggered from idle process, so we still need rcu_irq_enter() and
rcu_irq_exit() to exit cpu idle eqs when needed, to protect the code
that needs use rcu.
As Frederic pointed out it would be safest and simplest to protect the
whole kvm_async_pf_task_wait(). Otherwise, "we need to check all the
code there deeply for potential RCU uses and ensure it will never be
extended later to use RCU.".
However, We'd better re-enter the cpu idle eqs if we get the exception
in cpu idle eqs, by calling rcu_irq_exit() before native_safe_halt().
So the patch does what Frederic suggested for rcu_irq_*() API usage
here, except that I moved the rcu_irq_*() pair originally in
do_async_page_fault() into kvm_async_pf_task_wait().
That's because, I think it's better to have rcu_irq_*() pairs to be in
one function ( rcu_irq_exit() after rcu_irq_enter() ), especially here,
kvm_async_pf_task_wait() has other callers, which might cause
rcu_irq_exit() be called without a matching rcu_irq_enter() before it,
which is illegal if the cpu happens to be in rcu idle state.
Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Remove support for FPU error reporting via IRQ 13, as opposed to
exception 16 (#MF). One last remnant of i386 gone.
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: Alan Cox <alan@linux.intel.com>
Pull security subsystem updates from James Morris:
"A quiet cycle for the security subsystem with just a few maintenance
updates."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
Smack: create a sysfs mount point for smackfs
Smack: use select not depends in Kconfig
Yama: remove locking from delete path
Yama: add RCU to drop read locking
drivers/char/tpm: remove tasklet and cleanup
KEYS: Use keyring_alloc() to create special keyrings
KEYS: Reduce initial permissions on keys
KEYS: Make the session and process keyrings per-thread
seccomp: Make syscall skipping and nr changes more consistent
key: Fix resource leak
keys: Fix unreachable code
KEYS: Add payload preparsing opportunity prior to key instantiate or update
This reverts commit bd52276fa1 ("x86-64/efi: Use EFI to deal with
platform wall clock (again)"), and the two supporting commits:
da5a108d05: "x86/kernel: remove tboot 1:1 page table creation code"
185034e72d: "x86, efi: 1:1 pagetable mapping for virtual EFI calls")
as they all depend semantically on commit 53b87cf088 ("x86, mm:
Include the entire kernel memory map in trampoline_pgd") that got
reverted earlier due to the problems it caused.
This was pointed out by Yinghai Lu, and verified by me on my Macbook Air
that uses EFI.
Pointed-out-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull x86 EFI update from Peter Anvin:
"EFI tree, from Matt Fleming. Most of the patches are the new efivarfs
filesystem by Matt Garrett & co. The balance are support for EFI
wallclock in the absence of a hardware-specific driver, and various
fixes and cleanups."
* 'core-efi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
efivarfs: Make efivarfs_fill_super() static
x86, efi: Check table header length in efi_bgrt_init()
efivarfs: Use query_variable_info() to limit kmalloc()
efivarfs: Fix return value of efivarfs_file_write()
efivarfs: Return a consistent error when efivarfs_get_inode() fails
efivarfs: Make 'datasize' unsigned long
efivarfs: Add unique magic number
efivarfs: Replace magic number with sizeof(attributes)
efivarfs: Return an error if we fail to read a variable
efi: Clarify GUID length calculations
efivarfs: Implement exclusive access for {get,set}_variable
efivarfs: efivarfs_fill_super() ensure we clean up correctly on error
efivarfs: efivarfs_fill_super() ensure we free our temporary name
efivarfs: efivarfs_fill_super() fix inode reference counts
efivarfs: efivarfs_create() ensure we drop our reference on inode on error
efivarfs: efivarfs_file_read ensure we free data in error paths
x86-64/efi: Use EFI to deal with platform wall clock (again)
x86/kernel: remove tboot 1:1 page table creation code
x86, efi: 1:1 pagetable mapping for virtual EFI calls
x86, mm: Include the entire kernel memory map in trampoline_pgd
...