Impact: allow archs more flexibility on dynamic ftrace implementations
Dynamic ftrace has largely been developed on x86. Since x86 does not
have the same limitations as other architectures, the ftrace interaction
between the generic code and the architecture specific code was not
flexible enough to handle some of the issues that other architectures
have.
The most notable issue is module trampolines. Because of the limited
branch distance some archs have when calling core kernel code from
modules, the module load code must create a trampoline that makes the
larger jump into core kernel code.
The problem arises when this happens to a call to mcount. Ftrace checks
all code before modifying it and makes sure the current code is what
it expects. Right now, there is not enough information to handle modifying
module trampolines.
This patch changes the API between generic dynamic ftrace code and
the arch dependent code. There are now two functions for modifying code:
ftrace_make_nop(mod, rec, addr) - convert the code at rec->ip into
a nop, where the original text is calling addr. (mod is the
module struct if called by module init)
ftrace_make_caller(rec, addr) - convert the code at rec->ip, which
should be a nop, into a call to addr.
The record "rec" now has a new field called "arch" where the architecture
can add any special attributes to each call site record.
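The check-then-modify behaviour described above can be sketched in user space as follows. This is only an illustration under stated assumptions: the sketch_* names, the 5-byte call site, and the simplified call encoding (opcode plus an absolute 32-bit target) are hypothetical, and the "mod" argument is left out. It is not the kernel implementation.

```c
#include <stdint.h>
#include <string.h>

#define SITE_SIZE 5  /* assumed size of a call instruction, as on x86 */

/* Hypothetical stand-in for a call site record; "arch" mirrors the new
 * per-architecture field described above. */
struct sketch_rec {
    uint8_t *ip;          /* address of the call site to patch */
    unsigned long arch;   /* arch-private attributes (unused in this sketch) */
};

/* Simplified, hypothetical encoding: opcode 0xe8 followed by a 32-bit
 * target. A real x86 call stores a relative offset instead. */
static void sketch_encode_call(uint8_t out[SITE_SIZE], uint32_t addr)
{
    out[0] = 0xe8;
    memcpy(out + 1, &addr, sizeof(addr));
}

/* A 5-byte x86 nop, used here as the "disabled" form of a call site. */
static const uint8_t sketch_nop[SITE_SIZE] = { 0x0f, 0x1f, 0x44, 0x00, 0x00 };

/* Replace "expected" with "replacement", but only if the site currently
 * holds exactly what we expect. Returns 0 on success, -1 on mismatch. */
static int sketch_modify(struct sketch_rec *rec, const uint8_t *expected,
                         const uint8_t *replacement)
{
    if (memcmp(rec->ip, expected, SITE_SIZE) != 0)
        return -1;
    memcpy(rec->ip, replacement, SITE_SIZE);
    return 0;
}

/* Convert a call to addr at rec->ip into a nop. */
int sketch_make_nop(struct sketch_rec *rec, uint32_t addr)
{
    uint8_t call[SITE_SIZE];

    sketch_encode_call(call, addr);
    return sketch_modify(rec, call, sketch_nop);
}

/* Convert the nop at rec->ip into a call to addr. */
int sketch_make_caller(struct sketch_rec *rec, uint32_t addr)
{
    uint8_t call[SITE_SIZE];

    sketch_encode_call(call, addr);
    return sketch_modify(rec, sketch_nop, call);
}
```

Passing addr to both functions is what lets the arch code verify the existing text before touching it.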
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This reverts commit e51af66308, which was
wrongly hoovered up and submitted about a month after a better fix had
already been merged.
The better fix is commit cbda1ba898
("PCI/iommu: blacklist DMAR on Intel G31/G33 chipsets"), where we do
this blacklisting based on the DMI identification for the offending
motherboard, since sometimes this chipset (or at least a chipset with
the same PCI ID) apparently _does_ actually have an IOMMU.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Impact: fix crash during hibernation on 32-bit NUMA
The NUMA code on x86_32 creates special memory mapping that allows
each node's pgdat to be located in this node's memory. For this
purpose it allocates a memory area at the end of each node's memory
and maps this area so that it is accessible with virtual addresses
belonging to low memory. As a result, if there is high memory,
these NUMA-allocated areas are physically located in high memory,
although they are mapped to low memory addresses.
Our hibernation code does not take that into account and for this
reason hibernation fails on all x86_32 systems with CONFIG_NUMA=y and
with high memory present. Fix this by adding a special mapping for
the NUMA-allocated memory areas to the temporary page tables created
during the last phase of resume.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: make API available to the rest of x86 platform code
Add prototype to asm/reboot.h.
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This removes the acpi_irq_balance_set() interface from the PCI
interrupt link driver.
x86 used acpi_irq_balance_set() to tell the PCI interrupt link
driver to configure links to minimize IRQ sharing. But the link
driver can easily figure out whether to turn on IRQ balancing
based on the IRQ model (PIC/IOAPIC/etc), so we can get rid of
that external interface.
It's better for the driver to figure this out at init-time. If
we set it externally via the x86 code, the interface reduces
modularity, and we depend on the fact that acpi_process_madt()
happens before we process the kernel command line.
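A minimal sketch of the init-time decision, assuming that balancing is turned on in IOAPIC mode and left off otherwise. The enum and function names are hypothetical and only illustrate the idea of deriving the setting from the IRQ model instead of an external setter.

```c
#include <stdbool.h>

/* Hypothetical IRQ model enumeration, for illustration only. */
enum sketch_irq_model {
    SKETCH_IRQ_MODEL_PIC,
    SKETCH_IRQ_MODEL_IOAPIC,
};

/* Decide at init time whether to balance link IRQs.
 * Assumption: balancing is wanted only when an IOAPIC is in use. */
bool sketch_irq_balance(enum sketch_irq_model model)
{
    return model == SKETCH_IRQ_MODEL_IOAPIC;
}
```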
Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Impact: Changes reboot behavior.
If port CF9 seems to be safe to touch, attempt it before trying the
keyboard controller. Port CF9 is not available on all chipsets (a
significant but decreasing number of modern chipsets don't implement
it), but port CF9 itself should in general be safe to poke (no ill
effects if unimplemented) on any system which has PCI Configuration
Method #1 or #2, as it falls inside the PCI configuration port range
in both cases. No chipset without PCI is known to have port CF9,
either, although an explicit "pci=bios" would mean we miss this and
therefore don't use port CF9. An explicit "reboot=pci" can be used to
force the use of port CF9.
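The selection rule above can be sketched as a small predicate. The enum, the function name and the "forced" flag (standing in for "reboot=pci") are hypothetical; this is not the actual reboot code.

```c
#include <stdbool.h>

/* Hypothetical description of how PCI configuration space is accessed. */
enum sketch_pci_access {
    SKETCH_PCI_NONE,    /* no PCI detected */
    SKETCH_PCI_BIOS,    /* explicit "pci=bios" */
    SKETCH_PCI_CONF1,   /* PCI Configuration Method #1 */
    SKETCH_PCI_CONF2,   /* PCI Configuration Method #2 */
};

/* Return true if port CF9 should be tried before the keyboard controller.
 * "forced" models an explicit "reboot=pci" on the command line. */
bool sketch_try_cf9(enum sketch_pci_access access, bool forced)
{
    if (forced)
        return true;
    return access == SKETCH_PCI_CONF1 || access == SKETCH_PCI_CONF2;
}
```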
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Move the IRQ stub generation to assembly to simplify it and for
consistency with 32 bits. Doing it in a C file with asm() statements
doesn't help clarity, and it prevents some optimizations.
Shrink the IRQ stubs down to just over four bytes per stub (we fit seven
into a 32-byte chunk). This shrinks the total icache consumption of
the IRQ stubs down to an even kilobyte, if all of them are in active
use.
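The "even kilobyte" figure can be checked with a little arithmetic. The sketch below assumes 256 vectors with the first external vector at 0x20; those two constants and all names are assumptions for illustration and are not taken from this patch.

```c
#define SKETCH_NR_VECTORS            256   /* assumed: x86 vector count */
#define SKETCH_FIRST_EXTERNAL_VECTOR 0x20  /* assumed: first external vector */
#define SKETCH_STUBS_PER_CHUNK       7     /* from the text above */
#define SKETCH_CHUNK_BYTES           32    /* from the text above */

/* Total bytes occupied by the stubs when packed seven per 32-byte chunk. */
unsigned int sketch_stub_bytes(void)
{
    unsigned int stubs = SKETCH_NR_VECTORS - SKETCH_FIRST_EXTERNAL_VECTOR;
    /* Round up to whole chunks. */
    unsigned int chunks = (stubs + SKETCH_STUBS_PER_CHUNK - 1) /
                          SKETCH_STUBS_PER_CHUNK;

    return chunks * SKETCH_CHUNK_BYTES;
}
```

With these assumptions, 224 stubs fill exactly 32 chunks of 32 bytes, which is 1024 bytes.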
The downside is that we end up with a double jump, which could have a
negative effect on some pipelines. The double jump is always inside
the same cacheline on any modern chips.
To get the most effect, cache-align the IRQ stubs.
This makes the 64-bit code match changes already done to the 32-bit
code, and should open up irqinit*.c for unification.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Don't generate interrupt stubs for interrupt vectors below
FIRST_EXTERNAL_VECTOR, and make the table of interrupt vectors
(interrupt[]) __initconst. These changes both conserve memory
and improve consistency with 64 bits.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: really halt all CPUs on halt
Function machine_halt (resp. native_machine_halt) is empty for x86
architectures. When command 'halt -f' is invoked, the message "System
halted." is displayed but this is not really true because all CPUs are
still running.
There are similar inconsistencies on other arches as well (some use
power-off for halt, others loop forever with IRQs enabled or disabled).
IMO the same approach should be used on all architectures; otherwise,
what does the message "System halted" really mean?
This patch fixes it for x86.
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: add infrastructure for function-return tracing
Add low level support for ftrace return tracing.
This plug-in stores return addresses on the thread_info structure of
the current task.
The index of the current return address is initialized when the task
is the first one (init) and when a process forks (the child). It is
not needed when a task does a sys_execve because after this syscall,
it still needs to return on the kernel functions it called.
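The per-task storage can be pictured as a small stack of return addresses with an index. The sketch below is a user-space model only: the structure, the field names and the depth of 20 are hypothetical and do not reflect the actual thread_info layout.

```c
#define SKETCH_RET_DEPTH 20  /* hypothetical capacity */

/* Hypothetical stand-in for the fields added to thread_info. */
struct sketch_thread_info {
    int curr_ret_stack;                        /* index of top entry, -1 if empty */
    unsigned long ret_stack[SKETCH_RET_DEPTH]; /* saved return addresses */
};

/* Done for the first task (init) and for the child on fork. */
void sketch_init_ret_stack(struct sketch_thread_info *ti)
{
    ti->curr_ret_stack = -1;
}

/* Save the real return address on function entry. Returns -1 if full. */
int sketch_push_return(struct sketch_thread_info *ti, unsigned long ret)
{
    if (ti->curr_ret_stack >= SKETCH_RET_DEPTH - 1)
        return -1;
    ti->ret_stack[++ti->curr_ret_stack] = ret;
    return 0;
}

/* Fetch the saved return address on function exit. Returns -1 if empty. */
int sketch_pop_return(struct sketch_thread_info *ti, unsigned long *ret)
{
    if (ti->curr_ret_stack < 0)
        return -1;
    *ret = ti->ret_stack[ti->curr_ret_stack--];
    return 0;
}
```

The index is not reset on exec in this model either, since the entries pushed before the syscall must still be popped when those kernel functions return.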
Note that the code of return_to_handler was suggested by Steven
Rostedt, as were almost all of the improvement ideas in this V3.
For safety, arch/x86/kernel/process_32.c is not traced
because __switch_to() changes the current task during its execution.
That could make the stored return address of this function
inconsistent, even though I did not see any crash when testing with
tracing enabled on this function.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Arches that have their own irq_regs definition are expected to
define ARCH_HAS_OWN_IRQ_REGS or else a generic (unused) set
will also be defined in lib/irq_regs.c
Sparse noticed the unused generic one had no prototype:
lib/irq_regs.c:15:1: warning: symbol 'per_cpu____irq_regs' was not declared. Should it be static?
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: remove unused variable
I forgot to remove the now unused "cycles_t cycles" parameter from
vget_cycles() - which triggers build warnings as tsc.h is included
in a number of files.
Remove it.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
a new file was accidentally added to include/asm-x86;
move it to the new arch/x86/include/asm location
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Impact: cleanup
Move rdtsc_barrier() use to vsyscall_64.c where it's relied on,
and point out its role in the context of its use.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
in scheduler-intense workloads native_read_tsc() overhead accounts for
20% of the system overhead:
659567 system_call 41222.9375
686796 schedule 435.7843
718382 __switch_to 665.1685
823875 switch_mm 4526.7857
1883122 native_read_tsc 55385.9412
9761990 total 2.8468
this is in large part due to the rdtsc_barrier() that is done before
and after reading the TSC.
But sched_clock() is not a precise clock in the GTOD sense, so using such
barriers is completely pointless. So remove the barriers and only use
them in vget_cycles().
This improves lat_ctx performance by about 5%.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
Revert "x86: default to reboot via ACPI"
x86: align DirectMap in /proc/meminfo
AMD IOMMU: fix lazy IO/TLB flushing in unmap path
x86: add smp_mb() before sending INVALIDATE_TLB_VECTOR
x86: remove VISWS and PARAVIRT around NR_IRQS puzzle
x86: mention ACPI in top-level Kconfig menu
x86: size NR_IRQS on 32-bit systems the same way as 64-bit
x86: don't allow nr_irqs > NR_IRQS
x86/docs: remove noirqbalance param docs
x86: don't use tsc_khz to calculate lpj if notsc is passed
x86, voyager: fix smp_intr_init() compile breakage
AMD IOMMU: fix detection of NP capable IOMMUs
Impact: fix warning message when PARAVIRT is set in config
Remove stale #ifdef components from our IRQ sizing logic.
x86/Voyager is the only holdout.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: make NR_IRQS big enough for system with lots of apic/pins
If lots of IO_APIC's are there (or can be there), size the same way
as 64-bit, depending on MAX_IO_APICS and NR_CPUS.
This fixes the boot problem reported by Ben Hutchings on a 32-bit
server with 5 IO-APICs and 240 IO-APIC pins.
Signed-off-by: Yinghai <yinghai@kernel.org>
Tested-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: Minor optimization.
Implement change_bit with immediate bit count as "lock xorb". This is
similar to "lock orb" and "lock andb" for set_bit and clear_bit
functions.
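The idea is to touch only the byte that holds the bit. A portable sketch using the GCC/Clang __atomic builtins is shown below; the function name is hypothetical, and whether the compiler emits "lock xorb" for it depends on the target and compiler. The byte addressing matches an unsigned long bitmap only on little-endian machines such as x86.

```c
#include <stdint.h>

/* Atomically toggle bit nr in the bitmap at addr by XOR-ing a single byte. */
void sketch_change_bit(unsigned int nr, volatile void *addr)
{
    /* Byte that contains the bit, and the bit's mask within that byte. */
    volatile uint8_t *byte = (volatile uint8_t *)addr + (nr >> 3);
    uint8_t mask = (uint8_t)(1u << (nr & 7));

    __atomic_fetch_xor(byte, mask, __ATOMIC_SEQ_CST);
}
```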
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: improve wakeup affinity on NUMA systems, tweak SMP systems
Given the fixes+tweaks to the wakeup-buddy code, re-tweak the domain
balancing defaults on NUMA and SMP systems.
Turn on SD_WAKE_AFFINE which was off on x86 NUMA - there's no reason
why we would not want to have wakeup affinity across nodes as well.
(we already do this in the standard NUMA template.)
lat_ctx on a NUMA box is particularly happy about this change:
before:
| phoenix:~/l> ./lat_ctx -s 0 2
| "size=0k ovr=2.60
| 2 5.70
after:
| phoenix:~/l> ./lat_ctx -s 0 2
| "size=0k ovr=2.65
| 2 2.07
a 2.75x speedup.
pipe-test is similarly happy about it too:
| phoenix:~/sched-tests> ./pipe-test
| 18.26 usecs/loop.
| 14.70 usecs/loop.
| 14.38 usecs/loop.
| 10.55 usecs/loop. # +WAKE_AFFINE on domain0+domain1
| 8.63 usecs/loop.
| 8.59 usecs/loop.
| 9.03 usecs/loop.
| 8.94 usecs/loop.
| 8.96 usecs/loop.
| 8.63 usecs/loop.
Also:
- disable SD_BALANCE_NEWIDLE on NUMA and SMP domains (keep it for siblings)
- enable SD_WAKE_BALANCE on SMP domains
Sysbench+postgresql improves all around the board, quite significantly:
.28-rc3-11474e2c .28-rc3-11474e2c-tune
-------------------------------------------------
1: 571 688 +17.08%
2: 1236 1206 -2.55%
4: 2381 2642 +9.89%
8: 4958 5164 +3.99%
16: 9580 9574 -0.07%
32: 7128 8118 +12.20%
64: 7342 8266 +11.18%
128: 7342 8064 +8.95%
256: 7519 7884 +4.62%
512: 7350 7731 +4.93%
-------------------------------------------------
SUM: 55412 59341 +6.62%
So it's a win both for the runup portion, the peak area and the tail.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'io-mappings-for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
io mapping: clean up #ifdefs
io mapping: improve documentation
i915: use io-mapping interfaces instead of a variety of mapping kludges
resources: add io-mapping functions to dynamically map large device apertures
x86: add iomap_atomic*()/iounmap_atomic() on 32-bit using fixmaps
Impact: build fix for non-ftrace architectures
Not all archs implement ftrace, and therefore do not have an asm/ftrace.h.
This patch corrects the problem.
The ftrace_nmi_enter/exit now must be defined for all archs that implement
dynamic ftrace. Currently, only x86 does.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix x86/Voyager build
Looks like this became static on the rest of x86. Fix it up by adding
an external definition to mach-voyager/setup.c
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: Changes timebase calibration on VMware.
Use the synthetic TSC_RELIABLE bit to work around virtualization anomalies.
Virtual TSCs can be kept nearly in sync, but because the virtual TSC
offset is set by software, it's not perfect. So, the TSC
synchronization test can fail. Even then the TSC can be used as a
clocksource since the VMware platform exports a reliable TSC to the
guest for timekeeping purposes. Use this bit to check if we need to
skip the TSC sync checks.
Along with this, also set the CONSTANT_TSC bit when on VMware. We
still want to use the TSC as clocksource on a VM running over hardware
which has unsynchronized TSCs (Opterons), since the hypervisor will take
care of providing a consistent TSC to the guest.
Signed-off-by: Alok N Kataria <akataria@vmware.com>
Signed-off-by: Dan Hecht <dhecht@vmware.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: Changes timebase calibration on VMware.
v3->v2 : Abstract the hypervisor detection and feature (tsc_freq) request
behind a hypervisor.c file
v2->v1 : Add a x86_hyper_vendor field to the cpuinfo_x86 structure.
This avoids multiple calls to the hypervisor detection function.
This patch adds a function to detect if we are running under VMware.
The current way to check if we are on VMware is as follows:
# check if the "hypervisor present bit" is set; if so, read the 0x40000000
cpuid leaf and check for the "VMwareVMware" signature.
# if the above fails, check the DMI vendor name for the "VMware" string;
if we find one, we query the VMware hypervisor port to check if we are
under VMware.
The DMI + "VMware hypervisor port check" is needed for older VMware products,
which don't implement the hypervisor signature cpuid leaf.
Also note that since we are checking for the DMI signature the hypervisor
port should never be accessed on native hardware.
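The first detection step can be sketched as two pure helpers that take the register values returned by cpuid, so the sketch does not execute cpuid itself. The function names are hypothetical. The hypervisor present bit is bit 31 of ECX from cpuid leaf 1, and the signature is read from EBX, ECX and EDX of leaf 0x40000000.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_HYPERVISOR_BIT (1u << 31)  /* cpuid leaf 1, ECX bit 31 */

/* True if the "hypervisor present bit" is set in ECX of cpuid leaf 1. */
bool sketch_hypervisor_present(uint32_t leaf1_ecx)
{
    return (leaf1_ecx & SKETCH_HYPERVISOR_BIT) != 0;
}

/* ebx, ecx and edx are the values returned by cpuid leaf 0x40000000.
 * Together they spell a 12-byte vendor signature. */
bool sketch_is_vmware_signature(uint32_t ebx, uint32_t ecx, uint32_t edx)
{
    char sig[12];

    memcpy(sig + 0, &ebx, 4);
    memcpy(sig + 4, &ecx, 4);
    memcpy(sig + 8, &edx, 4);
    return memcmp(sig, "VMwareVMware", 12) == 0;
}
```

The DMI and hypervisor port fallback is not modelled here.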
This patch also adds a hypervisor_get_tsc_freq function. Instead of
calibrating the frequency, which can be error prone in a virtualized
environment, we ask the hypervisor for it. When running on VMware, we get
the frequency by accessing the hypervisor port.
Other hypervisors can also add code to the generic routine to get the
frequency on their platform.
Signed-off-by: Alok N Kataria <akataria@vmware.com>
Signed-off-by: Dan Hecht <dhecht@vmware.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: None, bit reservation only
Add a synthetic TSC_RELIABLE feature bit which will be used to mark
the TSC as reliable so that we can skip all the runtime checks for
TSC stability, which have false positives in virtual environments.
Signed-off-by: Alok N Kataria <akataria@vmware.com>
Signed-off-by: Dan Hecht <dhecht@vmware.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: cleanup
This patch cleans up the NMI safe code for dynamic ftrace as suggested
by Andrew Morton.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: introduce new APIs, separate kmap code from CONFIG_HIGHMEM
This reuses the code used for CONFIG_HIGHMEM memory mappings, except that
it is designed for dynamic IO resource mapping.
These fixmaps are available even with CONFIG_HIGHMEM turned off.
Signed-off-by: Keith Packard <keithp@keithp.com>
Signed-off-by: Eric Anholt <eric@anholt.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: change the kexec bootstrap code implementation from assembly to C
This patch converts the kexec page table setup code from assembly
to C code in machine_kexec_prepare. This improves readability and
reduces the line count.
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: save .text size when kexec is built in but not loaded
This patch adds an architecture specific struct kimage_arch into
struct kimage. The pointers to page table pages used by kexec are
added to struct kimage_arch. The page table pages are dynamically
allocated in machine_kexec_prepare instead of statically from the BSS
segment. This will save up to 20k of memory when no kexec image is
loaded.
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: build fix on x86/Voyager
Given commits like this:
| Author: Suresh Siddha <suresh.b.siddha@intel.com>
| Date: Tue Jul 29 10:29:19 2008 -0700
|
| x86, xsave: enable xsave/xrstor on cpus with xsave support
Which deliberately expose boot cpu dependence to pieces of the system,
I think it's time to explicitly have a variable for it to prevent this
continual misassumption that the boot CPU is zero.
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix crashes that can occur in NMI handlers, if their code is modified
Modifying code is something that needs special care. On SMP boxes,
if code that is being modified is also being executed on another CPU,
that CPU will have undefined results.
The dynamic ftrace uses kstop_machine to make the system act like a
uniprocessor system. But this does not address NMIs, that can still
run on other CPUs.
One approach to handle this is to make all code that is used by NMIs
not be traced. But NMIs can call notifiers that spread throughout the
kernel and this will be very hard to maintain, and the chance of missing
a function is very high.
The approach that this patch takes is to have the NMIs modify the code
if the modification is taking place. The way this works is that just
writing to code executing on another CPU is not harmful if what is
written is the same as what exists.
Two buffers are used: an IP buffer and a "code" buffer.
The steps that the patcher takes are:
1) Put in the instruction pointer into the IP buffer
and the new code into the "code" buffer.
2) Set a flag that says we are modifying code
3) Wait for any running NMIs to finish.
4) Write the code
5) clear the flag.
6) Wait for any running NMIs to finish.
If an NMI is executed, it will also write the pending code.
Multiple writes are OK, because what is being written is the same.
Then the patcher must wait for all running NMIs to finish before
going to the next line that must be patched.
This is basically the RCU approach to code modification.
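The six steps can be modelled in user space with C11 atomics. This is a sketch under stated assumptions: all sketch_* names, the 5-byte code size and the NMI entry/exit hooks are hypothetical, and real code would also need the memory barriers and write protection handling that the kernel requires.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_CODE_SIZE 5  /* assumed instruction size */

static uint8_t *sketch_mod_ip;                     /* the IP buffer */
static uint8_t sketch_mod_code[SKETCH_CODE_SIZE];  /* the "code" buffer */
static atomic_int sketch_mod_flag;                 /* set while patching */
static atomic_int sketch_in_nmi;                   /* NMIs currently running */

/* Write the pending code. Safe to repeat: the bytes are always the same. */
static void sketch_write_pending(void)
{
    memcpy(sketch_mod_ip, sketch_mod_code, SKETCH_CODE_SIZE);
}

/* Called on NMI entry: help out if a modification is in flight. */
void sketch_nmi_enter(void)
{
    atomic_fetch_add(&sketch_in_nmi, 1);
    if (atomic_load(&sketch_mod_flag))
        sketch_write_pending();
}

/* Called on NMI exit. */
void sketch_nmi_exit(void)
{
    atomic_fetch_sub(&sketch_in_nmi, 1);
}

static void sketch_wait_for_nmi(void)
{
    while (atomic_load(&sketch_in_nmi))
        ;  /* spin until no NMI is running */
}

/* The patcher, following steps 1) to 6) above. */
void sketch_patch(uint8_t *ip, const uint8_t *new_code)
{
    sketch_mod_ip = ip;                                   /* 1) fill buffers */
    memcpy(sketch_mod_code, new_code, SKETCH_CODE_SIZE);
    atomic_store(&sketch_mod_flag, 1);                    /* 2) set the flag */
    sketch_wait_for_nmi();                                /* 3) wait for NMIs */
    sketch_write_pending();                               /* 4) write the code */
    atomic_store(&sketch_mod_flag, 0);                    /* 5) clear the flag */
    sketch_wait_for_nmi();                                /* 6) wait for NMIs */
}
```

An NMI that arrives between steps 2) and 5) writes the same bytes as the patcher, so it never executes a half-modified instruction.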
Thanks to Ingo Molnar for suggesting the idea, and to Arjan van de Ven
for his guidance on what is safe and what is not.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: include file dependency cleanup
Fix compile errors of files that include asm/uv/uv_hub.h but do
not include linux/timer.h.
[ such files are not mainline right now. ]
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
To the unsuspecting user it is quite annoying that this broken
definition, inconsistent with the x86-64 one, still exists.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: some new sparse warnings in e820.c etc, but no functional change.
As with regular ioremap, iounmap etc, annotate with __iomem.
Fixes the following sparse warnings, but will produce some new ones
elsewhere in arch/x86 that will get worked out over time.
arch/x86/mm/ioremap.c:402:9: warning: cast removes address space of expression
arch/x86/mm/ioremap.c:406:10: warning: cast adds address space to expression (<asn:2>)
arch/x86/mm/ioremap.c:782:19: warning: Using plain integer as NULL pointer
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>