Commit graph

17455 commits

Author SHA1 Message Date
Konrad Rzeszutek Wilk
ef35a4e6d9 xen/time: Add default value of -1 for IRQ and check for that.
If the timer interrupt has been de-init or is just now being
initialized, the default value of -1 should be preset as
interrupt line. Check for that and if something is odd
WARN us.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-04-16 16:05:14 -04:00
Konrad Rzeszutek Wilk
7918c92ae9 xen/time: Fix kasprintf splat when allocating timer%d IRQ line.
When we online the CPU, we get this splat:

smpboot: Booting Node 0 Processor 1 APIC 0x2
installing Xen timer for CPU 1
BUG: sleeping function called from invalid context at /home/konrad/ssd/konrad/linux/mm/slab.c:3179
in_atomic(): 1, irqs_disabled(): 0, pid: 0, name: swapper/1
Pid: 0, comm: swapper/1 Not tainted 3.9.0-rc6upstream-00001-g3884fad #1
Call Trace:
 [<ffffffff810c1fea>] __might_sleep+0xda/0x100
 [<ffffffff81194617>] __kmalloc_track_caller+0x1e7/0x2c0
 [<ffffffff81303758>] ? kasprintf+0x38/0x40
 [<ffffffff813036eb>] kvasprintf+0x5b/0x90
 [<ffffffff81303758>] kasprintf+0x38/0x40
 [<ffffffff81044510>] xen_setup_timer+0x30/0xb0
 [<ffffffff810445af>] xen_hvm_setup_cpu_clockevents+0x1f/0x30
 [<ffffffff81666d0a>] start_secondary+0x19c/0x1a8

The solution to that is use kasprintf in the CPU hotplug path
that 'online's the CPU. That is, do it in in xen_hvm_cpu_notify,
and remove the call to in xen_hvm_setup_cpu_clockevents.

Unfortunatly the later is not a good idea as the bootup path
does not use xen_hvm_cpu_notify so we would end up never allocating
timer%d interrupt lines when booting. As such add the check for
atomic() to continue.

CC: stable@vger.kernel.org
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-04-16 16:05:07 -04:00
Yang Zhang
5a71785dde KVM: VMX: Use posted interrupt to deliver virtual interrupt
If posted interrupt is avaliable, then uses it to inject virtual
interrupt to guest.

Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-04-16 16:32:41 -03:00
Yang Zhang
a20ed54d6e KVM: VMX: Add the deliver posted interrupt algorithm
Only deliver the posted interrupt when target vcpu is running
and there is no previous interrupt pending in pir.

Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-04-16 16:32:40 -03:00
Yang Zhang
cf9e65b773 KVM: Set TMR when programming ioapic entry
We already know the trigger mode of a given interrupt when programming
the ioapice entry. So it's not necessary to set it in each interrupt
delivery.

Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-04-16 16:32:40 -03:00
Yang Zhang
3d81bc7e96 KVM: Call common update function when ioapic entry changed.
Both TMR and EOI exit bitmap need to be updated when ioapic changed
or vcpu's id/ldr/dfr changed. So use common function instead eoi exit
bitmap specific function.

Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-04-16 16:32:40 -03:00
Yang Zhang
01e439be77 KVM: VMX: Check the posted interrupt capability
Detect the posted interrupt feature. If it exists, then set it in vmcs_config.

Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-04-16 16:32:40 -03:00
Yang Zhang
d78f266483 KVM: VMX: Register a new IPI for posted interrupt
Posted Interrupt feature requires a special IPI to deliver posted interrupt
to guest. And it should has a high priority so the interrupt will not be
blocked by others.
Normally, the posted interrupt will be consumed by vcpu if target vcpu is
running and transparent to OS. But in some cases, the interrupt will arrive
when target vcpu is scheduled out. And host will see it. So we need to
register a dump handler to handle it.

Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-04-16 16:32:39 -03:00
Yang Zhang
a547c6db4d KVM: VMX: Enable acknowledge interupt on vmexit
The "acknowledge interrupt on exit" feature controls processor behavior
for external interrupt acknowledgement. When this control is set, the
processor acknowledges the interrupt controller to acquire the
interrupt vector on VM exit.

After enabling this feature, an interrupt which arrived when target cpu is
running in vmx non-root mode will be handled by vmx handler instead of handler
in idt. Currently, vmx handler only fakes an interrupt stack and jump to idt
table to let real handler to handle it. Further, we will recognize the interrupt
and only delivery the interrupt which not belong to current vcpu through idt table.
The interrupt which belonged to current vcpu will be handled inside vmx handler.
This will reduce the interrupt handle cost of KVM.

Also, interrupt enable logic is changed if this feature is turnning on:
Before this patch, hypervior call local_irq_enable() to enable it directly.
Now IF bit is set on interrupt stack frame, and will be enabled on a return from
interrupt handler if exterrupt interrupt exists. If no external interrupt, still
call local_irq_enable() to enable it.

Refer to Intel SDM volum 3, chapter 33.2.

Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-04-16 16:32:39 -03:00
Konrad Rzeszutek Wilk
66ff0fe9e7 xen/smp/spinlock: Fix leakage of the spinlock interrupt line for every CPU online/offline
While we don't use the spinlock interrupt line (see for details
commit f10cd522c5 -
xen: disable PV spinlocks on HVM) - we should still do the proper
init / deinit sequence. We did not do that correctly and for the
CPU init for PVHVM guest we would allocate an interrupt line - but
failed to deallocate the old interrupt line.

This resulted in leakage of an irq_desc but more importantly this splat
as we online an offlined CPU:

genirq: Flags mismatch irq 71. 0002cc20 (spinlock1) vs. 0002cc20 (spinlock1)
Pid: 2542, comm: init.late Not tainted 3.9.0-rc6upstream #1
Call Trace:
 [<ffffffff811156de>] __setup_irq+0x23e/0x4a0
 [<ffffffff81194191>] ? kmem_cache_alloc_trace+0x221/0x250
 [<ffffffff811161bb>] request_threaded_irq+0xfb/0x160
 [<ffffffff8104c6f0>] ? xen_spin_trylock+0x20/0x20
 [<ffffffff813a8423>] bind_ipi_to_irqhandler+0xa3/0x160
 [<ffffffff81303758>] ? kasprintf+0x38/0x40
 [<ffffffff8104c6f0>] ? xen_spin_trylock+0x20/0x20
 [<ffffffff810cad35>] ? update_max_interval+0x15/0x40
 [<ffffffff816605db>] xen_init_lock_cpu+0x3c/0x78
 [<ffffffff81660029>] xen_hvm_cpu_notify+0x29/0x33
 [<ffffffff81676bdd>] notifier_call_chain+0x4d/0x70
 [<ffffffff810bb2a9>] __raw_notifier_call_chain+0x9/0x10
 [<ffffffff8109402b>] __cpu_notify+0x1b/0x30
 [<ffffffff8166834a>] _cpu_up+0xa0/0x14b
 [<ffffffff816684ce>] cpu_up+0xd9/0xec
 [<ffffffff8165f754>] store_online+0x94/0xd0
 [<ffffffff8141d15b>] dev_attr_store+0x1b/0x20
 [<ffffffff81218f44>] sysfs_write_file+0xf4/0x170
 [<ffffffff811a2864>] vfs_write+0xb4/0x130
 [<ffffffff811a302a>] sys_write+0x5a/0xa0
 [<ffffffff8167ada9>] system_call_fastpath+0x16/0x1b
cpu 1 spinlock event irq -16
smpboot: Booting Node 0 Processor 1 APIC 0x2

And if one looks at the /proc/interrupts right after
offlining (CPU1):

  70:          0          0  xen-percpu-ipi       spinlock0
  71:          0          0  xen-percpu-ipi       spinlock1
  77:          0          0  xen-percpu-ipi       spinlock2

There is the oddity of the 'spinlock1' still being present.

CC: stable@vger.kernel.org
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-04-16 15:11:56 -04:00
Konrad Rzeszutek Wilk
888b65b4bc xen/smp: Fix leakage of timer interrupt line for every CPU online/offline.
In the PVHVM path when we do CPU online/offline path we would
leak the timer%d IRQ line everytime we do a offline event. The
online path (xen_hvm_setup_cpu_clockevents via
x86_cpuinit.setup_percpu_clockev) would allocate a new interrupt
line for the timer%d.

But we would still use the old interrupt line leading to:

kernel BUG at /home/konrad/ssd/konrad/linux/kernel/hrtimer.c:1261!
invalid opcode: 0000 [#1] SMP
RIP: 0010:[<ffffffff810b9e21>]  [<ffffffff810b9e21>] hrtimer_interrupt+0x261/0x270
.. snip..
 <IRQ>
 [<ffffffff810445ef>] xen_timer_interrupt+0x2f/0x1b0
 [<ffffffff81104825>] ? stop_machine_cpu_stop+0xb5/0xf0
 [<ffffffff8111434c>] handle_irq_event_percpu+0x7c/0x240
 [<ffffffff811175b9>] handle_percpu_irq+0x49/0x70
 [<ffffffff813a74a3>] __xen_evtchn_do_upcall+0x1c3/0x2f0
 [<ffffffff813a760a>] xen_evtchn_do_upcall+0x2a/0x40
 [<ffffffff8167c26d>] xen_hvm_callback_vector+0x6d/0x80
 <EOI>
 [<ffffffff81666d01>] ? start_secondary+0x193/0x1a8
 [<ffffffff81666cfd>] ? start_secondary+0x18f/0x1a8

There is also the oddity (timer1) in the /proc/interrupts after
offlining CPU1:

  64:       1121          0  xen-percpu-virq      timer0
  78:          0          0  xen-percpu-virq      timer1
  84:          0       2483  xen-percpu-virq      timer2

This patch fixes it.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
CC: stable@vger.kernel.org
2013-04-16 15:11:54 -04:00
Jan Beulich
dec02dea1c xen: drop tracking of IRQ vector
For quite a few Xen versions, this wasn't the IRQ vector anymore
anyway, and it is not being used by the kernel for anything. Hence
drop the field from struct irq_info, and respective function
parameters.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-04-16 15:05:45 -04:00
David Vrabel
96f28bc66a x86/xen: populate boot_params with EDD data
During early setup of a dom0 kernel, populate boot_params with the
Enhanced Disk Drive (EDD) and MBR signature data.  This makes
information on the BIOS boot device available in /sys/firmware/edd/.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-04-16 15:04:37 -04:00
Sergey Vlasov
3668011d4a efi: Export efi_query_variable_store() for efivars.ko
Fixes build with CONFIG_EFI_VARS=m which was broken after the commit
"x86, efivars: firmware bug workarounds should be in platform code".

Signed-off-by: Sergey Vlasov <vsu@altlinux.ru>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
2013-04-16 17:34:07 +01:00
Sergey Vlasov
f6ce500262 x86/Kconfig: Make EFI select UCS2_STRING
The commit "efi: Distinguish between "remaining space" and actually used
space" added usage of ucs2_*() functions to arch/x86/platform/efi/efi.c,
but the only thing which selected UCS2_STRING was EFI_VARS, which is
technically optional and can be built as a module.

Signed-off-by: Sergey Vlasov <vsu@altlinux.ru>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
2013-04-16 17:31:08 +01:00
Stephane Eranian
f1923820c4 perf/x86: Fix offcore_rsp valid mask for SNB/IVB
The valid mask for both offcore_response_0 and
offcore_response_1 was wrong for SNB/SNB-EP,
IVB/IVB-EP. It was possible to write to
reserved bit and cause a GP fault crashing
the kernel.

This patch fixes the problem by correctly marking the
reserved bits in the valid mask for all the processors
mentioned above.

A distinction between desktop and server parts is introduced
because bits 24-30 are only available on the server parts.

This version of the  patch is just a rebase to perf/urgent tree
and should apply to older kernels as well.

Signed-off-by: Stephane Eranian <eranian@google.com>
Cc: peterz@infradead.org
Cc: jolsa@redhat.com
Cc: gregkh@linuxfoundation.org
Cc: security@kernel.org
Cc: ak@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-16 15:02:06 +02:00
Borislav Petkov
1077c932db x86, CPU, AMD: Drop useless label
All we want to do is return from this function so stop jumping around
like a flea for no good reason.

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/1365436666-9837-5-git-send-email-bp@alien8.de
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2013-04-16 11:50:51 +02:00
Borislav Petkov
682469a5db x86, AMD: Correct {rd,wr}msr_amd_safe warnings
The idea with those routines is to slowly phase them out and not call
them on anything else besides K8. They even have a check for that which,
when called too early, fails. Let me explain:

It gets the cpuinfo_x86 pointer from the per_cpu array and when this
happens for cpu0, before its boot_cpu_data has been copied back to the
per_cpu array in smp_store_boot_cpu_info(), we get an empty struct and
thus the check fails.

Use boot_cpu_data directly instead.

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/1365436666-9837-4-git-send-email-bp@alien8.de
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2013-04-16 11:50:51 +02:00
Borislav Petkov
55a36b65ee x86: Fold-in trivial check_config function
Fold it into its single call site. No functionality change.

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/1365436666-9837-3-git-send-email-bp@alien8.de
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2013-04-16 11:50:50 +02:00
Alexandre Courbot
7fd2bf3d32 Remove GENERIC_GPIO config option
GENERIC_GPIO has been made equivalent to GPIOLIB in architecture code
and all driver code has been switch to depend on GPIOLIB. It is thus
safe to have GENERIC_GPIO removed.

Signed-off-by: Alexandre Courbot <acourbot@nvidia.com>
Acked-by: Linus Walleij <linus.walleij@linaro.org>
Acked-by: Grant Likely <grant.likely@secretlab.ca>
2013-04-16 18:47:19 +09:00
Ingo Molnar
b5210b2a34 Merge branch 'uprobes/core' of git://git.kernel.org/pub/scm/linux/kernel/git/oleg/misc into perf/core
Pull uprobes updates from Oleg Nesterov:

 - "uretprobes" - an optimization to uprobes, like kretprobes are an optimization
   to kprobes. "perf probe -x file sym%return" now works like kretprobes.

 - PowerPC fixes plus a couple of cleanups/optimizations in uprobes and trace_uprobes.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-16 11:04:10 +02:00
Wang YanQing
26bfc540f6 x86/mm/gart: Drop unnecessary check
The memblock_find_in_range() return value addr is guaranteed
to be within "addr + aper_size" and not beyond GART_MAX_ADDR.

Signed-off-by: Wang YanQing <udknight@gmail.com>
Cc: yinghai@kernel.org
Link: http://lkml.kernel.org/r/20130416013734.GA14641@udknight
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-16 10:54:40 +02:00
Yang Zhang
aa2fbe6d44 KVM: Let ioapic know the irq line status
Userspace may deliver RTC interrupt without query the status. So we
want to track RTC EOI for this case.

Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-04-15 23:20:34 -03:00
Yang Zhang
106069193c KVM: Add reset/restore rtc_status support
restore rtc_status from migration or save/restore

Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-04-15 23:20:34 -03:00
Yang Zhang
b4f2225c07 KVM: Return destination vcpu on interrupt injection
Add a new parameter to know vcpus who received the interrupt.

Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-04-15 23:20:34 -03:00
Yang Zhang
1fcc7890db KVM: Add vcpu info to ioapic_update_eoi()
Add vcpu info to ioapic_update_eoi, so we can know which vcpu
issued this EOI.

Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-04-15 23:20:33 -03:00
Matthew Garrett
31ff2f20d9 efi: Distinguish between "remaining space" and actually used space
EFI implementations distinguish between space that is actively used by a
variable and space that merely hasn't been garbage collected yet. Space
that hasn't yet been garbage collected isn't available for use and so isn't
counted in the remaining_space field returned by QueryVariableInfo().

Combined with commit 68d9298 this can cause problems. Some implementations
don't garbage collect until the remaining space is smaller than the maximum
variable size, and as a result check_var_size() will always fail once more
than 50% of the variable store has been used even if most of that space is
marked as available for garbage collection. The user is unable to create
new variables, and deleting variables doesn't increase the remaining space.

The problem that 68d9298 was attempting to avoid was one where certain
platforms fail if the actively used space is greater than 50% of the
available storage space. We should be able to calculate that by simply
summing the size of each available variable and subtracting that from
the total storage space. With luck this will fix the problem described in
https://bugzilla.kernel.org/show_bug.cgi?id=55471 without permitting
damage to occur to the machines 68d9298 was attempting to fix.

Signed-off-by: Matthew Garrett <matthew.garrett@nebula.com>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
2013-04-15 21:33:05 +01:00
Matthew Garrett
cc5a080c5d efi: Pass boot services variable info to runtime code
EFI variables can be flagged as being accessible only within boot services.
This makes it awkward for us to figure out how much space they use at
runtime. In theory we could figure this out by simply comparing the results
from QueryVariableInfo() to the space used by all of our variables, but
that fails if the platform doesn't garbage collect on every boot. Thankfully,
calling QueryVariableInfo() while still inside boot services gives a more
reliable answer. This patch passes that information from the EFI boot stub
up to the efi platform code.

Signed-off-by: Matthew Garrett <matthew.garrett@nebula.com>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
2013-04-15 21:31:09 +01:00
Tang Chen
587ff8c4ea x86/mm/hotplug: Put kernel_physical_mapping_remove() declaration in CONFIG_MEMORY_HOTREMOVE
kernel_physical_mapping_remove() is only called by
arch_remove_memory() in init_64.c, which is enclosed in
CONFIG_MEMORY_HOTREMOVE. So when we don't configure
CONFIG_MEMORY_HOTREMOVE, the compiler will give a warning:

	warning: ‘kernel_physical_mapping_remove’ defined but not used

So put kernel_physical_mapping_remove() in
CONFIG_MEMORY_HOTREMOVE.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Cc: linux-mm@kvack.org
Cc: gregkh@linuxfoundation.org
Cc: yinghai@kernel.org
Cc: wency@cn.fujitsu.com
Cc: mgorman@suse.de
Cc: tj@kernel.org
Cc: liwanp@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/1366019207-27818-3-git-send-email-tangchen@cn.fujitsu.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-15 12:03:24 +02:00
Andy Shevchenko
d50ba3687b x86/lib: Fix spelling, put space between a numeral and its units
As suggested by Peter Anvin.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: H . Peter Anvin <hpa@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-15 11:40:32 +02:00
Andy Shevchenko
bb916ff7cd x86/lib: Fix spelling in the comments
Apparently 'byts' should be 'bytes'.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: H . Peter Anvin <hpa@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-15 11:40:31 +02:00
Linus Torvalds
6c4c4d4bda Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
 "Misc fixes"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mm: Flush lazy MMU when DEBUG_PAGEALLOC is set
  x86/mm/cpa/selftest: Fix false positive in CPA self test
  x86/mm/cpa: Convert noop to functional fix
  x86, mm: Patch out arch_flush_lazy_mmu_mode() when running on bare metal
  x86, mm, paravirt: Fix vmalloc_fault oops during lazy MMU updates
2013-04-14 11:13:24 -07:00
Linus Torvalds
ae9f4939ba Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "Misc fixlets"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf: Fix error return code
  ftrace: Fix strncpy() use, use strlcpy() instead of strncpy()
  perf: Fix strncpy() use, use strlcpy() instead of strncpy()
  perf: Fix strncpy() use, always make sure it's NUL terminated
  perf: Fix ring_buffer perf_output_space() boundary calculation
  perf/x86: Fix uninitialized pt_regs in intel_pmu_drain_bts_buffer()
2013-04-14 11:10:44 -07:00
Jan Kiszka
c0d1c770c0 KVM: nVMX: Avoid reading VM_EXIT_INTR_ERROR_CODE needlessly on nested exits
We only need to update vm_exit_intr_error_code if there is a valid exit
interruption information and it comes with a valid error code.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-14 18:27:10 +03:00
Jan Kiszka
e8457c67a4 KVM: nVMX: Fix conditions for interrupt injection
If we are entering guest mode, we do not want L0 to interrupt this
vmentry with all its side effects on the vmcs. Therefore, injection
shall be disallowed during L1->L2 transitions, as in the previous
version. However, this check is conceptually independent of
nested_exit_on_intr, so decouple it.

If L1 traps external interrupts, we can kick the guest from L2 to L1,
also just like the previous code worked. But we no longer need to
consider L1's idt_vectoring_info_field. It will always be empty at this
point. Instead, if L2 has pending events, those are now found in the
architectural queues and will, thus, prevent vmx_interrupt_allowed from
being called at all.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-14 18:27:09 +03:00
Jan Kiszka
5f3d579997 KVM: nVMX: Rework event injection and recovery
The basic idea is to always transfer the pending event injection on
vmexit into the architectural state of the VCPU and then drop it from
there if it turns out that we left L2 to enter L1, i.e. if we enter
prepare_vmcs12.

vmcs12_save_pending_events takes care to transfer pending L0 events into
the queue of L1. That is mandatory as L1 may decide to switch the guest
state completely, invalidating or preserving the pending events for
later injection (including on a different node, once we support
migration).

This concept is based on the rule that a pending vmlaunch/vmresume is
not canceled. Otherwise, we would risk to lose injected events or leak
them into the wrong queues. Encode this rule via a WARN_ON_ONCE at the
entry of nested_vmx_vmexit.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-14 18:27:07 +03:00
Jan Kiszka
3b656cf764 KVM: nVMX: Fix injection of PENDING_INTERRUPT and NMI_WINDOW exits to L1
Check if the interrupt or NMI window exit is for L1 by testing if it has
the corresponding controls enabled. This is required when we allow
direct injection from L0 to L2

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-14 18:27:05 +03:00
Gleb Natapov
188424ba10 KVM: emulator: mark 0xff 0x7d opcode as undefined.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-14 09:44:17 +03:00
Gleb Natapov
1146a78b8d KVM: emulator: Do not fail on emulation of undefined opcode
Emulation of undefined opcode should inject #UD instead of causing
emulation failure. Do that by moving Undefined flag check to emulation
stage and injection #UD there.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-14 09:44:17 +03:00
Gleb Natapov
991eebf9f8 KVM: VMX: do not try to reexecute failed instruction while emulating invalid guest state
During invalid guest state emulation vcpu cannot enter guest mode to try
to reexecute instruction that emulator failed to emulate, so emulation
will happen again and again.  Prevent that by telling the emulator that
instruction reexecution should not be attempted.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-14 09:44:17 +03:00
Gleb Natapov
0b789eee2c KVM: emulator: fix unimplemented instruction detection
Unimplemented instruction detection is broken for group instructions
since it relies on "flags" field of opcode to be zero, but all
instructions in a group inherit flags from a group encoding. Fix that by
having a separate flag for unimplemented instructions.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-14 09:44:16 +03:00
Anton Arapov
791eca1010 uretprobes/x86: Hijack return address
Hijack the return address and replace it with a trampoline address.

Signed-off-by: Anton Arapov <anton@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
2013-04-13 15:31:55 +02:00
Dave Hansen
1de14c3c5c x86-32: Fix possible incomplete TLB invalidate with PAE pagetables
This patch attempts to fix:

	https://bugzilla.kernel.org/show_bug.cgi?id=56461

The symptom is a crash and messages like this:

	chrome: Corrupted page table at address 34a03000
	*pdpt = 0000000000000000 *pde = 0000000000000000
	Bad pagetable: 000f [#1] PREEMPT SMP

Ingo guesses this got introduced by commit 611ae8e3f5 ("x86/tlb:
enable tlb flush range support for x86") since that code started to free
unused pagetables.

On x86-32 PAE kernels, that new code has the potential to free an entire
PMD page and will clear one of the four page-directory-pointer-table
(aka pgd_t entries).

The hardware aggressively "caches" these top-level entries and invlpg
does not actually affect the CPU's copy.  If we clear one we *HAVE* to
do a full TLB flush, otherwise we might continue using a freed pmd page.
(note, we do this properly on the population side in pud_populate()).

This patch tracks whenever we clear one of these entries in the 'struct
mmu_gather', and ensures that we follow up with a full tlb flush.

BTW, I disassembled and checked that:

	if (tlb->fullmm == 0)
and
	if (!tlb->fullmm && !tlb->need_flush_all)

generate essentially the same code, so there should be zero impact there
to the !PAE case.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Artem S Tashkinov <t.artem@mailcity.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-12 16:56:47 -07:00
Jiang Liu
89016506b6 x86/PCI: Implement pcibios_{add|remove}_bus() hooks
Implement pcibios_{add|remove}_bus() hooks for x86 platforms.

Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Signed-off-by: Yijing Wang <wangyijing@huawei.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Yinghai Lu <yinghai@kernel.org>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Myron Stowe <myron.stowe@redhat.com>
2013-04-12 15:38:25 -06:00
Paul Bolle
a7e6567585 x86/mm/fixmap: Remove unused FIX_CYCLONE_TIMER
The last users of FIX_CYCLONE_TIMER were removed in v2.6.18. We
can remove this unneeded constant.

Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
Link: http://lkml.kernel.org/r/1365698982.1427.3.camel@x61.thuisdomein
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-12 07:21:18 +02:00
Boris Ostrovsky
26564600c9 x86/mm: Flush lazy MMU when DEBUG_PAGEALLOC is set
When CONFIG_DEBUG_PAGEALLOC is set page table updates made by
kernel_map_pages() are not made visible (via TLB flush)
immediately if lazy MMU is on. In environments that support lazy
MMU (e.g. Xen) this may lead to fatal page faults, for example,
when zap_pte_range() needs to allocate pages in
__tlb_remove_page() -> tlb_next_batch().

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: konrad.wilk@oracle.com
Link: http://lkml.kernel.org/r/1365703192-2089-1-git-send-email-boris.ostrovsky@oracle.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-12 07:19:19 +02:00
Andrea Arcangeli
18699739b6 x86/mm/cpa/selftest: Fix false positive in CPA self test
If the pmd is not present, _PAGE_PSE will not be set anymore.
Fix the false positive.

Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Stefan Bader <stefan.bader@canonical.com>
Cc: Andy Whitcroft <apw@canonical.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Borislav Petkov <bp@alien8.de>
Link: http://lkml.kernel.org/r/1365687369-30802-1-git-send-email-aarcange@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-12 06:39:20 +02:00
konrad@kernel.org
4d681be3c3 x86, wakeup, sleep: Use pvops functions for changing GDT entries
We check the TSS descriptor before we try to dereference it.
Also we document what the value '9' actually means using the
AMD64 Architecture Programmer's Manual Volume 2, pg 90:
"Hex value 9: Available 64-bit TSS" and pg 91:
"The available 32-bit TSS (09h), which is redefined as the
available 64-bit TSS."

Without this, on Xen, where the GDT is available as R/O (to
protect the hypervisor from the guest modifying it), we end up
with a pagetable fault.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Link: http://lkml.kernel.org/r/1365194544-14648-5-git-send-email-konrad.wilk@oracle.com
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-04-11 15:41:15 -07:00
Konrad Rzeszutek Wilk
357d122670 x86, xen, gdt: Remove the pvops variant of store_gdt.
The two use-cases where we needed to store the GDT were during ACPI S3 suspend
and resume. As the patches:
 x86/gdt/i386: store/load GDT for ACPI S3 or hibernation/resume path is not needed
 x86/gdt/64-bit: store/load GDT for ACPI S3 or hibernate/resume path is not needed.

have demonstrated - there are other mechanism by which the GDT is
saved and reloaded during early resume path.

Hence we do not need to worry about the pvops call-chain for saving the
GDT and can and can eliminate it. The other areas where the store_gdt is
used are never going to be hit when running under the pvops platforms.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Link: http://lkml.kernel.org/r/1365194544-14648-4-git-send-email-konrad.wilk@oracle.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-04-11 15:40:38 -07:00
Konrad Rzeszutek Wilk
84e70971e6 x86-32, gdt: Store/load GDT for ACPI S3 or hibernation/resume path is not needed
During the ACPI S3 suspend, we store the GDT in the wakup_header (see
wakeup_asm.s) field called 'pmode_gdt'.

Which is then used during the resume path and has the same exact
value as what the store/load_gdt do with the saved_context
(which is saved/restored via save/restore_processor_state()).

The flow during resume from ACPI S3 is simpler than the 64-bit
counterpart. We only use the early bootstrap once (wakeup_gdt) and
do various checks in real mode.

After the checks are completed, we load the saved GDT ('pmode_gdt') and
continue on with the resume (by heading to startup_32 in trampoline_32.S) -
which quickly jumps to what was saved in 'pmode_entry'
aka 'wakeup_pmode_return'.

The 'wakeup_pmode_return' restores the GDT (saved_gdt) again (which was
saved in do_suspend_lowlevel initially). After that it ends up calling
the 'ret_point' which calls 'restore_processor_state()'.

We have two opportunities to remove code where we restore the same GDT
twice.

Here is the call chain:
 wakeup_start
       |- lgdtl wakeup_gdt [the work-around broken BIOSes]
       |
       | - lgdtl pmode_gdt [the real one]
       |
       \-- startup_32 (in trampoline_32.S)
              \-- wakeup_pmode_return (in wakeup_32.S)
                       |- lgdtl saved_gdt [the real one]
                       \-- ret_point
                             |..
                             |- call restore_processor_state

The hibernate path is much simpler. During the saving of the hibernation
image we call save_processor_state() and save the contents of that
along with the rest of the kernel in the hibernation image destination.
We save the EIP of 'restore_registers' (restore_jump_address) and
cr3 (restore_cr3).

During hibernate resume, the 'restore_registers' (via the
'restore_jump_address) in hibernate_asm_32.S is invoked which
restores the contents of most registers. Naturally the resume path benefits
from already being in 32-bit mode, so it does not have to reload the GDT.

It only reloads the cr3 (from restore_cr3) and continues on. Note
that the restoration of the restore image page-tables is done prior to
this.

After the 'restore_registers' it returns and we end up called
restore_processor_state() - where we reload the GDT. The reload of
the GDT is not needed as bootup kernel has already loaded the GDT
which is at the same physical location as the the restored kernel.

Note that the hibernation path assumes the GDT is correct during its
'restore_registers'. The assumption in the code is that the restored
image is the same as saved - meaning we are not trying to restore
an different kernel in the virtual address space of a new kernel.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Link: http://lkml.kernel.org/r/1365194544-14648-3-git-send-email-konrad.wilk@oracle.com
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-04-11 15:40:17 -07:00