Before we start calling execve in contexts that honor the full
pt_regs, we need to teach it to initialize all registers.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/65a38a9edee61a1158cfd230800c61dbd963dac5.1444091584.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
For the vDSO, user code wants runtime unwind info. Make sure
that, if we use .cfi directives, we generate it.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/16e29ad8855e6508197000d8c41f56adb00d7580.1444091584.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit:
131484c8da ("x86/debug: Remove perpetually broken, unmaintainable dwarf annotations")
removed all the manual DWARF annotations outside the vDSO. It also removed
the macros we used for the manual annotations.
Re-add these macros so that we can clean up the vDSO annotations.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/4c70bb98a8b773c8ccfaabf6745e569ff43e7f65.1444091584.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
32-bit userspace will now always see the same vDSO, which is
exactly what used to be the int80 vDSO. Subsequent patches will
clean it up and make it support SYSENTER and SYSCALL using
alternatives.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/e7e6b3526fa442502e6125fe69486aab50813c32.1444091584.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This should improve code quality a bit. It also shrinks the kernel text:
Before:
text data bss dec filename
21828379 5194760 1277952 28301091 vmlinux
After:
text data bss dec filename
21827997 5194760 1277952 28300709 vmlinux
... by 382 bytes.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/f427b8002d932e5deab9055e0074bb4e7e80ee39.1444091584.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
GCC doesn't realize that get_user(), put_user(), and their __
variants are unlikely to fail. Tell it.
I noticed this while playing with the C entry code.
Before:
text data bss dec filename
21828763 5194760 1277952 28301475 vmlinux.baseline
After:
text data bss dec filename
21828379 5194760 1277952 28301091 vmlinux.new
The generated code shrunk by 384 bytes.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/dc37bed7024319c3004d950d57151fca6aeacf97.1444091584.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
With the introduction of the context switch preempt_count invariant,
and the demise of PREEMPT_ACTIVE, its pointless to save/restore the
per-cpu preemption count, it must always be 2.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Assuming units of PREEMPT_DISABLE_OFFSET for preempt_count() numbers.
Now that TASK_DEAD no longer results in preempt_count() == 3 during
scheduling, we will always call context_switch() with preempt_count()
== 2.
However, we don't always end up with preempt_count() == 2 in
finish_task_switch() because new tasks get created with
preempt_count() == 1.
Create FORK_PREEMPT_COUNT and set it to 2 and use that in the right
places. Note that we cannot use INIT_PREEMPT_COUNT as that serves
another purpose (boot).
After this, preempt_count() is invariant across the context switch,
with exception of PREEMPT_ACTIVE.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
- Fix VM save performance regression with x86 PV guests.
- Make kexec work in x86 PVHVM guests (if Xen has the soft-reset ABI).
- Other minor fixes.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJWE8wQAAoJEFxbo/MsZsTRVTMH/0eqSg2M78wv4sBl234Y3FE9
AN8KFUdlkK7VN9v0uuLMDSKIWNUuFJIvo/2rElWGRiX2Q+/pfnQg3ZSFhub9S8uL
T4LCvmG9viRFb2oUz792ewqncSw3X98Jpto4smA820gJRjndBSWm5HUKUtPAkv1M
l5DFMEgOeHbu+wCbKD/ZPEt5K9GsIaNviSNoWtYHirZwrd00oLmNbWp+g8lIGQiT
3vLW0SaZzjL6akKxihb/p3WZ9eNmyz8yk0V7dItUEVUB9qoaDDLJ5qIRSHHWTWQD
Jza/GE32VallZLuEXGG5/D86MsnyVYHC+lZtwo2IptOGm8v7WuZRv094wI1ev5c=
=aiDw
-----END PGP SIGNATURE-----
Merge tag 'for-linus-4.3b-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen bug fixes from David Vrabel:
- Fix VM save performance regression with x86 PV guests
- Make kexec work in x86 PVHVM guests (if Xen has the soft-reset ABI)
- Other minor fixes.
* tag 'for-linus-4.3b-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
x86/xen/p2m: hint at the last populated P2M entry
x86/xen: Do not clip xen_e820_map to xen_e820_map_entries when sanitizing map
x86/xen: Support kexec/kdump in HVM guests by doing a soft reset
xen/x86: Don't try to write syscall-related MSRs for PV guests
xen: use correct type for HYPERVISOR_memory_op()
Warn on any residual W+X mappings after setting NX
if DEBUG_WX is enabled. Introduce a separate
X86_PTDUMP_CORE config that enables the code for
dumping the page tables without enabling the debugfs
interface, so that DEBUG_WX can be enabled without
exposing the debugfs interface. Switch EFI_PGT_DUMP
to using X86_PTDUMP_CORE so that it also does not require
enabling the debugfs interface.
On success it prints this to the kernel log:
x86/mm: Checked W+X mappings: passed, no W+X pages found.
On failure it prints a warning and a count of the failed pages:
------------[ cut here ]------------
WARNING: CPU: 1 PID: 1 at arch/x86/mm/dump_pagetables.c:226 note_page+0x610/0x7b0()
x86/mm: Found insecure W+X mapping at address ffffffff81755000/__stop___ex_table+0xfa8/0xabfa8
[...]
Call Trace:
[<ffffffff81380a5f>] dump_stack+0x44/0x55
[<ffffffff8109d3f2>] warn_slowpath_common+0x82/0xc0
[<ffffffff8109d48c>] warn_slowpath_fmt+0x5c/0x80
[<ffffffff8106cfc9>] ? note_page+0x5c9/0x7b0
[<ffffffff8106d010>] note_page+0x610/0x7b0
[<ffffffff8106d409>] ptdump_walk_pgd_level_core+0x259/0x3c0
[<ffffffff8106d5a7>] ptdump_walk_pgd_level_checkwx+0x17/0x20
[<ffffffff81063905>] mark_rodata_ro+0xf5/0x100
[<ffffffff817415a0>] ? rest_init+0x80/0x80
[<ffffffff817415bd>] kernel_init+0x1d/0xe0
[<ffffffff8174cd1f>] ret_from_fork+0x3f/0x70
[<ffffffff817415a0>] ? rest_init+0x80/0x80
---[ end trace a1f23a1e42a2ac76 ]---
x86/mm: Checked W+X mappings: FAILED, 171 W+X pages found.
Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/1444064120-11450-1-git-send-email-sds@tycho.nsa.gov
[ Improved the Kconfig help text and made the new option default-y
if CONFIG_DEBUG_RODATA=y, because it already found buggy mappings,
so we really want people to have this on by default. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull x86 fixes from Ingo Molnar:
"Fixes all around the map: W+X kernel mapping fix, WCHAN fixes, two
build failure fixes for corner case configs, x32 header fix and a
speling fix"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/headers/uapi: Fix __BITS_PER_LONG value for x32 builds
x86/mm: Set NX on gap between __ex_table and rodata
x86/kexec: Fix kexec crash in syscall kexec_file_load()
x86/process: Unify 32bit and 64bit implementations of get_wchan()
x86/process: Add proper bound checks in 64bit get_wchan()
x86, efi, kasan: Fix build failure on !KASAN && KMEMCHECK=y kernels
x86/hyperv: Fix the build in the !CONFIG_KEXEC_CORE case
x86/cpufeatures: Correct spelling of the HWP_NOTIFY flag
Merge misc fixes from Andrew Morton:
"12 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
dmapool: fix overflow condition in pool_find_page()
thermal: avoid division by zero in power allocator
memcg: remove pcp_counter_lock
kprobes: use _do_fork() in samples to make them work again
drivers/input/joystick/Kconfig: zhenhua.c needs BITREVERSE
memcg: make mem_cgroup_read_stat() unsigned
memcg: fix dirty page migration
dax: fix NULL pointer in __dax_pmd_fault()
mm: hugetlbfs: skip shared VMAs when unmapping private pages to satisfy a fault
mm/slab: fix unexpected index mapping result of kmalloc_size(INDEX_NODE+1)
userfaultfd: remove kernel header include from uapi header
arch/x86/include/asm/efi.h: fix build failure
Bugs have trickled in for a new feature in 4.2 (MTRR support in guests)
so I'm reverting it all; let's not make this -rc period busier for KVM
than it's been so far. This covers the four reverts from me.
The fifth patch is being reverted because Radim found a bug in the
implementation of stable scheduler clock, *but* also managed to implement
the feature entirely without hypervisor support. So instead of fixing
the hypervisor side we can remove it completely; 4.4 will get the new
implementation.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJWDXc/AAoJEL/70l94x66D8GoH/0WXeSYHn8+Ql5oZ5vI0QcCG
6MiKVixhHTOpkug2QE4DGClYoFSUPuDEB/w6D7YciNn0quDHFZbI3XEMXYtLobHN
0J9cMv9Vpy5pBVMG/LJOw9pFAJRdhSx/cHU2DW9vUiRG9dO9zuxFzBtUciWLOPAX
tSQfDumeUV30BsTP5ldi9kaIUJBM9oBD4JhES0JHx6ePBvy+9vCRmHotugzrrGx6
N+AbCmwUwxnK29PF9i7KMfex6T8l1uQG3fwWVazHoswsqbFEQyF6NpaSTYoZkjM9
6gaXEE1FQ7tRhuio4bBDos0lLu6iGesveP71p/HpULleq2sbH2ER8TpzR5iSnQA=
=zAJS
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM fixes from Paolo Bonzini:
"(Relatively) a lot of reverts, mostly.
Bugs have trickled in for a new feature in 4.2 (MTRR support in
guests) so I'm reverting it all; let's not make this -rc period busier
for KVM than it's been so far. This covers the four reverts from me.
The fifth patch is being reverted because Radim found a bug in the
implementation of stable scheduler clock, *but* also managed to
implement the feature entirely without hypervisor support. So instead
of fixing the hypervisor side we can remove it completely; 4.4 will
get the new implementation"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
Use WARN_ON_ONCE for missing X86_FEATURE_NRIPS
Update KVM homepage Url
Revert "KVM: SVM: use NPT page attributes"
Revert "KVM: svm: handle KVM_X86_QUIRK_CD_NW_CLEARED in svm_get_mt_mask"
Revert "KVM: SVM: Sync g_pat with guest-written PAT value"
Revert "KVM: x86: apply guest MTRR virtualization on host reserved pages"
Revert "KVM: x86: zero kvmclock_offset when vcpu0 initializes kvmclock system MSR"
This patch updates the Posted-Interrupts Descriptor when vCPU
is blocked.
pre-block:
- Add the vCPU to the blocked per-CPU list
- Set 'NV' to POSTED_INTR_WAKEUP_VECTOR
post-block:
- Remove the vCPU from the per-CPU list
Signed-off-by: Feng Wu <feng.wu@intel.com>
[Concentrate invocation of pre/post-block hooks to vcpu_block. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Select IRQ_BYPASS_MANAGER for x86 when CONFIG_KVM is set
Signed-off-by: Feng Wu <feng.wu@intel.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This patch adds the routine to update IRTE for posted-interrupts
when guest changes the interrupt configuration.
Signed-off-by: Feng Wu <feng.wu@intel.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
[Squashed in automatically generated patch from the build robot
"KVM: x86: vcpu_to_pi_desc() can be static" - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Make kvm_set_msi_irq() public, we can use this function outside.
Signed-off-by: Feng Wu <feng.wu@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This patch defines a new interface kvm_intr_is_single_vcpu(),
which can returns whether the interrupt is for single-CPU or not.
It is used by VT-d PI, since now we only support single-CPU
interrupts, For lowest-priority interrupts, if user configures
it via /proc/irq or uses irqbalance to make it single-CPU, we
can use PI to deliver the interrupts to it. Full functionality
of lowest-priority support will be added later.
Signed-off-by: Feng Wu <feng.wu@intel.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This is friendlier to clients of the code, who are going to prepare
vcpu_data structs unconditionally, even if CONFIG_IRQ_REMAP is not
defined.
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Pass PCOMMIT CPU feature to guest to enable PCOMMIT instruction
Currently we do not catch pcommit instruction for L1 guest and
allow L1 to catch this instruction for L2 if, as required by the spec,
L1 can enumerate the PCOMMIT instruction via CPUID:
| IA32_VMX_PROCBASED_CTLS2[53] (which enumerates support for the
| 1-setting of PCOMMIT exiting) is always the same as
| CPUID.07H:EBX.PCOMMIT[bit 22]. Thus, software can set PCOMMIT exiting
| to 1 if and only if the PCOMMIT instruction is enumerated via CPUID
The spec can be found at
https://software.intel.com/sites/default/files/managed/0d/53/319433-022.pdf
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
HV_X64_MSR_VP_RUNTIME msr used by guest to get
"the time the virtual processor consumes running guest code,
and the time the associated logical processor spends running
hypervisor code on behalf of that guest."
Calculation of this time is performed by task_cputime_adjusted()
for vcpu task.
Necessary to support loading of winhv.sys in guest, which in turn is
required to support Windows VMBus.
Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Gleb Natapov <gleb@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In order to enable userspace PIC support, the userspace PIC needs to
be able to inject local interrupts even when the APICs are in the
kernel.
KVM_INTERRUPT now supports sending local interrupts to an APIC when
APICs are in the kernel.
The ready_for_interrupt_request flag is now only set when the CPU/APIC
will immediately accept and inject an interrupt (i.e. APIC has not
masked the PIC).
When the PIC wishes to initiate an INTA cycle with, say, CPU0, it
kicks CPU0 out of the guest, and renedezvous with CPU0 once it arrives
in userspace.
When the CPU/APIC unmasks the PIC, a KVM_EXIT_IRQ_WINDOW_OPEN is
triggered, so that userspace has a chance to inject a PIC interrupt
if it had been pending.
Overall, this design can lead to a small number of spurious userspace
renedezvous. In particular, whenever the PIC transistions from low to
high while it is masked and whenever the PIC becomes unmasked while
it is low.
Note: this does not buffer more than one local interrupt in the
kernel, so the VMM needs to enter the guest in order to complete
interrupt injection before injecting an additional interrupt.
Compiles for x86.
Can pass the KVM Unit Tests.
Signed-off-by: Steve Rutherford <srutherford@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In order to support a userspace IOAPIC interacting with an in kernel
APIC, the EOI exit bitmaps need to be configurable.
If the IOAPIC is in userspace (i.e. the irqchip has been split), the
EOI exit bitmaps will be set whenever the GSI Routes are configured.
In particular, for the low MSI routes are reservable for userspace
IOAPICs. For these MSI routes, the EOI Exit bit corresponding to the
destination vector of the route will be set for the destination VCPU.
The intention is for the userspace IOAPICs to use the reservable MSI
routes to inject interrupts into the guest.
This is a slight abuse of the notion of an MSI Route, given that MSIs
classically bypass the IOAPIC. It might be worthwhile to add an
additional route type to improve clarity.
Compile tested for Intel x86.
Signed-off-by: Steve Rutherford <srutherford@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Adds KVM_EXIT_IOAPIC_EOI which allows the kernel to EOI
level-triggered IOAPIC interrupts.
Uses a per VCPU exit bitmap to decide whether or not the IOAPIC needs
to be informed (which is identical to the EOI_EXIT_BITMAP field used
by modern x86 processors, but can also be used to elide kvm IOAPIC EOI
exits on older processors).
[Note: A prototype using ResampleFDs found that decoupling the EOI
from the VCPU's thread made it possible for the VCPU to not see a
recent EOI after reentering the guest. This does not match real
hardware.]
Compile tested for Intel x86.
Signed-off-by: Steve Rutherford <srutherford@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
First patch in a series which enables the relocation of the
PIC/IOAPIC to userspace.
Adds capability KVM_CAP_SPLIT_IRQCHIP;
KVM_CAP_SPLIT_IRQCHIP enables the construction of LAPICs without the
rest of the irqchip.
Compile tested for x86.
Signed-off-by: Steve Rutherford <srutherford@google.com>
Suggested-by: Andrew Honig <ahonig@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
We can reuse the algorithm that computes the EOI exit bitmap to figure
out which vectors are handled by the IOAPIC. The only difference
between the two is for edge-triggered interrupts other than IRQ8
that have no notifiers active; however, the IOAPIC does not have to
do anything special for these interrupts anyway.
This again limits the interactions between the IOAPIC and the LAPIC,
making it easier to move the former to userspace.
Inspired by a patch from Steve Rutherford.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This merges a cleanup of asm/apic.h, which is needed by the KVM patches
to support VT-d posted interrupts.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Some CONFIG_X86_X2APIC functions, especially x2apic_enabled(), are not
declared if !CONFIG_X86_LOCAL_APIC. However, the same stubs that work
for !CONFIG_X86_X2APIC are okay even if there is no local APIC support
at all.
Avoid the introduction of #ifdefs by moving the x2apic declarations
completely outside the CONFIG_X86_LOCAL_APIC block. (Unfortunately,
diff generation messes up the actual change that this patch makes).
There is no semantic change because CONFIG_X86_X2APIC depends on
CONFIG_X86_LOCAL_APIC.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: Feng Wu <feng.wu@intel.com>
Link: http://lkml.kernel.org/r/1443435991-35750-1-git-send-email-pbonzini@redhat.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
HYPERVISOR_memory_op() is defined to return an "int" value. This is
wrong, as the Xen hypervisor will return "long".
The sub-function XENMEM_maximum_reservation returns the maximum
number of pages for the current domain. An int will overflow for a
domain configured with 8TB of memory or more.
Correct this by using the correct type.
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Shifting pvclock_vcpu_time_info.system_time on write to KVM system time
MSR is a change of ABI. Probably only 2.6.16 based SLES 10 breaks due
to its custom enhancements to kvmclock, but KVM never declared the MSR
only for one-shot initialization. (Doc says that only one write is
needed.)
This reverts commit b7e60c5aed.
And adds a note to the definition of PVCLOCK_COUNTS_FROM_ZERO.
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Acked-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Pull perf fixes from Thomas Gleixner:
"Another pile of fixes for perf:
- Plug overflows and races in the core code
- Sanitize the flow of the perf syscall so we error out before
handling the more complex and hard to undo setups
- Improve and fix Broadwell and Skylake hardware support
- Revert a fix which broke what it tried to fix in perf tools
- A couple of smaller fixes in various places of perf tools"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf tools: Fix copying of /proc/kcore
perf intel-pt: Remove no_force_psb from documentation
perf probe: Use existing routine to look for a kernel module by dso->short_name
perf/x86: Change test_aperfmperf() and test_intel() to static
tools lib traceevent: Fix string handling in heterogeneous arch environments
perf record: Avoid infinite loop at buildid processing with no samples
perf: Fix races in computing the header sizes
perf: Fix u16 overflows
perf: Restructure perf syscall point of no return
perf/x86/intel: Fix Skylake FRONTEND MSR extrareg mask
perf/x86/intel/pebs: Add PEBS frontend profiling for Skylake
perf/x86/intel: Make the CYCLE_ACTIVITY.* constraint on Broadwell more specific
perf tools: Bool functions shouldn't return -1
tools build: Add test for presence of __get_cpuid() gcc builtin
tools build: Add test for presence of numa_num_possible_cpus() in libnuma
Revert "perf symbols: Fix mismatched declarations for elf_getphdrnum"
perf stat: Fix per-pkg event reporting bug
and a few PPC bug fixes too.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJWBSn7AAoJEL/70l94x66Dxd4H/RT6kWWj9x4grEYUkcJUDyK2
AXm7XcKQm04auwAic8Otr+ts/Qix/50kWmBe/TU0QLgqb8rj5Dj3yGFK6Z1y6mAz
KvaxqMJd4tZGTqN0DDvC2ItEdzjfAdeJZo/FHXqPHVspG0G14T7STLna02LTBBEJ
tNzY9qor8nFhg2fT2szqKaudUNgTqkCTpo57o2BrHE96SHG+m0WdpQCV1F5hPVpg
Te0Pb7qX9xng5n3sQ7IV/t3QYbrza1ACwNQS9XJa0Yu6iEz7JdmVmzHQASK9ynn6
hUHhsNYGx4IsPjPtfJk2GroNaRDZL+VMzw07tfcOvPx8xkS9hS63pwzmSBqfLrM=
=Ywqn
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM fixes from Paolo Bonzini:
"AMD fixes for bugs introduced in the 4.2 merge window, and a few PPC
bug fixes too"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: disable halt_poll_ns as default for s390x
KVM: x86: fix off-by-one in reserved bits check
KVM: x86: use correct page table format to check nested page table reserved bits
KVM: svm: do not call kvm_set_cr0 from init_vmcb
KVM: x86: trap AMD MSRs for the TSeg base and mask
KVM: PPC: Book3S: Take the kvm->srcu lock in kvmppc_h_logical_ci_load/store()
KVM: PPC: Book3S HV: Pass the correct trap argument to kvmhv_commence_exit
KVM: PPC: Book3S HV: Fix handling of interrupted VCPUs
kvm: svm: reset mmu on VCPU reset
We observed some performance degradation on s390x with dynamic
halt polling. Until we can provide a proper fix, let's enable
halt_poll_ns as default only for supported architectures.
Architectures are now free to set their own halt_poll_ns
default value.
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
On x86, cpu_relax() simply calls rep_nop(), which generates one
instruction, PAUSE (aka REP NOP).
With this config:
http://busybox.net/~vda/kernel_config_OPTIMIZE_INLINING_and_Os
gcc-4.7.2 does not always inline rep_nop(): it generates several
copies of this:
<rep_nop> (16 copies, 194 calls):
55 push %rbp
48 89 e5 mov %rsp,%rbp
f3 90 pause
5d pop %rbp
c3 retq
See: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66122
This patch fixes this via s/inline/__always_inline/
on rep_nop() and cpu_relax().
( Forcing inlining only on rep_nop() causes GCC to
deinline cpu_relax(), with almost no change in generated code).
text data bss dec hex filename
88118971 19905208 36421632 144445811 89c1173 vmlinux.before
88118139 19905208 36421632 144444979 89c0e33 vmlinux
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/1443096149-27291-1-git-send-email-dvlasenk@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This reinstates the following commit:
2c7577a758 ("sched/x86_64: Don't save flags on context switch")
which was reverted in:
512255a2ad ("Revert 'sched/x86_64: Don't save flags on context switch'")
Historically, Linux has always saved and restored EFLAGS across
context switches. As far as I know, the only reason to do this
is because of the NT flag. In particular, if something calls
switch_to() with the NT flag set, then we don't want to leak the
NT flag into a different task that might try to IRET and fail
because NT is set.
Before this commit:
8c7aa698ba ("x86_64, entry: Filter RFLAGS.NT on entry from userspace")
we could run system call bodies with NT set. This would be a DoS or possibly
privilege escalation hole if scheduling in such a system call would leak
NT into a different task.
Importantly, we don't need to worry about NT being set while
preemptible or across page faults. The only way we can schedule
due to preemption or a page fault is in an interrupt entry that
nests inside the SYSENTER prologue. The CPU will clear NT when
entering through an interrupt gate, so we won't schedule with NT
set.
The only other interesting flags are IOPL and AC. Allowing
switch_to() to change IOPL has no effect, as the value loaded
during kernel execution doesn't matter at all except between a
SYSENTER entry and the subsequent PUSHF, and anythign that
interrupts in that window will restore IOPL on return.
If we call __switch_to() with AC set, we have bigger problems.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/d4440fdc2a89247bffb7c003d2a9a2952bd46827.1441146105.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch makes sure that atomic_{read,set}() are at least
{READ,WRITE}_ONCE().
We already had the 'requirement' that atomic_read() should use
ACCESS_ONCE(), and most archs had this, but a few were lacking.
All are now converted to use READ_ONCE().
And, by a symmetry and general paranoia argument, upgrade atomic_set()
to use WRITE_ONCE().
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: james.hogan@imgtec.com
Cc: linux-kernel@vger.kernel.org
Cc: oleg@redhat.com
Cc: will.deacon@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In not-instrumented code KASAN replaces instrumented memset/memcpy/memmove
with not-instrumented analogues __memset/__memcpy/__memove.
However, on x86 the EFI stub is not linked with the kernel. It uses
not-instrumented mem*() functions from arch/x86/boot/compressed/string.c
So we don't replace them with __mem*() variants in EFI stub.
On ARM64 the EFI stub is linked with the kernel, so we should replace
mem*() functions with __mem*(), because the EFI stub runs before KASAN
sets up early shadow.
So let's move these #undef mem* into arch's asm/efi.h which is also
included by the EFI stub.
Also, this will fix the warning in 32-bit build reported by kbuild test
robot:
efi-stub-helper.c:599:2: warning: implicit declaration of function 'memcpy'
[akpm@linux-foundation.org: use 80 cols in comment]
Signed-off-by: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Reported-by: Fengguang Wu <fengguang.wu@gmail.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Matt Fleming <matt.fleming@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add 1GHz 64-bit Numachip2 clocksource timer support for accurate
system-wide timekeeping, as core TSCs are unsynchronised.
Additionally, add a per-core clockevent mechanism that interrupts via the
platform IPI vector after a programmed period.
[ tglx: Taking it through x86 due to dependencies ]
Signed-off-by: Daniel J Blueman <daniel@numascale.com>
Acked-by: Steffen Persvold <sp@numascale.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Link: http://lkml.kernel.org/r/1442829745-29311-1-git-send-email-daniel@numascale.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
When sending IPIs, first check if the non-local part of the source and
destination APIC IDs match; if so, send via the local APIC for efficiency.
Secondly, since the AMD BIOS-kernel developer guide states IPI delivery
will occur invarient of prior deliver status, avoid polling the delivery
status bit for efficiency.
Signed-off-by: Daniel J Blueman <daniel@numascale.com>
Acked-by: Steffen Persvold <sp@numascale.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Link: http://lkml.kernel.org/r/1442768522-19217-3-git-send-email-daniel@numascale.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Introduce support for Numachip2 remote interrupts via detecting the right
ACPI SRAT signature.
Access is performed via a fixed mapping in the x86 physical address space.
Signed-off-by: Daniel J Blueman <daniel@numascale.com>
Acked-by: Steffen Persvold <sp@numascale.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Link: http://lkml.kernel.org/r/1442768522-19217-2-git-send-email-daniel@numascale.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Drop unused code and includes in Numachip header files and APIC driver.
Additionally, use the 'numachip1' prefix on Numachip1-specific functions;
this prepares for adding Numachip2 support in later patches.
Signed-off-by: Daniel J Blueman <daniel@numascale.com>
Acked-by: Steffen Persvold <sp@numascale.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Link: http://lkml.kernel.org/r/1442768522-19217-1-git-send-email-daniel@numascale.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>