Commit graph

23357 commits

Author SHA1 Message Date
Ruslan Ruslichenko
86915782ff x86/ioapic: Restore IO-APIC irq_chip retrigger callback
commit 020eb3daaba2857b32c4cf4c82f503d6a00a67de upstream.

commit d32932d02e removed the irq_retrigger callback from the IO-APIC
chip and did not add it to the new IO-APIC-IR irq chip.

Unfortunately the software resend fallback is not enabled on X86, so edge
interrupts which are received during the lazy disabled state of the
interrupt line are not retriggered and therefor lost.

Restore the callbacks.

[ tglx: Massaged changelog ]

Fixes: d32932d02e  ("x86/irq: Convert IOAPIC to use hierarchical irqdomain interfaces")
Signed-off-by: Ruslan Ruslichenko <rruslich@cisco.com>
Cc: xe-linux-external@cisco.com
Link: http://lkml.kernel.org/r/1484662432-13580-1-git-send-email-rruslich@cisco.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-26 08:23:48 +01:00
Bjorn Helgaas
3b434ca859 x86/PCI: Ignore _CRS on Supermicro X8DTH-i/6/iF/6F
commit 89e9f7bcd8744ea25fcf0ac671b8d72c10d7d790 upstream.

Martin reported that the Supermicro X8DTH-i/6/iF/6F advertises incorrect
host bridge windows via _CRS:

  pci_root PNP0A08:00: host bridge window [io  0xf000-0xffff]
  pci_root PNP0A08:01: host bridge window [io  0xf000-0xffff]

Both bridges advertise the 0xf000-0xffff window, which cannot be correct.

Work around this by ignoring _CRS on this system.  The downside is that we
may not assign resources correctly to hot-added PCI devices (if they are
possible on this system).

Link: https://bugzilla.kernel.org/show_bug.cgi?id=42606
Reported-by: Martin Burnicki <martin.burnicki@meinberg.de>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-26 08:23:47 +01:00
Steven Rostedt
97085e2a38 ftrace/x86: Set ftrace_stub to weak to prevent gcc from using short jumps to it
commit 8329e818f14926a6040df86b2668568bde342ebf upstream.

Matt Fleming reported seeing crashes when enabling and disabling
function profiling which uses function graph tracer. Later Namhyung Kim
hit a similar issue and he found that the issue was due to the jmp to
ftrace_stub in ftrace_graph_call was only two bytes, and when it was
changed to jump to the tracing code, it overwrote the ftrace_stub that
was after it.

Masami Hiramatsu bisected this down to a binutils change:

8dcea93252a9ea7dff57e85220a719e2a5e8ab41 is the first bad commit
commit 8dcea93252a9ea7dff57e85220a719e2a5e8ab41
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri May 15 03:17:31 2015 -0700

    Add -mshared option to x86 ELF assembler

    This patch adds -mshared option to x86 ELF assembler.  By default,
    assembler will optimize out non-PLT relocations against defined non-weak
    global branch targets with default visibility.  The -mshared option tells
    the assembler to generate code which may go into a shared library
    where all non-weak global branch targets with default visibility can
    be preempted.  The resulting code is slightly bigger.  This option
    only affects the handling of branch instructions.

Declaring ftrace_stub as a weak call prevents gas from using two byte
jumps to it, which would be converted to a jump to the function graph
code.

Link: http://lkml.kernel.org/r/20160516230035.1dbae571@gandalf.local.home

Reported-by: Matt Fleming <matt@codeblueprint.co.uk>
Reported-by: Namhyung Kim <namhyung@kernel.org>
Tested-by: Matt Fleming <matt@codeblueprint.co.uk>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-26 08:23:46 +01:00
Lukasz Odzioba
68b97d287e x86/cpu: Fix bootup crashes by sanitizing the argument of the 'clearcpuid=' command-line option
commit dd853fd216d1485ed3045ff772079cc8689a9a4a upstream.

A negative number can be specified in the cmdline which will be used as
setup_clear_cpu_cap() argument. With that we can clear/set some bit in
memory predceeding boot_cpu_data/cpu_caps_cleared which may cause kernel
to misbehave. This patch adds lower bound check to setup_disablecpuid().

Boris Petkov reproduced a crash:

  [    1.234575] BUG: unable to handle kernel paging request at ffffffff858bd540
  [    1.236535] IP: memcpy_erms+0x6/0x10

Signed-off-by: Lukasz Odzioba <lukasz.odzioba@intel.com>
Acked-by: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: andi.kleen@intel.com
Cc: bp@alien8.de
Cc: dave.hansen@linux.intel.com
Cc: luto@kernel.org
Cc: slaoub@gmail.com
Fixes: ac72e7888a ("x86: add generic clearcpuid=... option")
Link: http://lkml.kernel.org/r/1482933340-11857-1-git-send-email-lukasz.odzioba@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-19 20:17:21 +01:00
Steve Rutherford
9d3875c0c4 KVM: x86: Introduce segmented_write_std
commit 129a72a0d3c8e139a04512325384fe5ac119e74d upstream.

Introduces segemented_write_std.

Switches from emulated reads/writes to standard read/writes in fxsave,
fxrstor, sgdt, and sidt.  This fixes CVE-2017-2584, a longstanding
kernel memory leak.

Since commit 283c95d0e389 ("KVM: x86: emulate FXSAVE and FXRSTOR",
2016-11-09), which is luckily not yet in any final release, this would
also be an exploitable kernel memory *write*!

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Fixes: 96051572c8
Fixes: 283c95d0e3891b64087706b344a4b545d04a6e62
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Steve Rutherford <srutherford@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-19 20:17:20 +01:00
Radim Krčmář
3490e72ad6 KVM: x86: emulate FXSAVE and FXRSTOR
commit 283c95d0e3891b64087706b344a4b545d04a6e62 upstream.

Internal errors were reported on 16 bit fxsave and fxrstor with ipxe.
Old Intels don't have unrestricted_guest, so we have to emulate them.

The patch takes advantage of the hardware implementation.

AMD and Intel differ in saving and restoring other fields in first 32
bytes.  A test wrote 0xff to the fxsave area, 0 to upper bits of MCSXR
in the fxsave area, executed fxrstor, rewrote the fxsave area to 0xee,
and executed fxsave:

  Intel (Nehalem):
    7f 1f 7f 7f ff 00 ff 07 ff ff ff ff ff ff 00 00
    ff ff ff ff ff ff 00 00 ff ff 00 00 ff ff 00 00
  Intel (Haswell -- deprecated FPU CS and FPU DS):
    7f 1f 7f 7f ff 00 ff 07 ff ff ff ff 00 00 00 00
    ff ff ff ff 00 00 00 00 ff ff 00 00 ff ff 00 00
  AMD (Opteron 2300-series):
    7f 1f 7f 7f ff 00 ee ee ee ee ee ee ee ee ee ee
    ee ee ee ee ee ee ee ee ff ff 00 00 ff ff 02 00

fxsave/fxrstor will only be emulated on early Intels, so KVM can't do
much to improve the situation.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-19 20:17:19 +01:00
Radim Krčmář
d9c4c1e7c2 KVM: x86: add asm_safe wrapper
commit aabba3c6abd50b05b1fc2c6ec44244aa6bcda576 upstream.

Move the existing exception handling for inline assembly into a macro
and switch its return values to X86EMUL type.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-19 20:17:19 +01:00
Radim Krčmář
4fa0090249 KVM: x86: add Align16 instruction flag
commit d3fe959f81024072068e9ed86b39c2acfd7462a9 upstream.

Needed for FXSAVE and FXRSTOR.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-19 20:17:19 +01:00
David Matlack
1fc673d96f KVM: x86: flush pending lapic jump label updates on module unload
commit cef84c302fe051744b983a92764d3fcca933415d upstream.

KVM's lapic emulation uses static_key_deferred (apic_{hw,sw}_disabled).
These are implemented with delayed_work structs which can still be
pending when the KVM module is unloaded. We've seen this cause kernel
panics when the kvm_intel module is quickly reloaded.

Use the new static_key_deferred_flush() API to flush pending updates on
module unload.

Signed-off-by: David Matlack <dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-19 20:17:19 +01:00
Paolo Bonzini
816307c80d KVM: x86: fix emulation of "MOV SS, null selector"
commit 33ab91103b3415e12457e3104f0e4517ce12d0f3 upstream.

This is CVE-2017-2583.  On Intel this causes a failed vmentry because
SS's type is neither 3 nor 7 (even though the manual says this check is
only done for usable SS, and the dmesg splat says that SS is unusable!).
On AMD it's worse: svm.c is confused and sets CPL to 0 in the vmcb.

The fix fabricates a data segment descriptor when SS is set to a null
selector, so that CPL and SS.DPL are set correctly in the VMCS/vmcb.
Furthermore, only allow setting SS to a NULL selector if SS.RPL < 3;
this in turn ensures CPL < 3 because RPL must be equal to CPL.

Thanks to Andy Lutomirski and Willy Tarreau for help in analyzing
the bug and deciphering the manuals.

Reported-by: Xiaohan Zhang <zhangxiaohan1@huawei.com>
Fixes: 79d5b4c3cd
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-19 20:17:19 +01:00
Xiao Guangrong
7b95f36fc6 KVM: x86: reset MMU on KVM_SET_VCPU_EVENTS
commit 6ef4e07ecd2db21025c446327ecf34414366498b upstream.

Otherwise, mismatch between the smm bit in hflags and the MMU role
can cause a NULL pointer dereference.

Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-12 11:22:43 +01:00
Steven Rostedt (Red Hat)
2ef502e860 ftrace/x86_32: Set ftrace_stub to weak to prevent gcc from using short jumps to it
commit 847fa1a6d3d00f3bdf68ef5fa4a786f644a0dd67 upstream.

With new binutils, gcc may get smart with its optimization and change a jmp
from a 5 byte jump to a 2 byte one even though it was jumping to a global
function. But that global function existed within a 2 byte radius, and gcc
was able to optimize it. Unfortunately, that jump was also being modified
when function graph tracing begins. Since ftrace expected that jump to be 5
bytes, but it was only two, it overwrote code after the jump, causing a
crash.

This was fixed for x86_64 with commit 8329e818f149, with the same subject as
this commit, but nothing was done for x86_32.

Fixes: d61f82d066 ("ftrace: use dynamic patching for updating mcount calls")
Reported-by: Colin Ian King <colin.king@canonical.com>
Tested-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-09 08:07:50 +01:00
Jim Mattson
19aa9c1498 kvm: nVMX: Allow L1 to intercept software exceptions (#BP and #OF)
commit ef85b67385436ddc1998f45f1d6a210f935b3388 upstream.

When L2 exits to L0 due to "exception or NMI", software exceptions
(#BP and #OF) for which L1 has requested an intercept should be
handled by L1 rather than L0. Previously, only hardware exceptions
were forwarded to L1.

Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-09 08:07:50 +01:00
Peter Zijlstra (Intel)
c4db8a7d1e perf/x86: Fix full width counter, counter overflow
commit 7f612a7f0bc13a2361a152862435b7941156b6af upstream.

Lukasz reported that perf stat counters overflow handling is broken on KNL/SLM.

Both these parts have full_width_write set, and that does indeed have
a problem. In order to deal with counter wrap, we must sample the
counter at at least half the counter period (see also the sampling
theorem) such that we can unambiguously reconstruct the count.

However commit:

  069e0c3c40 ("perf/x86/intel: Support full width counting")

sets the sampling interval to the full period, not half.

Fixing that exposes another issue, in that we must not sign extend the
delta value when we shift it right; the counter cannot have
decremented after all.

With both these issues fixed, counter overflow functions correctly
again.

Reported-by: Lukasz Odzioba <lukasz.odzioba@intel.com>
Tested-by: Liang, Kan <kan.liang@intel.com>
Tested-by: Odzioba, Lukasz <lukasz.odzioba@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Fixes: 069e0c3c40 ("perf/x86/intel: Support full width counting")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-12-15 08:49:22 -08:00
Andy Lutomirski
5a5f703019 x86/traps: Ignore high word of regs->cs in early_fixup_exception()
commit fc0e81b2bea0ebceb71889b61d2240856141c9ee upstream.

On the 80486 DX, it seems that some exceptions may leave garbage in
the high bits of CS.  This causes sporadic failures in which
early_fixup_exception() refuses to fix up an exception.

As far as I can tell, this has been buggy for a long time, but the
problem seems to have been exacerbated by commits:

  1e02ce4ccc ("x86: Store a per-cpu shadow copy of CR4")
  e1bfc11c5a6f ("x86/init: Fix cr4_init_shadow() on CR4-less machines")

This appears to have broken for as long as we've had early
exception handling.

[ This backport should apply to kernels from 3.4 - 4.5. ]

Fixes: 4c5023a3fa ("x86-32: Handle exception table entries during early boot")
Cc: H. Peter Anvin <hpa@zytor.com>
Reported-by: Matthew Whitehead <tedheadster@gmail.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-12-08 07:15:24 +01:00
Radim Krčmář
341f9730c2 KVM: x86: check for pic and ioapic presence before use
commit df492896e6dfb44fd1154f5402428d8e52705081 upstream.

Split irqchip allows pic and ioapic routes to be used without them being
created, which results in NULL access.  Check for NULL and avoid it.
(The setup is too racy for a nicer solutions.)

Found by syzkaller:

  general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN
  Dumping ftrace buffer:
     (ftrace buffer empty)
  Modules linked in:
  CPU: 3 PID: 11923 Comm: kworker/3:2 Not tainted 4.9.0-rc5+ #27
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
  Workqueue: events irqfd_inject
  task: ffff88006a06c7c0 task.stack: ffff880068638000
  RIP: 0010:[...]  [...] __lock_acquire+0xb35/0x3380 kernel/locking/lockdep.c:3221
  RSP: 0000:ffff88006863ea20  EFLAGS: 00010006
  RAX: dffffc0000000000 RBX: dffffc0000000000 RCX: 0000000000000000
  RDX: 0000000000000039 RSI: 0000000000000000 RDI: 1ffff1000d0c7d9e
  RBP: ffff88006863ef58 R08: 0000000000000001 R09: 0000000000000000
  R10: 00000000000001c8 R11: 0000000000000000 R12: ffff88006a06c7c0
  R13: 0000000000000001 R14: ffffffff8baab1a0 R15: 0000000000000001
  FS:  0000000000000000(0000) GS:ffff88006d100000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00000000004abdd0 CR3: 000000003e2f2000 CR4: 00000000000026e0
  Stack:
   ffffffff894d0098 1ffff1000d0c7d56 ffff88006863ecd0 dffffc0000000000
   ffff88006a06c7c0 0000000000000000 ffff88006863ecf8 0000000000000082
   0000000000000000 ffffffff815dd7c1 ffffffff00000000 ffffffff00000000
  Call Trace:
   [...] lock_acquire+0x2a2/0x790 kernel/locking/lockdep.c:3746
   [...] __raw_spin_lock include/linux/spinlock_api_smp.h:144
   [...] _raw_spin_lock+0x38/0x50 kernel/locking/spinlock.c:151
   [...] spin_lock include/linux/spinlock.h:302
   [...] kvm_ioapic_set_irq+0x4c/0x100 arch/x86/kvm/ioapic.c:379
   [...] kvm_set_ioapic_irq+0x8f/0xc0 arch/x86/kvm/irq_comm.c:52
   [...] kvm_set_irq+0x239/0x640 arch/x86/kvm/../../../virt/kvm/irqchip.c:101
   [...] irqfd_inject+0xb4/0x150 arch/x86/kvm/../../../virt/kvm/eventfd.c:60
   [...] process_one_work+0xb40/0x1ba0 kernel/workqueue.c:2096
   [...] worker_thread+0x214/0x18a0 kernel/workqueue.c:2230
   [...] kthread+0x328/0x3e0 kernel/kthread.c:209
   [...] ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:433

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Fixes: 49df6397ed ("KVM: x86: Split the APIC from the rest of IRQCHIP.")
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-12-02 09:09:00 +01:00
Radim Krčmář
b7f9404d1b KVM: x86: drop error recovery in em_jmp_far and em_ret_far
commit 2117d5398c81554fbf803f5fd1dc55eb78216c0c upstream.

em_jmp_far and em_ret_far assumed that setting IP can only fail in 64
bit mode, but syzkaller proved otherwise (and SDM agrees).
Code segment was restored upon failure, but it was left uninitialized
outside of long mode, which could lead to a leak of host kernel stack.
We could have fixed that by always saving and restoring the CS, but we
take a simpler approach and just break any guest that manages to fail
as the error recovery is error-prone and modern CPUs don't need emulator
for this.

Found by syzkaller:

  WARNING: CPU: 2 PID: 3668 at arch/x86/kvm/emulate.c:2217 em_ret_far+0x428/0x480
  Kernel panic - not syncing: panic_on_warn set ...

  CPU: 2 PID: 3668 Comm: syz-executor Not tainted 4.9.0-rc4+ #49
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
   [...]
  Call Trace:
   [...] __dump_stack lib/dump_stack.c:15
   [...] dump_stack+0xb3/0x118 lib/dump_stack.c:51
   [...] panic+0x1b7/0x3a3 kernel/panic.c:179
   [...] __warn+0x1c4/0x1e0 kernel/panic.c:542
   [...] warn_slowpath_null+0x2c/0x40 kernel/panic.c:585
   [...] em_ret_far+0x428/0x480 arch/x86/kvm/emulate.c:2217
   [...] em_ret_far_imm+0x17/0x70 arch/x86/kvm/emulate.c:2227
   [...] x86_emulate_insn+0x87a/0x3730 arch/x86/kvm/emulate.c:5294
   [...] x86_emulate_instruction+0x520/0x1ba0 arch/x86/kvm/x86.c:5545
   [...] emulate_instruction arch/x86/include/asm/kvm_host.h:1116
   [...] complete_emulated_io arch/x86/kvm/x86.c:6870
   [...] complete_emulated_mmio+0x4e9/0x710 arch/x86/kvm/x86.c:6934
   [...] kvm_arch_vcpu_ioctl_run+0x3b7a/0x5a90 arch/x86/kvm/x86.c:6978
   [...] kvm_vcpu_ioctl+0x61e/0xdd0 arch/x86/kvm/../../../virt/kvm/kvm_main.c:2557
   [...] vfs_ioctl fs/ioctl.c:43
   [...] do_vfs_ioctl+0x18c/0x1040 fs/ioctl.c:679
   [...] SYSC_ioctl fs/ioctl.c:694
   [...] SyS_ioctl+0x8f/0xc0 fs/ioctl.c:685
   [...] entry_SYSCALL_64_fastpath+0x1f/0xc2

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Fixes: d1442d85cc ("KVM: x86: Handle errors when RIP is set during far jumps")
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-12-02 09:09:00 +01:00
Sebastian Andrzej Siewior
e543f094a3 x86/kexec: add -fno-PIE
commit 90944e40ba1838de4b2a9290cf273f9d76bd3bdd upstream.

If the gcc is configured to do -fPIE by default then the build aborts
later with:
| Unsupported relocation type: unknown type rel type name (29)

Tagging it stable so it is possible to compile recent stable kernels as
well.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Michal Marek <mmarek@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-11-26 09:54:52 +01:00
Ignacio Alvarado
d4a774fdb9 KVM: Disable irq while unregistering user notifier
commit 1650b4ebc99da4c137bfbfc531be4a2405f951dd upstream.

Function user_notifier_unregister should be called only once for each
registered user notifier.

Function kvm_arch_hardware_disable can be executed from an IPI context
which could cause a race condition with a VCPU returning to user mode
and attempting to unregister the notifier.

Signed-off-by: Ignacio Alvarado <ikalvarado@google.com>
Fixes: 18863bdd60 ("KVM: x86 shared msr infrastructure")
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-11-26 09:54:52 +01:00
Paolo Bonzini
b689e86c9a KVM: x86: fix missed SRCU usage in kvm_lapic_set_vapic_addr
commit 7301d6abaea926d685832f7e1f0c37dd206b01f4 upstream.

Reported by syzkaller:

    [ INFO: suspicious RCU usage. ]
    4.9.0-rc4+ #47 Not tainted
    -------------------------------
    ./include/linux/kvm_host.h:536 suspicious rcu_dereference_check() usage!

    stack backtrace:
    CPU: 1 PID: 6679 Comm: syz-executor Not tainted 4.9.0-rc4+ #47
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
     ffff880039e2f6d0 ffffffff81c2e46b ffff88003e3a5b40 0000000000000000
     0000000000000001 ffffffff83215600 ffff880039e2f700 ffffffff81334ea9
     ffffc9000730b000 0000000000000004 ffff88003c4f8420 ffff88003d3f8000
    Call Trace:
     [<     inline     >] __dump_stack lib/dump_stack.c:15
     [<ffffffff81c2e46b>] dump_stack+0xb3/0x118 lib/dump_stack.c:51
     [<ffffffff81334ea9>] lockdep_rcu_suspicious+0x139/0x180 kernel/locking/lockdep.c:4445
     [<     inline     >] __kvm_memslots include/linux/kvm_host.h:534
     [<     inline     >] kvm_memslots include/linux/kvm_host.h:541
     [<ffffffff8105d6ae>] kvm_gfn_to_hva_cache_init+0xa1e/0xce0 virt/kvm/kvm_main.c:1941
     [<ffffffff8112685d>] kvm_lapic_set_vapic_addr+0xed/0x140 arch/x86/kvm/lapic.c:2217

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Fixes: fda4e2e855
Cc: Andrew Honig <ahonig@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-11-26 09:54:51 +01:00
Yazen Ghannam
aea9d760b8 x86/cpu/AMD: Fix cpu_llc_id for AMD Fam17h systems
commit b0b6e86846093c5f8820386bc01515f857dd8faa upstream.

cpu_llc_id (Last Level Cache ID) derivation on AMD Fam17h has an
underflow bug when extracting the socket_id value. It starts from 0
so subtracting 1 from it will result in an invalid value. This breaks
scheduling topology later on since the cpu_llc_id will be incorrect.

For example, the the cpu_llc_id of the *other* CPU in the loops in
set_cpu_sibling_map() underflows and we're generating the funniest
thread_siblings masks and then when I run 8 threads of nbench, they get
spread around the LLC domains in a very strange pattern which doesn't
give you the normal scheduling spread one would expect for performance.

Other things like EDAC use cpu_llc_id so they will be b0rked too.

So, the APIC ID is preset in APICx020 for bits 3 and above: they contain
the core complex, node and socket IDs.

The LLC is at the core complex level so we can find a unique cpu_llc_id
by right shifting the APICID by 3 because then the least significant bit
will be the Core Complex ID.

Tested-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com>
[ Cleaned up and extended the commit message. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Aravind Gopalakrishnan <aravindksg.lkml@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Fixes: 3849e91f57 ("x86/AMD: Fix last level cache topology for AMD Fam17h systems")
Link: http://lkml.kernel.org/r/20161108083506.rvqb5h4chrcptj7d@pd.tnic
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-11-26 09:54:51 +01:00
Owen Hofmann
91e1f7b0eb kvm: x86: Check memopp before dereference (CVE-2016-8630)
commit d9092f52d7e61dd1557f2db2400ddb430e85937e upstream.

Commit 41061cdb98 ("KVM: emulate: do not initialize memopp") removes a
check for non-NULL under incorrect assumptions. An undefined instruction
with a ModR/M byte with Mod=0 and R/M-5 (e.g. 0xc7 0x15) will attempt
to dereference a null pointer here.

Fixes: 41061cdb98
Message-Id: <1477592752-126650-2-git-send-email-osh@google.com>
Signed-off-by: Owen Hofmann <osh@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-11-10 16:36:37 +01:00
Juergen Gross
eeae15fece x86/xen: fix upper bound of pmd loop in xen_cleanhighmap()
commit 1cf38741308c64d08553602b3374fb39224eeb5a upstream.

xen_cleanhighmap() is operating on level2_kernel_pgt only. The upper
bound of the loop setting non-kernel-image entries to zero should not
exceed the size of level2_kernel_pgt.

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-11-10 16:36:36 +01:00
Ido Yariv
159766dff4 KVM: x86: fix wbinvd_dirty_mask use-after-free
commit bd768e146624cbec7122ed15dead8daa137d909d upstream.

vcpu->arch.wbinvd_dirty_mask may still be used after freeing it,
corrupting memory. For example, the following call trace may set a bit
in an already freed cpu mask:
    kvm_arch_vcpu_load
    vcpu_load
    vmx_free_vcpu_nested
    vmx_free_vcpu
    kvm_arch_vcpu_free

Fix this by deferring freeing of wbinvd_dirty_mask.

Signed-off-by: Ido Yariv <ido@wizery.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-11-10 16:36:34 +01:00
Linus Torvalds
dc1555e670 Fix potential infoleak in older kernels
Not upstream as it is not needed there.

So a patch something like this might be a safe way to fix the
potential infoleak in older kernels.

THIS IS UNTESTED. It's a very obvious patch, though, so if it compiles
it probably works. It just initializes the output variable with 0 in
the inline asm description, instead of doing it in the exception
handler.

It will generate slightly worse code (a few unnecessary ALU
operations), but it doesn't have any interactions with the exception
handler implementation.


Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-11-10 16:36:34 +01:00
Greg Kroah-Hartman
aef682ff5a Revert "fix minor infoleak in get_user_ex()"
This reverts commit 9d25c78ec0 which is
1c109fabbd51863475cd12ac206bdd249aee35af upstream

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-10-31 19:56:26 -06:00
Greg Kroah-Hartman
bb730cc301 Revert "x86/mm: Expand the exception table logic to allow new handling options"
This reverts commit fcf5e5198b which is
548acf19234dbda5a52d5a8e7e205af46e9da840 upstream.

Cc: Tony Luck <tony.luck@intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-10-31 19:56:26 -06:00
Tony Luck
fcf5e5198b x86/mm: Expand the exception table logic to allow new handling options
commit 548acf19234dbda5a52d5a8e7e205af46e9da840 upstream.

Huge amounts of help from  Andy Lutomirski and Borislav Petkov to
produce this. Andy provided the inspiration to add classes to the
exception table with a clever bit-squeezing trick, Boris pointed
out how much cleaner it would all be if we just had a new field.

Linus Torvalds blessed the expansion with:

  ' I'd rather not be clever in order to save just a tiny amount of space
    in the exception table, which isn't really criticial for anybody. '

The third field is another relative function pointer, this one to a
handler that executes the actions.

We start out with three handlers:

 1: Legacy - just jumps the to fixup IP
 2: Fault - provide the trap number in %ax to the fixup code
 3: Cleaned up legacy for the uaccess error hack

Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/f6af78fcbd348cf4939875cfda9c19689b5e50b8.1455732970.git.tony.luck@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-10-31 04:14:00 -06:00
Ville Syrjälä
be1cd22fe1 drm/i915: Account for TSEG size when determining 865G stolen base
commit d721b02fd00bf133580f431b82ef37f3b746dfb2 upstream.

Looks like the TSEG lives just above TOUD, stolen comes after TSEG.

The spec seems somewhat self-contradictory in places, in the ESMRAMC
register desctription it says:
 TSEG Size:
  10=(TOUD + 512 KB) to TOUD
  11 =(TOUD + 1 MB) to TOUD

so that agrees with TSEG being at TOUD. But the example given
elsehwere in the spec says:

 TOUD equals 62.5 MB = 03E7FFFFh
 TSEG selected as 512 KB in size,
 Graphics local memory selected as 1 MB in size
 General System RAM available in system = 62.5 MB
 General system RAM range00000000h to 03E7FFFFh
 TSEG address range03F80000h to 03FFFFFFh
 TSEG pre-allocated from03F80000h to 03FFFFFFh
 Graphics local memory pre-allocated from03E80000h to 03F7FFFFh

so here we have TSEG above stolen.

Real world evidence agrees with the TOUD->TSEG->stolen order however, so
let's fix up the code to account for the TSEG size.

Cc: Taketo Kabe <fdporg@vega.pgw.jp>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Fixes: 0ad98c74e0 ("drm/i915: Determine the stolen memory base address on gen2")
Fixes: a4dff76924 ("x86/gpu: Add Intel graphics stolen memory quirk for gen2 platforms")
Reported-by: Taketo Kabe <fdporg@vega.pgw.jp>
Tested-by: Taketo Kabe <fdporg@vega.pgw.jp>
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=96473
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/1470653919-27251-1-git-send-email-ville.syrjala@linux.intel.com
Link: http://download.intel.com/design/chipsets/datashts/25251405.pdf
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-10-31 04:13:58 -06:00
Jiri Slaby
742b0cee45 kvm: x86: memset whole irq_eoi
commit 8678654e3c7ad7b0f4beb03fa89691279cba71f9 upstream.

gcc 7 warns:
arch/x86/kvm/ioapic.c: In function 'kvm_ioapic_reset':
arch/x86/kvm/ioapic.c:597:2: warning: 'memset' used with length equal to number of elements without multiplication by element size [-Wmemset-elt-size]

And it is right. Memset whole array using sizeof operator.

Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
[Added x86 subject tag]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-10-28 03:01:33 -04:00
Dan Williams
c234231da4 x86/e820: Don't merge consecutive E820_PRAM ranges
commit 23446cb66c073b827779e5eb3dec301623299b32 upstream.

Commit:

  917db484dc6a ("x86/boot: Fix kdump, cleanup aborted E820_PRAM max_pfn manipulation")

... fixed up the broken manipulations of max_pfn in the presence of
E820_PRAM ranges.

However, it also broke the sanitize_e820_map() support for not merging
E820_PRAM ranges.

Re-introduce the enabling to keep resource boundaries between
consecutive defined ranges. Otherwise, for example, an environment that
boots with memmap=2G!8G,2G!10G will end up with a single 4G /dev/pmem0
device instead of a /dev/pmem0 and /dev/pmem1 device 2G in size.

Reported-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Zhang Yi <yizhan@redhat.com>
Cc: linux-nvdimm@lists.01.org
Fixes: 917db484dc6a ("x86/boot: Fix kdump, cleanup aborted E820_PRAM max_pfn manipulation")
Link: http://lkml.kernel.org/r/147629530854.10618.10383744751594021268.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-10-28 03:01:33 -04:00
H.J. Lu
74a2862d96 x86/build: Build compressed x86 kernels as PIE
commit 6d92bc9d483aa1751755a66fee8fb39dffb088c0 upstream.

The 32-bit x86 assembler in binutils 2.26 will generate R_386_GOT32X
relocation to get the symbol address in PIC.  When the compressed x86
kernel isn't built as PIC, the linker optimizes R_386_GOT32X relocations
to their fixed symbol addresses.  However, when the compressed x86
kernel is loaded at a different address, it leads to the following
load failure:

  Failed to allocate space for phdrs

during the decompression stage.

If the compressed x86 kernel is relocatable at run-time, it should be
compiled with -fPIE, instead of -fPIC, if possible and should be built as
Position Independent Executable (PIE) so that linker won't optimize
R_386_GOT32X relocation to its fixed symbol address.

Older linkers generate R_386_32 relocations against locally defined
symbols, _bss, _ebss, _got and _egot, in PIE.  It isn't wrong, just less
optimal than R_386_RELATIVE.  But the x86 kernel fails to properly handle
R_386_32 relocations when relocating the kernel.  To generate
R_386_RELATIVE relocations, we mark _bss, _ebss, _got and _egot as
hidden in both 32-bit and 64-bit x86 kernels.

To build a 64-bit compressed x86 kernel as PIE, we need to disable the
relocation overflow check to avoid relocation overflow errors. We do
this with a new linker command-line option, -z noreloc-overflow, which
got added recently:

 commit 4c10bbaa0912742322f10d9d5bb630ba4e15dfa7
 Author: H.J. Lu <hjl.tools@gmail.com>
 Date:   Tue Mar 15 11:07:06 2016 -0700

    Add -z noreloc-overflow option to x86-64 ld

    Add -z noreloc-overflow command-line option to the x86-64 ELF linker to
    disable relocation overflow check.  This can be used to avoid relocation
    overflow check if there will be no dynamic relocation overflow at
    run-time.

The 64-bit compressed x86 kernel is built as PIE only if the linker supports
-z noreloc-overflow.  So far 64-bit relocatable compressed x86 kernel
boots fine even when it is built as a normal executable.

Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
[ Edited the changelog and comments. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-10-20 10:00:47 +02:00
Josh Poimboeuf
63f9190b09 x86/dumpstack: Fix x86_32 kernel_stack_pointer() previous stack access
commit 72b4f6a5e903b071f2a7c4eb1418cbe4eefdc344 upstream.

On x86_32, when an interrupt happens from kernel space, SS and SP aren't
pushed and the existing stack is used.  So pt_regs is effectively two
words shorter, and the previous stack pointer is normally the memory
after the shortened pt_regs, aka '&regs->sp'.

But in the rare case where the interrupt hits right after the stack
pointer has been changed to point to an empty stack, like for example
when call_on_stack() is used, the address immediately after the
shortened pt_regs is no longer on the stack.  In that case, instead of
'&regs->sp', the previous stack pointer should be retrieved from the
beginning of the current stack page.

kernel_stack_pointer() wants to do that, but it forgets to dereference
the pointer.  So instead of returning a pointer to the previous stack,
it returns a pointer to the beginning of the current stack.

Note that it's probably outside of kernel_stack_pointer()'s scope to be
switching stacks at all.  The x86_64 version of this function doesn't do
it, and it would be better for the caller to do it if necessary.  But
that's a patch for another day.  This just fixes the original intent.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nilay Vaish <nilayvaish@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 0788aa6a23 ("x86: Prepare removal of previous_esp from i386 thread_info structure")
Link: http://lkml.kernel.org/r/472453d6e9f6a2d4ab16aaed4935f43117111566.1471535549.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-10-16 17:36:15 +02:00
Mika Westerberg
eba90a4c76 x86/irq: Prevent force migration of irqs which are not in the vector domain
commit db91aa793ff984ac048e199ea1c54202543952fe upstream.

When a CPU is about to be offlined we call fixup_irqs() that resets IRQ
affinities related to the CPU in question. The same thing is also done when
the system is suspended to S-states like S3 (mem).

For each IRQ we try to complete any on-going move regardless whether the
IRQ is actually part of x86_vector_domain. For each IRQ descriptor we fetch
its chip_data, assume it is of type struct apic_chip_data and manipulate it
by clearing old_domain mask etc. For irq_chips that are not part of the
x86_vector_domain, like those created by various GPIO drivers, will find
their chip_data being changed unexpectly.

Below is an example where GPIO chip owned by pinctrl-sunrisepoint.c gets
corrupted after resume:

  # cat /sys/kernel/debug/gpio
  gpiochip0: GPIOs 360-511, parent: platform/INT344B:00, INT344B:00:
   gpio-511 (                    |sysfs               ) in  hi

  # rtcwake -s10 -mmem
  <10 seconds passes>

  # cat /sys/kernel/debug/gpio
  gpiochip0: GPIOs 360-511, parent: platform/INT344B:00, INT344B:00:
   gpio-511 (                    |sysfs               ) in  ?

Note '?' in the output. It means the struct gpio_chip ->get function is
NULL whereas before suspend it was there.

Fix this by first checking that the IRQ belongs to x86_vector_domain before
we try to use the chip_data as struct apic_chip_data.

Reported-and-tested-by: Sakari Ailus <sakari.ailus@linux.intel.com>
Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Link: http://lkml.kernel.org/r/20161003101708.34795-1-mika.westerberg@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-10-16 17:36:15 +02:00
Dan Williams
09634475c7 x86/boot: Fix kdump, cleanup aborted E820_PRAM max_pfn manipulation
commit 917db484dc6a69969d317b3e57add4208a8d9d42 upstream.

In commit:

  ec776ef6bb ("x86/mm: Add support for the non-standard protected e820 type")

Christoph references the original patch I wrote implementing pmem support.
The intent of the 'max_pfn' changes in that commit were to enable persistent
memory ranges to be covered by the struct page memmap by default.

However, that approach was abandoned when Christoph ported the patches [1], and
that functionality has since been replaced by devm_memremap_pages().

In the meantime, this max_pfn manipulation is confusing kdump [2] that
assumes that everything covered by the max_pfn is "System RAM".  This
results in kdump hanging or crashing.

 [1]: https://lists.01.org/pipermail/linux-nvdimm/2015-March/000348.html
 [2]: https://bugzilla.redhat.com/show_bug.cgi?id=1351098

So fix it.

Reported-by: Zhang Yi <yizhan@redhat.com>
Reported-by: Jeff Moyer <jmoyer@redhat.com>
Tested-by: Zhang Yi <yizhan@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-nvdimm@lists.01.org
Fixes: ec776ef6bb ("x86/mm: Add support for the non-standard protected e820 type")
Link: http://lkml.kernel.org/r/147448744538.34910.11287693517367139607.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-10-16 17:36:15 +02:00
Radim Krčmář
ae9ba37c04 KVM: nVMX: postpone VMCS changes on MSR_IA32_APICBASE write
commit dccbfcf52cebb8963246eba5b177b77f26b34da0 upstream.

If vmcs12 does not intercept APIC_BASE writes, then KVM will handle the
write with vmcs02 as the current VMCS.
This will incorrectly apply modifications intended for vmcs01 to vmcs02
and L2 can use it to gain access to L0's x2APIC registers by disabling
virtualized x2APIC while using msr bitmap that assumes enabled.

Postpone execution of vmx_set_virtual_x2apic_mode until vmcs01 is the
current VMCS.  An alternative solution would temporarily make vmcs01 the
current VMCS, but it requires more care.

Fixes: 8d14695f95 ("x86, apicv: add virtual x2apic support")
Reported-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-10-07 15:23:46 +02:00
Andy Lutomirski
7ba8db47da x86/boot: Initialize FPU and X86_FEATURE_ALWAYS even if we don't have CPUID
commit 05fb3c199bb09f5b85de56cc3ede194ac95c5e1f upstream.

Otherwise arch_task_struct_size == 0 and we die.  While we're at it,
set X86_FEATURE_ALWAYS, too.

Reported-by: David Saggiorato <david@saggiorato.net>
Tested-by: David Saggiorato <david@saggiorato.net>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: aaeb5c01c5b ("x86/fpu, sched: Introduce CONFIG_ARCH_WANTS_DYNAMIC_TASK_STRUCT and use it on x86")
Link: http://lkml.kernel.org/r/8de723afbf0811071185039f9088733188b606c9.1475103911.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-10-07 15:23:40 +02:00
Andy Lutomirski
6c0db7437f x86/init: Fix cr4_init_shadow() on CR4-less machines
commit e1bfc11c5a6f40222a698a818dc269113245820e upstream.

cr4_init_shadow() will panic on 486-like machines without CR4.  Fix
it using __read_cr4_safe().

Reported-by: david@saggiorato.net
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 1e02ce4ccc ("x86: Store a per-cpu shadow copy of CR4")
Link: http://lkml.kernel.org/r/43a20f81fb504013bf613913dc25574b45336a61.1475091074.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-10-07 15:23:40 +02:00
Al Viro
9d25c78ec0 fix minor infoleak in get_user_ex()
commit 1c109fabbd51863475cd12ac206bdd249aee35af upstream.

get_user_ex(x, ptr) should zero x on failure.  It's not a lot of a leak
(at most we are leaking uninitialized 64bit value off the kernel stack,
and in a fairly constrained situation, at that), but the fix is trivial,
so...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
[ This sat in different branch from the uaccess fixes since mid-August ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-09-24 10:07:43 +02:00
Arnd Bergmann
268a8bf90e kconfig: tinyconfig: provide whole choice blocks to avoid warnings
commit 236dec051078a8691950f56949612b4b74107e48 upstream.

Using "make tinyconfig" produces a couple of annoying warnings that show
up for build test machines all the time:

    .config:966:warning: override: NOHIGHMEM changes choice state
    .config:965:warning: override: SLOB changes choice state
    .config:963:warning: override: KERNEL_XZ changes choice state
    .config:962:warning: override: CC_OPTIMIZE_FOR_SIZE changes choice state
    .config:933:warning: override: SLOB changes choice state
    .config:930:warning: override: CC_OPTIMIZE_FOR_SIZE changes choice state
    .config:870:warning: override: SLOB changes choice state
    .config:868:warning: override: KERNEL_XZ changes choice state
    .config:867:warning: override: CC_OPTIMIZE_FOR_SIZE changes choice state

I've made a previous attempt at fixing them and we discussed a number of
alternatives.

I tried changing the Makefile to use "merge_config.sh -n
$(fragment-list)" but couldn't get that to work properly.

This is yet another approach, based on the observation that we do want
to see a warning for conflicting 'choice' options, and that we can
simply make them non-conflicting by listing all other options as
disabled.  This is a trivial patch that we can apply independent of
plans for other changes.

Link: http://lkml.kernel.org/r/20160829214952.1334674-2-arnd@arndb.de
Link: https://storage.kernelci.org/mainline/v4.7-rc6/x86-tinyconfig/build.log
https://patchwork.kernel.org/patch/9212749/
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Reviewed-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-09-24 10:07:42 +02:00
Emanuel Czirai
07450b30b7 x86/AMD: Apply erratum 665 on machines without a BIOS fix
commit d1992996753132e2dafe955cccb2fb0714d3cfc4 upstream.

AMD F12h machines have an erratum which can cause DIV/IDIV to behave
unpredictably. The workaround is to set MSRC001_1029[31] but sometimes
there is no BIOS update containing that workaround so let's do it
ourselves unconditionally. It is simple enough.

[ Borislav: Wrote commit message. ]

Signed-off-by: Emanuel Czirai <icanrealizeum@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Yaowu Xu <yaowu@google.com>
Link: http://lkml.kernel.org/r/20160902053550.18097-1-bp@alien8.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-09-24 10:07:37 +02:00
Steven Rostedt
7547bc9f03 x86/paravirt: Do not trace _paravirt_ident_*() functions
commit 15301a570754c7af60335d094dd2d1808b0641a5 upstream.

Łukasz Daniluk reported that on a RHEL kernel that his machine would lock up
after enabling function tracer. I asked him to bisect the functions within
available_filter_functions, which he did and it came down to three:

  _paravirt_nop(), _paravirt_ident_32() and _paravirt_ident_64()

It was found that this is only an issue when noreplace-paravirt is added
to the kernel command line.

This means that those functions are most likely called within critical
sections of the funtion tracer, and must not be traced.

In newer kenels _paravirt_nop() is defined within gcc asm(), and is no
longer an issue.  But both _paravirt_ident_{32,64}() causes the
following splat when they are traced:

 mm/pgtable-generic.c:33: bad pmd ffff8800d2435150(0000000001d00054)
 mm/pgtable-generic.c:33: bad pmd ffff8800d3624190(0000000001d00070)
 mm/pgtable-generic.c:33: bad pmd ffff8800d36a5110(0000000001d00054)
 mm/pgtable-generic.c:33: bad pmd ffff880118eb1450(0000000001d00054)
 NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [systemd-journal:469]
 Modules linked in: e1000e
 CPU: 2 PID: 469 Comm: systemd-journal Not tainted 4.6.0-rc4-test+ #513
 Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v02.05 05/07/2012
 task: ffff880118f740c0 ti: ffff8800d4aec000 task.ti: ffff8800d4aec000
 RIP: 0010:[<ffffffff81134148>]  [<ffffffff81134148>] queued_spin_lock_slowpath+0x118/0x1a0
 RSP: 0018:ffff8800d4aefb90  EFLAGS: 00000246
 RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff88011eb16d40
 RDX: ffffffff82485760 RSI: 000000001f288820 RDI: ffffea0000008030
 RBP: ffff8800d4aefb90 R08: 00000000000c0000 R09: 0000000000000000
 R10: ffffffff821c8e0e R11: 0000000000000000 R12: ffff880000200fb8
 R13: 00007f7a4e3f7000 R14: ffffea000303f600 R15: ffff8800d4b562e0
 FS:  00007f7a4e3d7840(0000) GS:ffff88011eb00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00007f7a4e3f7000 CR3: 00000000d3e71000 CR4: 00000000001406e0
 Call Trace:
   _raw_spin_lock+0x27/0x30
   handle_pte_fault+0x13db/0x16b0
   handle_mm_fault+0x312/0x670
   __do_page_fault+0x1b1/0x4e0
   do_page_fault+0x22/0x30
   page_fault+0x28/0x30
   __vfs_read+0x28/0xe0
   vfs_read+0x86/0x130
   SyS_read+0x46/0xa0
   entry_SYSCALL_64_fastpath+0x1e/0xa8
 Code: 12 48 c1 ea 0c 83 e8 01 83 e2 30 48 98 48 81 c2 40 6d 01 00 48 03 14 c5 80 6a 5d 82 48 89 0a 8b 41 08 85 c0 75 09 f3 90 8b 41 08 <85> c0 74 f7 4c 8b 09 4d 85 c9 74 08 41 0f 18 09 eb 02 f3 90 8b

Reported-by: Łukasz Daniluk <lukasz.daniluk@intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-09-24 10:07:37 +02:00
Paolo Bonzini
92c67861da Revert "KVM: x86: fix missed hardware breakpoints"
[the change is part of 70e4da7a8ff62f2775337b705f45c804bb450454, which
is already in stable kernels 4.1.y to 4.4.y.  this part of the fix
however was later undone, so remove the line again]

The following patches were applied in the wrong order in -stable. This
is the order as they appear in Linus' tree,

 [0] commit 4e422bdd2f84 ("KVM: x86: fix missed hardware breakpoints")
 [1] commit 172b2386ed16 ("KVM: x86: fix missed hardware breakpoints")
 [2] commit 70e4da7a8ff6 ("KVM: x86: fix root cause for missed hardware breakpoints")

but this is the order for linux-4.4.y

 [1] commit fc90441e72 ("KVM: x86: fix missed hardware breakpoints")
 [2] commit 25e8618619 ("KVM: x86: fix root cause for missed hardware breakpoints")
 [0] commit 0f6e5e26e6 ("KVM: x86: fix missed hardware breakpoints")

The upshot is that KVM_DEBUGREG_RELOAD is always set when returning
from kvm_arch_vcpu_load() in stable, but not in Linus' tree.

This happened because [0] and [1] are the same patch.  [0] and [1] come from two
different merges, and the later merge is trivially resolved; when [2]
is applied it reverts both of them.  Instead, when using the [1][2][0]
order, patches applies normally but "KVM: x86: fix missed hardware
breakpoints" is present in the final tree.

Reported-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-09-24 10:07:35 +02:00
Wanpeng Li
2e133f2d2c x86/apic: Do not init irq remapping if ioapic is disabled
commit 2e63ad4bd5dd583871e6602f9d398b9322d358d9 upstream.

native_smp_prepare_cpus
  -> default_setup_apic_routing
    -> enable_IR_x2apic
      -> irq_remapping_prepare
        -> intel_prepare_irq_remapping
          -> intel_setup_irq_remapping

So IR table is setup even if "noapic" boot parameter is added. As a result we
crash later when the interrupt affinity is set due to a half initialized
remapping infrastructure.

Prevent remap initialization when IOAPIC is disabled.

Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Joerg Roedel <joro@8bytes.org>
Link: http://lkml.kernel.org/r/1471954039-3942-1-git-send-email-wanpeng.li@hotmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-09-15 08:27:53 +02:00
Vitaly Kuznetsov
c5852a85ed x86/hyperv: Avoid reporting bogus NMI status for Gen2 instances
[ Upstream commit 1e2ae9ec072f3b7887f456426bc2cf23b80f661a ]

Generation2 instances don't support reporting the NMI status on port 0x61,
read from there returns 'ff' and we end up reporting nonsensical PCI
error (as there is no PCI bus in these instances) on all NMIs:

    NMI: PCI system error (SERR) for reason ff on CPU 0.
    Dazed and confused, but trying to continue

Fix the issue by overriding x86_platform.get_nmi_reason. Use 'booted on
EFI' flag to detect Gen2 instances.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Cathy Avery <cavery@redhat.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: K. Y. Srinivasan <kys@microsoft.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: devel@linuxdriverproject.org
Link: http://lkml.kernel.org/r/1460728232-31433-1-git-send-email-vkuznets@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-09-15 08:27:48 +02:00
Vikas Shivappa
12850f3616 perf/x86/cqm: Fix CQM memory leak and notifier leak
[ Upstream commit ada2f634cd50d050269b67b4e2966582387e7c27 ]

Fixes the hotcpu notifier leak and other global variable memory leaks
during CQM (cache quality of service monitoring) initialization.

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: fenghua.yu@intel.com
Cc: h.peter.anvin@intel.com
Cc: ravi.v.shankar@intel.com
Cc: vikas.shivappa@intel.com
Link: http://lkml.kernel.org/r/1457652732-4499-3-git-send-email-vikas.shivappa@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-09-15 08:27:46 +02:00
Vikas Shivappa
0b152db042 perf/x86/cqm: Fix CQM handling of grouping events into a cache_group
[ Upstream commit a223c1c7ab4cc64537dc4b911f760d851683768a ]

Currently CQM (cache quality of service monitoring) is grouping all
events belonging to same PID to use one RMID. However its not counting
all of these different events. Hence we end up with a count of zero
for all events other than the group leader.

The patch tries to address the issue by keeping a flag in the
perf_event.hw which has other CQM related fields. The field is updated
at event creation and during grouping.

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
[peterz: Changed hw_perf_event::is_group_event to an int]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: fenghua.yu@intel.com
Cc: h.peter.anvin@intel.com
Cc: ravi.v.shankar@intel.com
Cc: vikas.shivappa@intel.com
Link: http://lkml.kernel.org/r/1457652732-4499-2-git-send-email-vikas.shivappa@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>

Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-09-15 08:27:46 +02:00
Denys Vlasenko
77b0e10991 uprobes/x86: Fix RIP-relative handling of EVEX-encoded instructions
commit 68187872c76a96ed4db7bfb064272591f02e208b upstream.

Since instruction decoder now supports EVEX-encoded instructions, two fixes
are needed to correctly handle them in uprobes.

Extended bits for MODRM.rm field need to be sanitized just like we do it
for VEX3, to avoid encoding wrong register for register-relative access.

EVEX has _two_ extended bits: b and x. Theoretically, EVEX.x should be
ignored by the CPU (since GPRs go only up to 15, not 31), but let's be
paranoid here: proper encoding for register-relative access
should have EVEX.x = 1.

Secondly, we should fetch vex.vvvv for EVEX too.
This is now super easy because instruction decoder populates
vex_prefix.bytes[2] for all flavors of (e)vex encodings, even for VEX2.

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jim Keniston <jkenisto@us.ibm.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: linux-kernel@vger.kernel.org
Fixes: 8a764a875f ("x86/asm/decoder: Create artificial 3rd byte for 2-byte VEX")
Link: http://lkml.kernel.org/r/20160811154521.20469-1-dvlasenk@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-09-07 08:32:36 +02:00
Sebastian Andrzej Siewior
ebabe4ad97 x86/mm: Disable preemption during CR3 read+write
commit 5cf0791da5c162ebc14b01eb01631cfa7ed4fa6e upstream.

There's a subtle preemption race on UP kernels:

Usually current->mm (and therefore mm->pgd) stays the same during the
lifetime of a task so it does not matter if a task gets preempted during
the read and write of the CR3.

But then, there is this scenario on x86-UP:

TaskA is in do_exit() and exit_mm() sets current->mm = NULL followed by:

 -> mmput()
 -> exit_mmap()
 -> tlb_finish_mmu()
 -> tlb_flush_mmu()
 -> tlb_flush_mmu_tlbonly()
 -> tlb_flush()
 -> flush_tlb_mm_range()
 -> __flush_tlb_up()
 -> __flush_tlb()
 ->  __native_flush_tlb()

At this point current->mm is NULL but current->active_mm still points to
the "old" mm.

Let's preempt taskA _after_ native_read_cr3() by taskB. TaskB has its
own mm so CR3 has changed.

Now preempt back to taskA. TaskA has no ->mm set so it borrows taskB's
mm and so CR3 remains unchanged. Once taskA gets active it continues
where it was interrupted and that means it writes its old CR3 value
back. Everything is fine because userland won't need its memory
anymore.

Now the fun part:

Let's preempt taskA one more time and get back to taskB. This
time switch_mm() won't do a thing because oldmm (->active_mm)
is the same as mm (as per context_switch()). So we remain
with a bad CR3 / PGD and return to userland.

The next thing that happens is handle_mm_fault() with an address for
the execution of its code in userland. handle_mm_fault() realizes that
it has a PTE with proper rights so it returns doing nothing. But the
CPU looks at the wrong PGD and insists that something is wrong and
faults again. And again. And one more time…

This pagefault circle continues until the scheduler gets tired of it and
puts another task on the CPU. It gets little difficult if the task is a
RT task with a high priority. The system will either freeze or it gets
fixed by the software watchdog thread which usually runs at RT-max prio.
But waiting for the watchdog will increase the latency of the RT task
which is no good.

Fix this by disabling preemption across the critical code section.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/1470404259-26290-1-git-send-email-bigeasy@linutronix.de
[ Prettified the changelog. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-09-07 08:32:35 +02:00
Andy Shevchenko
32b04db4f2 x86/platform/intel_mid_pci: Rework IRQ0 workaround
commit bb27570525a71f48347ed0e0c265063e7952bb61 upstream.

On Intel Merrifield platform several PCI devices have a bogus configuration,
i.e. the IRQ0 had been assigned to few of them. These are PCI root bridge,
eMMC0, HS UART common registers, PWM, and HDMI. The actual interrupt line can
be allocated to one device exclusively, in our case to eMMC0, the rest should
cope without it and basically known drivers for them are not using interrupt
line at all.

Rework IRQ0 workaround, which was previously done to avoid conflict between
eMMC0 and HS UART common registers, to behave differently based on the device
in question, i.e. allocate interrupt line to eMMC0, but silently skip interrupt
allocation for the rest except HS UART common registers which are not used
anyway. With this rework IOSF MBI driver in particular would be used.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Fixes: 39d9b77b8d ("x86/pci/intel_mid_pci: Work around for IRQ0 assignment")
Link: http://lkml.kernel.org/r/1465842481-136852-1-git-send-email-andriy.shevchenko@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-08-20 18:09:27 +02:00