Commit graph

5831 commits

Author SHA1 Message Date
Mathias Krause
ad505a7b4f x86/alternatives: Fix alt_max_short macro to really be a max()
commit 6b32c126d33d5cb379bca280ab8acedc1ca978ff upstream.

The alt_max_short() macro in asm/alternative.h does not work as
intended, leading to nasty bugs. E.g. alt_max_short("1", "3")
evaluates to 3, but alt_max_short("3", "1") evaluates to 1 -- not
exactly the maximum of 1 and 3.

In fact, I had to learn it the hard way by crashing my kernel in not
so funny ways by attempting to make use of the ALTENATIVE_2 macro
with alternatives where the first one was larger than the second
one.

According to [1] and commit dbe4058a6a ("x86/alternatives: Fix
ALTERNATIVE_2 padding generation properly") the right handed side
should read "-(-(a < b))" not "-(-(a - b))". Fix that, to make the
macro work as intended.

While at it, fix up the comments regarding the additional "-", too.
It's not about gas' usage of s32 but brain dead logic of having a
"true" value of -1 for the < operator ... *sigh*

Btw., the one in asm/alternative-asm.h is correct. And, apparently,
all current users of ALTERNATIVE_2() pass same sized alternatives,
avoiding to hit the bug.

[1] http://graphics.stanford.edu/~seander/bithacks.html#IntegerMinOrMax

Reviewed-and-tested-by: Borislav Petkov <bp@suse.de>
Fixes: dbe4058a6a ("x86/alternatives: Fix ALTERNATIVE_2 padding generation properly")
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/1507228213-13095-1-git-send-email-minipli@googlemail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-18 09:20:43 +02:00
Andy Lutomirski
e21d66048d x86/fsgsbase/64: Report FSBASE and GSBASE correctly in core dumps
commit 9584d98bed7a7a904d0702ad06bbcc94703cb5b4 upstream.

In ELF_COPY_CORE_REGS, we're copying from the current task, so
accessing thread.fsbase and thread.gsbase makes no sense.  Just read
the values from the CPU registers.

In practice, the old code would have been correct most of the time
simply because thread.fsbase and thread.gsbase usually matched the
CPU registers.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Chang Seok <chang.seok.bae@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-09-27 11:00:12 +02:00
Arnd Bergmann
218720fe59 x86/io: Add "memory" clobber to insb/insw/insl/outsb/outsw/outsl
commit 7206f9bf108eb9513d170c73f151367a1bdf3dbf upstream.

The x86 version of insb/insw/insl uses an inline assembly that does
not have the target buffer listed as an output. This can confuse
the compiler, leading it to think that a subsequent access of the
buffer is uninitialized:

  drivers/net/wireless/wl3501_cs.c: In function ‘wl3501_mgmt_scan_confirm’:
  drivers/net/wireless/wl3501_cs.c:665:9: error: ‘sig.status’ is used uninitialized in this function [-Werror=uninitialized]
  drivers/net/wireless/wl3501_cs.c:668:12: error: ‘sig.cap_info’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
  drivers/net/sb1000.c: In function 'sb1000_rx':
  drivers/net/sb1000.c:775:9: error: 'st[0]' is used uninitialized in this function [-Werror=uninitialized]
  drivers/net/sb1000.c:776:10: error: 'st[1]' may be used uninitialized in this function [-Werror=maybe-uninitialized]
  drivers/net/sb1000.c:784:11: error: 'st[1]' may be used uninitialized in this function [-Werror=maybe-uninitialized]

I tried to mark the exact input buffer as an output here, but couldn't
figure it out. As suggested by Linus, marking all memory as clobbered
however is good enough too. For the outs operations, I also add the
memory clobber, to force the input to be written to local variables.
This is probably already guaranteed by the "asm volatile", but it can't
hurt to do this for symmetry.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Link: http://lkml.kernel.org/r/20170719125310.2487451-5-arnd@arndb.de
Link: https://lkml.org/lkml/2017/7/12/605
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-09-02 07:06:52 +02:00
Kees Cook
240628085e mm: revert x86_64 and arm64 ELF_ET_DYN_BASE base changes
commit c715b72c1ba406f133217b509044c38d8e714a37 upstream.

Moving the x86_64 and arm64 PIE base from 0x555555554000 to 0x000100000000
broke AddressSanitizer.  This is a partial revert of:

  eab09532d400 ("binfmt_elf: use ELF_ET_DYN_BASE only for PIE")
  02445990a96e ("arm64: move ELF_ET_DYN_BASE to 4GB / 4MB")

The AddressSanitizer tool has hard-coded expectations about where
executable mappings are loaded.

The motivation for changing the PIE base in the above commits was to
avoid the Stack-Clash CVEs that allowed executable mappings to get too
close to heap and stack.  This was mainly a problem on 32-bit, but the
64-bit bases were moved too, in an effort to proactively protect those
systems (proofs of concept do exist that show 64-bit collisions, but
other recent changes to fix stack accounting and setuid behaviors will
minimize the impact).

The new 32-bit PIE base is fine for ASan (since it matches the ET_EXEC
base), so only the 64-bit PIE base needs to be reverted to let x86 and
arm64 ASan binaries run again.  Future changes to the 64-bit PIE base on
these architectures can be made optional once a more dynamic method for
dealing with AddressSanitizer is found.  (e.g.  always loading PIE into
the mmap region for marked binaries.)

Link: http://lkml.kernel.org/r/20170807201542.GA21271@beast
Fixes: eab09532d400 ("binfmt_elf: use ELF_ET_DYN_BASE only for PIE")
Fixes: 02445990a96e ("arm64: move ELF_ET_DYN_BASE to 4GB / 4MB")
Signed-off-by: Kees Cook <keescook@chromium.org>
Reported-by: Kostya Serebryany <kcc@google.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-08-24 17:02:35 -07:00
Marek Marczykowski-Górecki
5c7c179eb9 x86/xen: allow userspace access during hypercalls
commit c54590cac51db8ab5fd30156bdaba34af915e629 upstream.

Userspace application can do a hypercall through /dev/xen/privcmd, and
some for some hypercalls argument is a pointers to user-provided
structure. When SMAP is supported and enabled, hypervisor can't access.
So, lets allow it.

The same applies to HYPERVISOR_dm_op, where additionally privcmd driver
carefully verify buffer addresses.

Cc: stable@vger.kernel.org
Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
[HYPERVISOR_dm_op dropped - not present until 4.11]
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-27 15:06:06 -07:00
Jim Mattson
6ffa92ffae kvm: vmx: Check value written to IA32_BNDCFGS
commit 4531662d1abf6c1f0e5c2b86ddb60e61509786c8 upstream.

Bits 11:2 must be zero and the linear addess in bits 63:12 must be
canonical. Otherwise, WRMSR(BNDCFGS) should raise #GP.

Fixes: 0dd376e709 ("KVM: x86: add MSR_IA32_BNDCFGS to msrs_to_save")
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-21 07:44:59 +02:00
Kees Cook
7eb968cd04 binfmt_elf: use ELF_ET_DYN_BASE only for PIE
commit eab09532d40090698b05a07c1c87f39fdbc5fab5 upstream.

The ELF_ET_DYN_BASE position was originally intended to keep loaders
away from ET_EXEC binaries.  (For example, running "/lib/ld-linux.so.2
/bin/cat" might cause the subsequent load of /bin/cat into where the
loader had been loaded.)

With the advent of PIE (ET_DYN binaries with an INTERP Program Header),
ELF_ET_DYN_BASE continued to be used since the kernel was only looking
at ET_DYN.  However, since ELF_ET_DYN_BASE is traditionally set at the
top 1/3rd of the TASK_SIZE, a substantial portion of the address space
is unused.

For 32-bit tasks when RLIMIT_STACK is set to RLIM_INFINITY, programs are
loaded above the mmap region.  This means they can be made to collide
(CVE-2017-1000370) or nearly collide (CVE-2017-1000371) with
pathological stack regions.

Lowering ELF_ET_DYN_BASE solves both by moving programs below the mmap
region in all cases, and will now additionally avoid programs falling
back to the mmap region by enforcing MAP_FIXED for program loads (i.e.
if it would have collided with the stack, now it will fail to load
instead of falling back to the mmap region).

To allow for a lower ELF_ET_DYN_BASE, loaders (ET_DYN without INTERP)
are loaded into the mmap region, leaving space available for either an
ET_EXEC binary with a fixed location or PIE being loaded into mmap by
the loader.  Only PIE programs are loaded offset from ELF_ET_DYN_BASE,
which means architectures can now safely lower their values without risk
of loaders colliding with their subsequently loaded programs.

For 64-bit, ELF_ET_DYN_BASE is best set to 4GB to allow runtimes to use
the entire 32-bit address space for 32-bit pointers.

Thanks to PaX Team, Daniel Micay, and Rik van Riel for inspiration and
suggestions on how to implement this solution.

Fixes: d1fd836dcf ("mm: split ET_DYN ASLR from mmap ASLR")
Link: http://lkml.kernel.org/r/20170621173201.GA114489@beast
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Daniel Micay <danielmicay@gmail.com>
Cc: Qualys Security Advisory <qsa@qualys.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dmitry Safonov <dsafonov@virtuozzo.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Grzegorz Andrejczuk <grzegorz.andrejczuk@intel.com>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Pratyush Anand <panand@redhat.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-21 07:44:57 +02:00
Mikulas Patocka
646b65808b x86/mm/pat: Don't report PAT on CPUs that don't support it
commit 99c13b8c8896d7bcb92753bf0c63a8de4326e78d upstream.

The pat_enabled() logic is broken on CPUs which do not support PAT and
where the initialization code fails to call pat_init(). Due to that the
enabled flag stays true and pat_enabled() returns true wrongfully.

As a consequence the mappings, e.g. for Xorg, are set up with the wrong
caching mode and the required MTRR setups are omitted.

To cure this the following changes are required:

  1) Make pat_enabled() return true only if PAT initialization was
     invoked and successful.

  2) Invoke init_cache_modes() unconditionally in setup_arch() and
     remove the extra callsites in pat_disable() and the pat disabled
     code path in pat_init().

Also rename __pat_enabled to pat_disabled to reflect the real purpose of
this variable.

Fixes: 9cd25aac1f ("x86/mm/pat: Emulate PAT when it is disabled")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Bernhard Held <berny156@gmx.de>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: "Luis R. Rodriguez" <mcgrof@suse.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/alpine.LRH.2.02.1707041749300.3456@file01.intranet.prod.int.rdu2.redhat.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-15 11:57:50 +02:00
Ladi Prosek
b9b3eb5c77 KVM: x86: fix emulation of RSM and IRET instructions
commit 6ed071f051e12cf7baa1b69d3becb8f232fdfb7b upstream.

On AMD, the effect of set_nmi_mask called by emulate_iret_real and em_rsm
on hflags is reverted later on in x86_emulate_instruction where hflags are
overwritten with ctxt->emul_flags (the kvm_set_hflags call). This manifests
as a hang when rebooting Windows VMs with QEMU, OVMF, and >1 vcpu.

Instead of trying to merge ctxt->emul_flags into vcpu->arch.hflags after
an instruction is emulated, this commit deletes emul_flags altogether and
makes the emulator access vcpu->arch.hflags using two new accessors. This
way all changes, on the emulator side as well as in functions called from
the emulator and accessing vcpu state with emul_to_vcpu, are preserved.

More details on the bug and its manifestation with Windows and OVMF:

  It's a KVM bug in the interaction between SMI/SMM and NMI, specific to AMD.
  I believe that the SMM part explains why we started seeing this only with
  OVMF.

  KVM masks and unmasks NMI when entering and leaving SMM. When KVM emulates
  the RSM instruction in em_rsm, the set_nmi_mask call doesn't stick because
  later on in x86_emulate_instruction we overwrite arch.hflags with
  ctxt->emul_flags, effectively reverting the effect of the set_nmi_mask call.
  The AMD-specific hflag of interest here is HF_NMI_MASK.

  When rebooting the system, Windows sends an NMI IPI to all but the current
  cpu to shut them down. Only after all of them are parked in HLT will the
  initiating cpu finish the restart. If NMI is masked, other cpus never get
  the memo and the initiating cpu spins forever, waiting for
  hal!HalpInterruptProcessorsStarted to drop. That's the symptom we observe.

Fixes: a584539b24 ("KVM: x86: pass the whole hflags field to emulator and back")
Signed-off-by: Ladi Prosek <lprosek@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-05 14:37:22 +02:00
Ben Hutchings
b8cd9dd513 x86, pmem: Fix cache flushing for iovec write < 8 bytes
commit 8376efd31d3d7c44bd05be337adde023cc531fa1 upstream.

Commit 11e63f6d920d added cache flushing for unaligned writes from an
iovec, covering the first and last cache line of a >= 8 byte write and
the first cache line of a < 8 byte write.  But an unaligned write of
2-7 bytes can still cover two cache lines, so make sure we flush both
in that case.

Fixes: 11e63f6d920d ("x86, pmem: fix broken __copy_user_nocache ...")
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-20 14:27:00 +02:00
Dan Williams
d1cc3cdd39 x86, pmem: fix broken __copy_user_nocache cache-bypass assumptions
commit 11e63f6d920d6f2dfd3cd421e939a4aec9a58dcd upstream.

Before we rework the "pmem api" to stop abusing __copy_user_nocache()
for memcpy_to_pmem() we need to fix cases where we may strand dirty data
in the cpu cache. The problem occurs when copy_from_iter_pmem() is used
for arbitrary data transfers from userspace. There is no guarantee that
these transfers, performed by dax_iomap_actor(), will have aligned
destinations or aligned transfer lengths. Backstop the usage
__copy_user_nocache() with explicit cache management in these unaligned
cases.

Yes, copy_from_iter_pmem() is now too big for an inline, but addressing
that is saved for a later patch that moves the entirety of the "pmem
api" into the pmem driver directly.

Fixes: 5de490daec ("pmem: add copy_from_iter_pmem() and clear_pmem()")
Cc: <x86@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-04-27 09:09:34 +02:00
Thomas Gleixner
ec3978e10e x86/vdso: Plug race between mapping and ELF header setup
commit 6fdc6dd90272ce7e75d744f71535cfbd8d77da81 upstream.

The vsyscall32 sysctl can racy against a concurrent fork when it switches
from disabled to enabled:

    arch_setup_additional_pages()
	if (vdso32_enabled)
           --> No mapping
                                        sysctl.vsysscall32()
                                          --> vdso32_enabled = true
    create_elf_tables()
      ARCH_DLINFO_IA32
        if (vdso32_enabled) {
           --> Add VDSO entry with NULL pointer

Make ARCH_DLINFO_IA32 check whether the VDSO mapping has been set up for
the newly forked process or not.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Andy Lutomirski <luto@amacapital.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mathias Krause <minipli@googlemail.com>
Link: http://lkml.kernel.org/r/20170410151723.602367196@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-04-21 09:30:05 +02:00
Linus Torvalds
dc1555e670 Fix potential infoleak in older kernels
Not upstream as it is not needed there.

So a patch something like this might be a safe way to fix the
potential infoleak in older kernels.

THIS IS UNTESTED. It's a very obvious patch, though, so if it compiles
it probably works. It just initializes the output variable with 0 in
the inline asm description, instead of doing it in the exception
handler.

It will generate slightly worse code (a few unnecessary ALU
operations), but it doesn't have any interactions with the exception
handler implementation.


Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-11-10 16:36:34 +01:00
Greg Kroah-Hartman
aef682ff5a Revert "fix minor infoleak in get_user_ex()"
This reverts commit 9d25c78ec0 which is
1c109fabbd51863475cd12ac206bdd249aee35af upstream

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-10-31 19:56:26 -06:00
Greg Kroah-Hartman
bb730cc301 Revert "x86/mm: Expand the exception table logic to allow new handling options"
This reverts commit fcf5e5198b which is
548acf19234dbda5a52d5a8e7e205af46e9da840 upstream.

Cc: Tony Luck <tony.luck@intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-10-31 19:56:26 -06:00
Tony Luck
fcf5e5198b x86/mm: Expand the exception table logic to allow new handling options
commit 548acf19234dbda5a52d5a8e7e205af46e9da840 upstream.

Huge amounts of help from  Andy Lutomirski and Borislav Petkov to
produce this. Andy provided the inspiration to add classes to the
exception table with a clever bit-squeezing trick, Boris pointed
out how much cleaner it would all be if we just had a new field.

Linus Torvalds blessed the expansion with:

  ' I'd rather not be clever in order to save just a tiny amount of space
    in the exception table, which isn't really criticial for anybody. '

The third field is another relative function pointer, this one to a
handler that executes the actions.

We start out with three handlers:

 1: Legacy - just jumps the to fixup IP
 2: Fault - provide the trap number in %ax to the fixup code
 3: Cleaned up legacy for the uaccess error hack

Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/f6af78fcbd348cf4939875cfda9c19689b5e50b8.1455732970.git.tony.luck@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-10-31 04:14:00 -06:00
Andy Lutomirski
6c0db7437f x86/init: Fix cr4_init_shadow() on CR4-less machines
commit e1bfc11c5a6f40222a698a818dc269113245820e upstream.

cr4_init_shadow() will panic on 486-like machines without CR4.  Fix
it using __read_cr4_safe().

Reported-by: david@saggiorato.net
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 1e02ce4ccc ("x86: Store a per-cpu shadow copy of CR4")
Link: http://lkml.kernel.org/r/43a20f81fb504013bf613913dc25574b45336a61.1475091074.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-10-07 15:23:40 +02:00
Al Viro
9d25c78ec0 fix minor infoleak in get_user_ex()
commit 1c109fabbd51863475cd12ac206bdd249aee35af upstream.

get_user_ex(x, ptr) should zero x on failure.  It's not a lot of a leak
(at most we are leaking uninitialized 64bit value off the kernel stack,
and in a fairly constrained situation, at that), but the fix is trivial,
so...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
[ This sat in different branch from the uaccess fixes since mid-August ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-09-24 10:07:43 +02:00
Sebastian Andrzej Siewior
ebabe4ad97 x86/mm: Disable preemption during CR3 read+write
commit 5cf0791da5c162ebc14b01eb01631cfa7ed4fa6e upstream.

There's a subtle preemption race on UP kernels:

Usually current->mm (and therefore mm->pgd) stays the same during the
lifetime of a task so it does not matter if a task gets preempted during
the read and write of the CR3.

But then, there is this scenario on x86-UP:

TaskA is in do_exit() and exit_mm() sets current->mm = NULL followed by:

 -> mmput()
 -> exit_mmap()
 -> tlb_finish_mmu()
 -> tlb_flush_mmu()
 -> tlb_flush_mmu_tlbonly()
 -> tlb_flush()
 -> flush_tlb_mm_range()
 -> __flush_tlb_up()
 -> __flush_tlb()
 ->  __native_flush_tlb()

At this point current->mm is NULL but current->active_mm still points to
the "old" mm.

Let's preempt taskA _after_ native_read_cr3() by taskB. TaskB has its
own mm so CR3 has changed.

Now preempt back to taskA. TaskA has no ->mm set so it borrows taskB's
mm and so CR3 remains unchanged. Once taskA gets active it continues
where it was interrupted and that means it writes its old CR3 value
back. Everything is fine because userland won't need its memory
anymore.

Now the fun part:

Let's preempt taskA one more time and get back to taskB. This
time switch_mm() won't do a thing because oldmm (->active_mm)
is the same as mm (as per context_switch()). So we remain
with a bad CR3 / PGD and return to userland.

The next thing that happens is handle_mm_fault() with an address for
the execution of its code in userland. handle_mm_fault() realizes that
it has a PTE with proper rights so it returns doing nothing. But the
CPU looks at the wrong PGD and insists that something is wrong and
faults again. And again. And one more time…

This pagefault circle continues until the scheduler gets tired of it and
puts another task on the CPU. It gets little difficult if the task is a
RT task with a high priority. The system will either freeze or it gets
fixed by the software watchdog thread which usually runs at RT-max prio.
But waiting for the watchdog will increase the latency of the RT task
which is no good.

Fix this by disabling preemption across the critical code section.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/1470404259-26290-1-git-send-email-bigeasy@linutronix.de
[ Prettified the changelog. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-09-07 08:32:35 +02:00
Toshi Kani
26b340ea33 x86/xen, pat: Remove PAT table init code from Xen
commit 88ba281108ed0c25c9d292b48bd3f272fcb90dd0 upstream.

Xen supports PAT without MTRRs for its guests.  In order to
enable WC attribute, it was necessary for xen_start_kernel()
to call pat_init_cache_modes() to update PAT table before
starting guest kernel.

Now that the kernel initializes PAT table to the BIOS handoff
state when MTRR is disabled, this Xen-specific PAT init code
is no longer necessary.  Delete it from xen_start_kernel().

Also change __init_cache_modes() to a static function since
PAT table should not be tweaked by other modules.

Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Juergen Gross <jgross@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: elliott@hpe.com
Cc: paul.gortmaker@windriver.com
Cc: xen-devel@lists.xenproject.org
Link: http://lkml.kernel.org/r/1458769323-24491-7-git-send-email-toshi.kani@hpe.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-08-16 09:30:49 +02:00
Toshi Kani
594055cf63 x86/mtrr: Fix Xorg crashes in Qemu sessions
commit edfe63ec97ed8d4496225f7ba54c9ce4207c5431 upstream.

A Xorg failure on qemu32 was reported as a regression [1] caused by
commit 9cd25aac1f ("x86/mm/pat: Emulate PAT when it is disabled").

This patch fixes the Xorg crash.

Negative effects of this regression were the following two failures [2]
in Xorg on QEMU with QEMU CPU model "qemu32" (-cpu qemu32), which were
triggered by the fact that its virtual CPU does not support MTRRs.

 #1. copy_process() failed in the check in reserve_pfn_range()

    copy_process
     copy_mm
      dup_mm
       dup_mmap
        copy_page_range
         track_pfn_copy
          reserve_pfn_range

 A WC map request was tracked as WC in memtype, which set a PTE as
 UC (pgprot) per __cachemode2pte_tbl[].  This led to this error in
 reserve_pfn_range() called from track_pfn_copy(), which obtained
 a pgprot from a PTE.  It converts pgprot to page_cache_mode, which
 does not necessarily result in the original page_cache_mode since
 __cachemode2pte_tbl[] redirects multiple types to UC.

 #2. error path in copy_process() then hit WARN_ON_ONCE in
     untrack_pfn().

     x86/PAT: Xorg:509 map pfn expected mapping type uncached-
     minus for [mem 0xfd000000-0xfdffffff], got write-combining
      Call Trace:
     dump_stack
     warn_slowpath_common
     ? untrack_pfn
     ? untrack_pfn
     warn_slowpath_null
     untrack_pfn
     ? __kunmap_atomic
     unmap_single_vma
     ? pagevec_move_tail_fn
     unmap_vmas
     exit_mmap
     mmput
     copy_process.part.47
     _do_fork
     SyS_clone
     do_syscall_32_irqs_on
     entry_INT80_32

These negative effects are caused by two separate bugs, but they
can be addressed in separate patches.  Fixing the pat_init() issue
described below addresses the root cause, and avoids Xorg to hit
these cases.

When the CPU does not support MTRRs, MTRR does not call pat_init(),
which leaves PAT enabled without initializing PAT.  This pat_init()
issue is a long-standing issue, but manifested as issue #1 (and then
hit issue #2) with the above-mentioned commit because the memtype
now tracks cache attribute with 'page_cache_mode'.

This pat_init() issue existed before the commit, but we used pgprot
in memtype.  Hence, we did not have issue #1 before.  But WC request
resulted in WT in effect because WC pgrot is actually WT when PAT
is not initialized.  This is not how it was designed to work.  When
PAT is set to disable properly, WC is converted to UC.  The use of
WT can result in a system crash if the target range does not support
WT.  Fortunately, nobody ran into such issue before.

To fix this pat_init() issue, PAT code has been enhanced to provide
pat_disable() interface.  Call this interface when MTRRs are disabled.
By setting PAT to disable properly, PAT bypasses the memtype check,
and avoids issue #1.

  [1]: https://lkml.org/lkml/2016/3/3/828
  [2]: https://lkml.org/lkml/2016/3/4/775

Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: elliott@hpe.com
Cc: konrad.wilk@oracle.com
Cc: paul.gortmaker@windriver.com
Cc: xen-devel@lists.xenproject.org
Link: http://lkml.kernel.org/r/1458769323-24491-5-git-send-email-toshi.kani@hpe.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-08-16 09:30:49 +02:00
Toshi Kani
d50e8b108e x86/mm/pat: Add pat_disable() interface
commit 224bb1e5d67ba0f2872c98002d6a6f991ac6fd4a upstream.

In preparation for fixing a regression caused by:

  9cd25aac1f ("x86/mm/pat: Emulate PAT when it is disabled")

... PAT needs to provide an interface that prevents the OS from
initializing the PAT MSR.

PAT MSR initialization must be done on all CPUs using the specific
sequence of operations defined in the Intel SDM.  This requires MTRRs
to be enabled since pat_init() is called as part of MTRR init
from mtrr_rendezvous_handler().

Make pat_disable() as the interface that prevents the OS from
initializing the PAT MSR.  MTRR will call this interface when it
cannot provide the SDM-defined sequence to initialize PAT.

This also assures that pat_disable() called from pat_bsp_init()
will set the PAT table properly when CPU does not support PAT.

Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Robert Elliott <elliott@hpe.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: konrad.wilk@oracle.com
Cc: paul.gortmaker@windriver.com
Cc: xen-devel@lists.xenproject.org
Link: http://lkml.kernel.org/r/1458769323-24491-3-git-send-email-toshi.kani@hpe.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-08-16 09:30:49 +02:00
Toshi Kani
8f5b8210ff x86/mm/pat: Add support of non-default PAT MSR setting
commit 02f037d641dc6672be5cfe7875a48ab99b95b154 upstream.

In preparation for fixing a regression caused by:

  9cd25aac1f ("x86/mm/pat: Emulate PAT when it is disabled")'

... PAT needs to support a case that PAT MSR is initialized with a
non-default value.

When pat_init() is called and PAT is disabled, it initializes the
PAT table with the BIOS default value. Xen, however, sets PAT MSR
with a non-default value to enable WC. This causes inconsistency
between the PAT table and PAT MSR when PAT is set to disable on Xen.

Change pat_init() to handle the PAT disable cases properly.  Add
init_cache_modes() to handle two cases when PAT is set to disable.

 1. CPU supports PAT: Set PAT table to be consistent with PAT MSR.
 2. CPU does not support PAT: Set PAT table to be consistent with
    PWT and PCD bits in a PTE.

Note, __init_cache_modes(), renamed from pat_init_cache_modes(),
will be changed to a static function in a later patch.

Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: elliott@hpe.com
Cc: konrad.wilk@oracle.com
Cc: paul.gortmaker@windriver.com
Cc: xen-devel@lists.xenproject.org
Link: http://lkml.kernel.org/r/1458769323-24491-2-git-send-email-toshi.kani@hpe.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-08-16 09:30:49 +02:00
Minfei Huang
ca3455867e pvclock: Add CPU barriers to get correct version value
commit 749d088b8e7f4b9826ede02b9a043e417fa84aa1 upstream.

Protocol for the "version" fields is: hypervisor raises it (making it
uneven) before it starts updating the fields and raises it again (making
it even) when it is done.  Thus the guest can make sure the time values
it got are consistent by checking the version before and after reading
them.

Add CPU barries after getting version value just like what function
vread_pvclock does, because all of callees in this function is inline.

Fixes: 502dfeff23
Signed-off-by: Minfei Huang <mnghuan@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-08-10 11:49:27 +02:00
Jan Beulich
27b3cc048a x86/mm/xen: Suppress hugetlbfs in PV guests
commit 103f6112f253017d7062cd74d17f4a514ed4485c upstream.

Huge pages are not normally available to PV guests. Not suppressing
hugetlbfs use results in an endless loop of page faults when user mode
code tries to access a hugetlbfs mapped area (since the hypervisor
denies such PTEs to be created, but error indications can't be
propagated out of xen_set_pte_at(), just like for various of its
siblings), and - once killed in an oops like this:

  kernel BUG at .../fs/hugetlbfs/inode.c:428!
  invalid opcode: 0000 [#1] SMP
  ...
  RIP: e030:[<ffffffff811c333b>]  [<ffffffff811c333b>] remove_inode_hugepages+0x25b/0x320
  ...
  Call Trace:
   [<ffffffff811c3415>] hugetlbfs_evict_inode+0x15/0x40
   [<ffffffff81167b3d>] evict+0xbd/0x1b0
   [<ffffffff8116514a>] __dentry_kill+0x19a/0x1f0
   [<ffffffff81165b0e>] dput+0x1fe/0x220
   [<ffffffff81150535>] __fput+0x155/0x200
   [<ffffffff81079fc0>] task_work_run+0x60/0xa0
   [<ffffffff81063510>] do_exit+0x160/0x400
   [<ffffffff810637eb>] do_group_exit+0x3b/0xa0
   [<ffffffff8106e8bd>] get_signal+0x1ed/0x470
   [<ffffffff8100f854>] do_signal+0x14/0x110
   [<ffffffff810030e9>] prepare_exit_to_usermode+0xe9/0xf0
   [<ffffffff814178a5>] retint_user+0x8/0x13

This is CVE-2016-3961 / XSA-174.

Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Juergen Gross <JGross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: xen-devel <xen-devel@lists.xenproject.org>
Link: http://lkml.kernel.org/r/57188ED802000078000E431C@prv-mh.provo.novell.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-05-04 14:48:42 -07:00
Bjorn Helgaas
c1491657c5 Revert "PCI, x86: Implement pcibios_alloc_irq() and pcibios_free_irq()"
commit 6c777e8799a93e3bdb67bec622429e1b48dc90fb upstream.

991de2e590 ("PCI, x86: Implement pcibios_alloc_irq() and
pcibios_free_irq()") appeared in v4.3 and helps support IOAPIC hotplug.

Олег reported that the Elcus-1553 TA1-PCI driver worked in v4.2 but not
v4.3 and bisected it to 991de2e590.  Sunjin reported that the RocketRAID
272x driver worked in v4.2 but not v4.3.  In both cases booting with
"pci=routirq" is a workaround.

I think the problem is that after 991de2e590, we no longer call
pcibios_enable_irq() for upstream bridges.  Prior to 991de2e590, when a
driver called pci_enable_device(), we recursively called
pcibios_enable_irq() for upstream bridges via pci_enable_bridge().

After 991de2e590, we call pcibios_enable_irq() from pci_device_probe()
instead of the pci_enable_device() path, which does *not* call
pcibios_enable_irq() for upstream bridges.

Revert 991de2e590 to fix these driver regressions.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=111211
Fixes: 991de2e590 ("PCI, x86: Implement pcibios_alloc_irq() and pcibios_free_irq()")
Reported-and-tested-by: Олег Мороз <oleg.moroz@mcc.vniiem.ru>
Reported-by: Sunjin Yang <fan4326@gmail.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
CC: Jiang Liu <jiang.liu@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-20 15:42:16 +09:00
Paolo Bonzini
84f2443e21 KVM: x86: reduce default value of halt_poll_ns parameter
commit 14ebda3394fd3e5388747e742e510b0802a65d24 upstream.

Windows lets applications choose the frequency of the timer tick,
and in Windows 10 the maximum rate was changed from 1024 Hz to
2048 Hz.  Unfortunately, because of the way the Windows API
works, most applications who need a higher rate than the default
64 Hz will just do

   timeGetDevCaps(&tc, sizeof(tc));
   timeBeginPeriod(tc.wPeriodMin);

and pick the maximum rate.  This causes very high CPU usage when
playing media or games on Windows 10, even if the guest does not
actually use the CPU very much, because the frequent timer tick
causes halt_poll_ns to kick in.

There is no really good solution, especially because Microsoft
could sooner or later bump the limit to 4096 Hz, but for now
the best we can do is lower a bit the upper limit for
halt_poll_ns. :-(

Reported-by: Jon Panozzo <jonp@lime-technology.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-20 15:42:07 +09:00
Andy Lutomirski
0f63ab5873 x86/iopl/64: Properly context-switch IOPL on Xen PV
commit b7a584598aea7ca73140cb87b40319944dd3393f upstream.

On Xen PV, regs->flags doesn't reliably reflect IOPL and the
exit-to-userspace code doesn't change IOPL.  We need to context
switch it manually.

I'm doing this without going through paravirt because this is
specific to Xen PV.  After the dust settles, we can merge this with
the 32-bit code, tidy up the iopl syscall implementation, and remove
the set_iopl pvop entirely.

Fixes XSA-171.

Reviewewd-by: Jan Beulich <JBeulich@suse.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jan Beulich <JBeulich@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/693c3bd7aeb4d3c27c92c622b7d0f554a458173c.1458162709.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-12 09:08:38 -07:00
Dave Jones
1eeb322585 x86/apic: Fix suspicious RCU usage in smp_trace_call_function_interrupt()
commit 7834c10313fb823e538f2772be78edcdeed2e6e3 upstream.

Since 4.4, I've been able to trigger this occasionally:

===============================
[ INFO: suspicious RCU usage. ]
4.5.0-rc7-think+ #3 Not tainted
Cc: Andi Kleen <ak@linux.intel.com>
Link: http://lkml.kernel.org/r/20160315012054.GA17765@codemonkey.org.uk
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

-------------------------------
./arch/x86/include/asm/msr-trace.h:47 suspicious rcu_dereference_check() usage!

other info that might help us debug this:

RCU used illegally from idle CPU!
rcu_scheduler_active = 1, debug_locks = 1
RCU used illegally from extended quiescent state!
no locks held by swapper/3/0.

stack backtrace:
CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.5.0-rc7-think+ #3
 ffffffff92f821e0 1f3e5c340597d7fc ffff880468e07f10 ffffffff92560c2a
 ffff880462145280 0000000000000001 ffff880468e07f40 ffffffff921376a6
 ffffffff93665ea0 0000cc7c876d28da 0000000000000005 ffffffff9383dd60
Call Trace:
 <IRQ>  [<ffffffff92560c2a>] dump_stack+0x67/0x9d
 [<ffffffff921376a6>] lockdep_rcu_suspicious+0xe6/0x100
 [<ffffffff925ae7a7>] do_trace_write_msr+0x127/0x1a0
 [<ffffffff92061c83>] native_apic_msr_eoi_write+0x23/0x30
 [<ffffffff92054408>] smp_trace_call_function_interrupt+0x38/0x360
 [<ffffffff92d1ca60>] trace_call_function_interrupt+0x90/0xa0
 <EOI>  [<ffffffff92ac5124>] ? cpuidle_enter_state+0x1b4/0x520

Move the entering_irq() call before ack_APIC_irq(), because entering_irq()
tells the RCU susbstems to end the extended quiescent state, so that the
following trace call in ack_APIC_irq() works correctly.

Suggested-by: Andi Kleen <ak@linux.intel.com>
Fixes: 4787c368a9 "x86/tracing: Add irq_enter/exit() in smp_trace_reschedule_interrupt()"
Signed-off-by: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-04-12 09:08:38 -07:00
Thomas Gleixner
dc1441612f x86/irq: Cure live lock in fixup_irqs()
commit 551adc60573cb68e3d55cacca9ba1b7437313df7 upstream.

Harry reported, that he's able to trigger a system freeze with cpu hot
unplug. The freeze turned out to be a live lock caused by recent changes in
irq_force_complete_move().

When fixup_irqs() and from there irq_force_complete_move() is called on the
dying cpu, then all other cpus are in stop machine an wait for the dying cpu
to complete the teardown. If there is a move of an interrupt pending then
irq_force_complete_move() sends the cleanup IPI to the cpus in the old_domain
mask and waits for them to clear the mask. That's obviously impossible as
those cpus are firmly stuck in stop machine with interrupts disabled.

I should have known that, but I completely overlooked it being concentrated on
the locking issues around the vectors. And the existance of the call to
__irq_complete_move() in the code, which actually sends the cleanup IPI made
it reasonable to wait for that cleanup to complete. That call was bogus even
before the recent changes as it was just a pointless distraction.

We have to look at two cases:

1) The move_in_progress flag of the interrupt is set

   This means the ioapic has been updated with the new vector, but it has not
   fired yet. In theory there is a race:

   set_ioapic(new_vector) <-- Interrupt is raised before update is effective,
   			      i.e. it's raised on the old vector.

   So if the target cpu cannot handle that interrupt before the old vector is
   cleaned up, we get a spurious interrupt and in the worst case the ioapic
   irq line becomes stale, but my experiments so far have only resulted in
   spurious interrupts.

   But in case of cpu hotplug this should be a non issue because if the
   affinity update happens right before all cpus rendevouz in stop machine,
   there is no way that the interrupt can be blocked on the target cpu because
   all cpus loops first with interrupts enabled in stop machine, so the old
   vector is not yet cleaned up when the interrupt fires.

   So the only way to run into this issue is if the delivery of the interrupt
   on the apic/system bus would be delayed beyond the point where the target
   cpu disables interrupts in stop machine. I doubt that it can happen, but at
   least there is a theroretical chance. Virtualization might be able to
   expose this, but AFAICT the IOAPIC emulation is not as stupid as the real
   hardware.

   I've spent quite some time over the weekend to enforce that situation,
   though I was not able to trigger the delayed case.

2) The move_in_progress flag is not set and the old_domain cpu mask is not
   empty.

   That means, that an interrupt was delivered after the change and the
   cleanup IPI has been sent to the cpus in old_domain, but not all CPUs have
   responded to it yet.

In both cases we can assume that the next interrupt will arrive on the new
vector, so we can cleanup the old vectors on the cpus in the old_domain cpu
mask.

Fixes: 98229aa36caa "x86/irq: Plug vector cleanup race"
Reported-by: Harry Junior <harryjr@outlook.fr>
Tested-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Joe Lawrence <joe.lawrence@stratus.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Ben Hutchings <ben@decadent.org.uk>
Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1603140931430.3657@nanos
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-12 09:08:37 -07:00
Stephane Eranian
f6724209df perf/x86/intel: Add definition for PT PMI bit
commit 5690ae28e472d25e330ad0c637a5cea3fc39fb32 upstream.

This patch adds a definition for GLOBAL_OVFL_STATUS bit 55
which is used with the Processor Trace (PT) feature.

Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: adrian.hunter@intel.com
Cc: kan.liang@intel.com
Cc: namhyung@kernel.org
Link: http://lkml.kernel.org/r/1457034642-21837-2-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-12 09:08:33 -07:00
Borislav Petkov
5aeaf8bd6c x86/microcode: Untangle from BLK_DEV_INITRD
commit 5f9c01aa7c49a2d74474d6d879a797b8badf29e6 upstream.

Thomas Voegtle reported that doing oldconfig with a .config which has
CONFIG_MICROCODE enabled but BLK_DEV_INITRD disabled prevents the
microcode loading mechanism from being built.

So untangle it from the BLK_DEV_INITRD dependency so that oldconfig
doesn't turn it off and add an explanatory text to its Kconfig help what
the supported methods for supplying microcode are.

Reported-by: Thomas Voegtle <tv@lio96.de>
Tested-by: Thomas Voegtle <tv@lio96.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1454499225-21544-2-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-12 09:08:32 -07:00
Thomas Gleixner
4f45a0edf3 x86/irq: Call irq_force_move_complete with irq descriptor
commit 90a2282e23f0522e4b3f797ad447c5e91bf7fe32 upstream.

First of all there is no point in looking up the irq descriptor again, but we
also need the descriptor for the final cleanup race fix in the next
patch. Make that change seperate. No functional difference.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Borislav Petkov <bp@alien8.de>
Tested-by: Joe Lawrence <joe.lawrence@stratus.com>
Cc: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Jeremiah Mahler <jmmahler@gmail.com>
Cc: andy.shevchenko@gmail.com
Cc: Guenter Roeck <linux@roeck-us.net>
Link: http://lkml.kernel.org/r/20151231160107.125211743@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-03 15:07:31 -08:00
Jan Beulich
75a101ba31 x86/mm: Fix types used in pgprot cacheability flags translations
commit 3625c2c234ef66acf21a72d47a5ffa94f6c5ebf2 upstream.

For PAE kernels "unsigned long" is not suitable to hold page protection
flags, since _PAGE_NX doesn't fit there. This is the reason for quite a
few W+X pages getting reported as insecure during boot (observed namely
for the entire initrd range).

Fixes: 281d4078be ("x86: Make page cache mode a real type")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Juergen Gross <JGross@suse.com>
Link: http://lkml.kernel.org/r/56A7635602000078000CAFF1@prv-mh.provo.novell.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-02-25 12:01:13 -08:00
Andy Lutomirski
0dec73176d x86/mm: Improve switch_mm() barrier comments
commit 4eaffdd5a5fe6ff9f95e1ab4de1ac904d5e0fa8b upstream.

My previous comments were still a bit confusing and there was a
typo. Fix it up.

Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 71b3c126e611 ("x86/mm: Add barriers and document switch_mm()-vs-flush synchronization")
Link: http://lkml.kernel.org/r/0a0b43cdcdd241c5faaaecfbcc91a155ddedc9a1.1452631609.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-01-31 11:28:55 -08:00
Andy Lutomirski
43a2ba8c1a x86/mm: Add barriers and document switch_mm()-vs-flush synchronization
commit 71b3c126e61177eb693423f2e18a1914205b165e upstream.

When switch_mm() activates a new PGD, it also sets a bit that
tells other CPUs that the PGD is in use so that TLB flush IPIs
will be sent.  In order for that to work correctly, the bit
needs to be visible prior to loading the PGD and therefore
starting to fill the local TLB.

Document all the barriers that make this work correctly and add
a couple that were missing.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-01-31 11:28:54 -08:00
H.J. Lu
966ae5ea35 x86/boot: Double BOOT_HEAP_SIZE to 64KB
commit 8c31902cffc4d716450be549c66a67a8a3dd479c upstream.

When decompressing kernel image during x86 bootup, malloc memory
for ELF program headers may run out of heap space, which leads
to system halt.  This patch doubles BOOT_HEAP_SIZE to 64KB.

Tested with 32-bit kernel which failed to boot without this patch.

Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-01-31 11:28:54 -08:00
David Vrabel
d8c98a1d14 x86/paravirt: Prevent rtc_cmos platform device init on PV guests
Adding the rtc platform device in non-privileged Xen PV guests causes
an IRQ conflict because these guests do not have legacy PIC and may
allocate irqs in the legacy range.

In a single VCPU Xen PV guest we should have:

/proc/interrupts:
           CPU0
  0:       4934  xen-percpu-virq      timer0
  1:          0  xen-percpu-ipi       spinlock0
  2:          0  xen-percpu-ipi       resched0
  3:          0  xen-percpu-ipi       callfunc0
  4:          0  xen-percpu-virq      debug0
  5:          0  xen-percpu-ipi       callfuncsingle0
  6:          0  xen-percpu-ipi       irqwork0
  7:        321   xen-dyn-event     xenbus
  8:         90   xen-dyn-event     hvc_console
  ...

But hvc_console cannot get its interrupt because it is already in use
by rtc0 and the console does not work.

  genirq: Flags mismatch irq 8. 00000000 (hvc_console) vs. 00000000 (rtc0)

We can avoid this problem by realizing that unprivileged PV guests (both
Xen and lguests) are not supposed to have rtc_cmos device and so
adding it is not necessary.

Privileged guests (i.e. Xen's dom0) do use it but they should not have
irq conflicts since they allocate irqs above legacy range (above
gsi_top, in fact).

Instead of explicitly testing whether the guest is privileged we can
extend pv_info structure to include information about guest's RTC
support.

Reported-and-tested-by: Sander Eikelenboom <linux@eikelenboom.it>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: vkuznets@redhat.com
Cc: xen-devel@lists.xenproject.org
Cc: konrad.wilk@oracle.com
Cc: stable@vger.kernel.org # 4.2+
Link: http://lkml.kernel.org/r/1449842873-2613-1-git-send-email-boris.ostrovsky@oracle.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-12-19 21:35:13 +01:00
Boris Ostrovsky
91e2eea98f x86/xen: Avoid fast syscall path for Xen PV guests
After 32-bit syscall rewrite, and specifically after commit:

  5f310f739b ("x86/entry/32: Re-implement SYSENTER using the new C path")

... the stack frame that is passed to xen_sysexit is no longer a
"standard" one (i.e. it's not pt_regs).

Since we end up calling xen_iret from xen_sysexit we don't need
to fix up the stack and instead follow entry_SYSENTER_32's IRET
path directly to xen_iret.

We can do the same thing for compat mode even though stack does
not need to be fixed. This will allow us to drop usergs_sysret32
paravirt op (in the subsequent patch)

Suggested-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Acked-by: Andy Lutomirski <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: david.vrabel@citrix.com
Cc: konrad.wilk@oracle.com
Cc: virtualization@lists.linux-foundation.org
Cc: xen-devel@lists.xenproject.org
Link: http://lkml.kernel.org/r/1447970147-1733-2-git-send-email-boris.ostrovsky@oracle.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-12-19 09:55:52 +01:00
Kirill A. Shutemov
70f1528747 x86/mm: Fix regression with huge pages on PAE
Recent PAT patchset has caused issue on 32-bit PAE machines:

  page:eea45000 count:0 mapcount:-128 mapping:  (null) index:0x0 flags: 0x40000000()
  page dumped because: VM_BUG_ON_PAGE(page_mapcount(page) < 0)
  ------------[ cut here ]------------
  kernel BUG at /home/build/linux-boris/mm/huge_memory.c:1485!
  invalid opcode: 0000 [#1] SMP
  [...]
  Call Trace:
   unmap_single_vma
   ? __wake_up
   unmap_vmas
   unmap_region
   do_munmap
   vm_munmap
   SyS_munmap
   do_fast_syscall_32
   ? __do_page_fault
   sysenter_past_esp
  Code: ...
  EIP: [<c11bde80>] zap_huge_pmd+0x240/0x260 SS:ESP 0068:f6459d98

The problem is in pmd_pfn_mask() and pmd_flags_mask(). These
helpers use PMD_PAGE_MASK to calculate resulting mask.
PMD_PAGE_MASK is 'unsigned long', not 'unsigned long long' as
phys_addr_t is on 32-bit PAE (ARCH_PHYS_ADDR_T_64BIT). As a
result, the upper bits of resulting mask get truncated.

pud_pfn_mask() and pud_flags_mask() aren't problematic since we
don't have PUD page table level on 32-bit systems, but it's
reasonable to keep them consistent with PMD counterpart.

Introduce PHYSICAL_PMD_PAGE_MASK and PHYSICAL_PUD_PAGE_MASK in
addition to existing PHYSICAL_PAGE_MASK and reworks helpers to
use them.

Reported-and-Tested-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
[ Fix -Woverflow warnings from the realmode code. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Toshi Kani <toshi.kani@hpe.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jürgen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: elliott@hpe.com
Cc: konrad.wilk@oracle.com
Cc: linux-mm <linux-mm@kvack.org>
Fixes: f70abb0fc3 ("x86/asm: Fix pud/pmd interfaces to handle large PAT bit")
Link: http://lkml.kernel.org/r/1448878233-11390-2-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-12-04 09:14:27 +01:00
Rafael J. Wysocki
d9f67dbc0f Merge branch 'pm-tools'
* pm-tools:
  x86: remove unused definition of MSR_NHM_PLATFORM_INFO
  tools/power turbostat: use new name for MSR_PLATFORM_INFO
2015-11-16 22:57:02 +01:00
Linus Torvalds
bba072dfd7 Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:
 "A couple of fixes and updates related to x86:

   - Fix the W+X check regression on XEN

   - The real fix for the low identity map trainwreck

   - Probe legacy PIC early instead of unconditionally allocating legacy
     irqs

   - Add cpu verification to long mode entry

   - Adjust the cache topology to AMD Fam17H systems

   - Let Merrifield use the TSC across S3"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/cpu: Call verify_cpu() after having entered long mode too
  x86/setup: Fix low identity map for >= 2GB kernel range
  x86/mm: Skip the hypervisor range when walking PGD
  x86/AMD: Fix last level cache topology for AMD Fam17h systems
  x86/irq: Probe for PIC presence before allocating descs for legacy IRQs
  x86/cpu/intel: Enable X86_FEATURE_NONSTOP_TSC_S3 for Merrifield
2015-11-15 09:32:59 -08:00
Len Brown
5369a21e3f x86: remove unused definition of MSR_NHM_PLATFORM_INFO
MSR_NHM_PLATFORM_INFO has been replaced by...
MSR_PLATFORM_INFO

Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2015-11-13 23:28:01 +01:00
Linus Torvalds
3370b69eb0 Four changes:
- x86: work around two nasty cases where a benign exception occurs while
 another is being delivered.  The endless stream of exceptions causes an
 infinite loop in the processor, which not even NMIs or SMIs can interrupt;
 in the virt case, there is no possibility to exit to the host either.
 
 - x86: support for Skylake per-guest TSC rate.  Long supported by AMD,
 the patches mostly move things from there to common arch/x86/kvm/ code.
 
 - generic: remove local_irq_save/restore from the guest entry and exit
 paths when context tracking is enabled.  The patches are a few months
 old, but we discussed them again at kernel summit.  Andy will pick up
 from here and, in 4.5, try to remove it from the user entry/exit paths.
 
 - PPC: Two bug fixes, see merge commit 370289756b for details.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQEcBAABAgAGBQJWRFb0AAoJEL/70l94x66DjjMH/31jr8d119MW0uv2x+03+wRq
 6dbJ8tjQ8grvBRExKvLsUVjDmHlhCa1BQl5qjCsyYhX9UeAf4NQOmoEFpq+YTLxh
 Ctveyn+yiZWC7qxbQDmauiQ4JCOp+W9ial782iqw5+ouQMajGOffq5WrojCa2ZNF
 jI278JgdHJLrKj/uie//WBu3V7MJY5Apc3p4zatnSYFSQ3MA0sxl4r4zIrwOa5qs
 23ZeeoqbP4sHh4X5wL/30Y6XFSCHj0qoYHHyAgzLi0PCMvBdt4DrAFUPDG/Rhlv6
 o1WB/kcUfcz3DtBX85wfSOMuw0nF6patWhWv07R/3EIbYoz3dKvp9d6ORYgXqlY=
 =Um9M
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull second batch of kvm updates from Paolo Bonzini:
 "Four changes:

   - x86: work around two nasty cases where a benign exception occurs
     while another is being delivered.  The endless stream of exceptions
     causes an infinite loop in the processor, which not even NMIs or
     SMIs can interrupt; in the virt case, there is no possibility to
     exit to the host either.

   - x86: support for Skylake per-guest TSC rate.  Long supported by
     AMD, the patches mostly move things from there to common
     arch/x86/kvm/ code.

   - generic: remove local_irq_save/restore from the guest entry and
     exit paths when context tracking is enabled.  The patches are a few
     months old, but we discussed them again at kernel summit.  Andy
     will pick up from here and, in 4.5, try to remove it from the user
     entry/exit paths.

   - PPC: Two bug fixes, see merge commit 370289756b for details"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (21 commits)
  KVM: x86: rename update_db_bp_intercept to update_bp_intercept
  KVM: svm: unconditionally intercept #DB
  KVM: x86: work around infinite loop in microcode when #AC is delivered
  context_tracking: avoid irq_save/irq_restore on guest entry and exit
  context_tracking: remove duplicate enabled check
  KVM: VMX: Dump TSC multiplier in dump_vmcs()
  KVM: VMX: Use a scaled host TSC for guest readings of MSR_IA32_TSC
  KVM: VMX: Setup TSC scaling ratio when a vcpu is loaded
  KVM: VMX: Enable and initialize VMX TSC scaling
  KVM: x86: Use the correct vcpu's TSC rate to compute time scale
  KVM: x86: Move TSC scaling logic out of call-back read_l1_tsc()
  KVM: x86: Move TSC scaling logic out of call-back adjust_tsc_offset()
  KVM: x86: Replace call-back compute_tsc_offset() with a common function
  KVM: x86: Replace call-back set_tsc_khz() with a common function
  KVM: x86: Add a common TSC scaling function
  KVM: x86: Add a common TSC scaling ratio field in kvm_vcpu_arch
  KVM: x86: Collect information for setting TSC scaling ratio
  KVM: x86: declare a few variables as __read_mostly
  KVM: x86: merge handle_mmio_page_fault and handle_mmio_page_fault_common
  KVM: PPC: Book3S HV: Don't dynamically split core when already split
  ...
2015-11-12 14:34:06 -08:00
Paolo Bonzini
a96036b8ef KVM: x86: rename update_db_bp_intercept to update_bp_intercept
Because #DB is now intercepted unconditionally, this callback
only operates on #BP for both VMX and SVM.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-11-10 12:06:25 +01:00
Haozhong Zhang
64903d6195 KVM: VMX: Enable and initialize VMX TSC scaling
This patch exhances kvm-intel module to enable VMX TSC scaling and
collects information of TSC scaling ratio during initialization.

Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-11-10 12:06:19 +01:00
Haozhong Zhang
4ba76538dd KVM: x86: Move TSC scaling logic out of call-back read_l1_tsc()
Both VMX and SVM scales the host TSC in the same way in call-back
read_l1_tsc(), so this patch moves the scaling logic from call-back
read_l1_tsc() to a common function kvm_read_l1_tsc().

Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-11-10 12:06:18 +01:00
Haozhong Zhang
58ea676787 KVM: x86: Move TSC scaling logic out of call-back adjust_tsc_offset()
For both VMX and SVM, if the 2nd argument of call-back
adjust_tsc_offset() is the host TSC, then adjust_tsc_offset() will scale
it first. This patch moves this common TSC scaling logic to its caller
adjust_tsc_offset_host() and rename the call-back adjust_tsc_offset() to
adjust_tsc_offset_guest().

Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-11-10 12:06:17 +01:00
Haozhong Zhang
07c1419a32 KVM: x86: Replace call-back compute_tsc_offset() with a common function
Both VMX and SVM calculate the tsc-offset in the same way, so this
patch removes the call-back compute_tsc_offset() and replaces it with a
common function kvm_compute_tsc_offset().

Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-11-10 12:06:16 +01:00
Haozhong Zhang
381d585c80 KVM: x86: Replace call-back set_tsc_khz() with a common function
Both VMX and SVM propagate virtual_tsc_khz in the same way, so this
patch removes the call-back set_tsc_khz() and replaces it with a common
function.

Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-11-10 12:06:16 +01:00