commit f4eafd8bcd5229e998aa252627703b8462c3b90f upstream.
A kernel page fault oops with the callstack below was observed
when a read syscall was made to a pmem device after a huge amount
(>512GB) of vmalloc ranges was allocated by ioremap() on a x86_64
system:
BUG: unable to handle kernel paging request at ffff880840000ff8
IP: vmalloc_fault+0x1be/0x300
PGD c7f03a067 PUD 0
Oops: 0000 [#1] SM
Call Trace:
__do_page_fault+0x285/0x3e0
do_page_fault+0x2f/0x80
? put_prev_entity+0x35/0x7a0
page_fault+0x28/0x30
? memcpy_erms+0x6/0x10
? schedule+0x35/0x80
? pmem_rw_bytes+0x6a/0x190 [nd_pmem]
? schedule_timeout+0x183/0x240
btt_log_read+0x63/0x140 [nd_btt]
:
? __symbol_put+0x60/0x60
? kernel_read+0x50/0x80
SyS_finit_module+0xb9/0xf0
entry_SYSCALL_64_fastpath+0x1a/0xa4
Since v4.1, ioremap() supports large page (pud/pmd) mappings in
x86_64 and PAE. vmalloc_fault() however assumes that the vmalloc
range is limited to pte mappings.
vmalloc faults do not normally happen in ioremap'd ranges since
ioremap() sets up the kernel page tables, which are shared by
user processes. pgd_ctor() sets the kernel's PGD entries to
user's during fork(). When allocation of the vmalloc ranges
crosses a 512GB boundary, ioremap() allocates a new pud table
and updates the kernel PGD entry to point it. If user process's
PGD entry does not have this update yet, a read/write syscall
to the range will cause a vmalloc fault, which hits the Oops
above as it does not handle a large page properly.
Following changes are made to vmalloc_fault().
64-bit:
- No change for the PGD sync operation as it handles large
pages already.
- Add pud_huge() and pmd_huge() to the validation code to
handle large pages.
- Change pud_page_vaddr() to pud_pfn() since an ioremap range
is not directly mapped (while the if-statement still works
with a bogus addr).
- Change pmd_page() to pmd_pfn() since an ioremap range is not
backed by struct page (while the if-statement still works
with a bogus addr).
32-bit:
- No change for the sync operation since the index3 PGD entry
covers the entire vmalloc range, which is always valid.
(A separate change to sync PGD entry is necessary if this
memory layout is changed regardless of the page size.)
- Add pmd_huge() to the validation code to handle large pages.
This is for completeness since vmalloc_fault() won't happen
in ioremap'd ranges as its PGD entry is always valid.
Reported-by: Henning Schild <henning.schild@siemens.com>
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Acked-by: Borislav Petkov <bp@alien8.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: linux-mm@kvack.org
Cc: linux-nvdimm@lists.01.org
Link: http://lkml.kernel.org/r/1455758214-24623-1-git-send-email-toshi.kani@hpe.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit a82eee7424525e34e98d821dd059ce14560a1e35 upstream.
Data corruption issues were observed in tests which initiated
a system crash/reset while accessing BTT devices. This problem
is reproducible.
The BTT driver calls pmem_rw_bytes() to update data in pmem
devices. This interface calls __copy_user_nocache(), which
uses non-temporal stores so that the stores to pmem are
persistent.
__copy_user_nocache() uses non-temporal stores when a request
size is 8 bytes or larger (and is aligned by 8 bytes). The
BTT driver updates the BTT map table, which entry size is
4 bytes. Therefore, updates to the map table entries remain
cached, and are not written to pmem after a crash.
Change __copy_user_nocache() to use non-temporal store when
a request size is 4 bytes. The change extends the current
byte-copy path for a less-than-8-bytes request, and does not
add any overhead to the regular path.
Reported-and-tested-by: Micah Parrish <micah.parrish@hpe.com>
Reported-and-tested-by: Brian Boylston <brian.boylston@hpe.com>
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: linux-nvdimm@lists.01.org
Link: http://lkml.kernel.org/r/1455225857-12039-3-git-send-email-toshi.kani@hpe.com
[ Small readability edits. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 742563777e8da62197d6cb4b99f4027f59454735 upstream.
There are a couple of nasty truncation bugs lurking in the pageattr
code that can be triggered when mapping EFI regions, e.g. when we pass
a cpa->pgd pointer. Because cpa->numpages is a 32-bit value, shifting
left by PAGE_SHIFT will truncate the resultant address to 32-bits.
Viorel-Cătălin managed to trigger this bug on his Dell machine that
provides a ~5GB EFI region which requires 1236992 pages to be mapped.
When calling populate_pud() the end of the region gets calculated
incorrectly in the following buggy expression,
end = start + (cpa->numpages << PAGE_SHIFT);
And only 188416 pages are mapped. Next, populate_pud() gets invoked
for a second time because of the loop in __change_page_attr_set_clr(),
only this time no pages get mapped because shifting the remaining
number of pages (1048576) by PAGE_SHIFT is zero. At which point the
loop in __change_page_attr_set_clr() spins forever because we fail to
map progress.
Hitting this bug depends very much on the virtual address we pick to
map the large region at and how many pages we map on the initial run
through the loop. This explains why this issue was only recently hit
with the introduction of commit
a5caa209ba ("x86/efi: Fix boot crash by mapping EFI memmap
entries bottom-up at runtime, instead of top-down")
It's interesting to note that safe uses of cpa->numpages do exist in
the pageattr code. If instead of shifting ->numpages we multiply by
PAGE_SIZE, no truncation occurs because PAGE_SIZE is a UL value, and
so the result is unsigned long.
To avoid surprises when users try to convert very large cpa->numpages
values to addresses, change the data type from 'int' to 'unsigned
long', thereby making it suitable for shifting by PAGE_SHIFT without
any type casting.
The alternative would be to make liberal use of casting, but that is
far more likely to cause problems in the future when someone adds more
code and fails to cast properly; this bug was difficult enough to
track down in the first place.
Reported-and-tested-by: Viorel-Cătălin Răpițeanu <rapiteanu.catalin@gmail.com>
Acked-by: Borislav Petkov <bp@alien8.de>
Cc: Sai Praneeth Prakhya <sai.praneeth.prakhya@intel.com>
Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=110131
Link: http://lkml.kernel.org/r/1454067370-10374-1-git-send-email-matt@codeblueprint.co.uk
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 3625c2c234ef66acf21a72d47a5ffa94f6c5ebf2 upstream.
For PAE kernels "unsigned long" is not suitable to hold page protection
flags, since _PAGE_NX doesn't fit there. This is the reason for quite a
few W+X pages getting reported as insecure during boot (observed namely
for the entire initrd range).
Fixes: 281d4078be ("x86: Make page cache mode a real type")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Juergen Gross <JGross@suse.com>
Link: http://lkml.kernel.org/r/56A7635602000078000CAFF1@prv-mh.provo.novell.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit cbe09bd51bf23b42c3a94c5fb6815e1397c5fc3f upstream.
This aligns the stack pointer in chacha20_4block_xor_ssse3 to 64 bytes.
Fixes general protection faults and potential kernel panics.
Signed-off-by: Eli Cooper <elicooper@gmx.com>
Acked-by: Martin Willi <martin@strongswan.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 4eaffdd5a5fe6ff9f95e1ab4de1ac904d5e0fa8b upstream.
My previous comments were still a bit confusing and there was a
typo. Fix it up.
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 71b3c126e611 ("x86/mm: Add barriers and document switch_mm()-vs-flush synchronization")
Link: http://lkml.kernel.org/r/0a0b43cdcdd241c5faaaecfbcc91a155ddedc9a1.1452631609.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 71b3c126e61177eb693423f2e18a1914205b165e upstream.
When switch_mm() activates a new PGD, it also sets a bit that
tells other CPUs that the PGD is in use so that TLB flush IPIs
will be sent. In order for that to work correctly, the bit
needs to be visible prior to loading the PGD and therefore
starting to fill the local TLB.
Document all the barriers that make this work correctly and add
a couple that were missing.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 8c31902cffc4d716450be549c66a67a8a3dd479c upstream.
When decompressing kernel image during x86 bootup, malloc memory
for ELF program headers may run out of heap space, which leads
to system halt. This patch doubles BOOT_HEAP_SIZE to 64KB.
Tested with 32-bit kernel which failed to boot without this patch.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 2f0c0b2d96b1205efb14347009748d786c2d9ba5 upstream.
Without the reboot=pci method, the iMac 10,1 simply
hangs after printing "Restarting system" at the point
when it should reboot. This fixes it.
Signed-off-by: Mario Kleiner <mario.kleiner.de@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Jones <davej@codemonkey.org.uk>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1450466646-26663-1-git-send-email-mario.kleiner.de@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 45bdbcfdf241149642fb6c25ab0c209d59c371b7 upstream.
vmx_cpuid_tries to update SECONDARY_VM_EXEC_CONTROL in the VMCS, but
it will cause a vmwrite error on older CPUs because the code does not
check for the presence of CPU_BASED_ACTIVATE_SECONDARY_CONTROLS.
This will get rid of the following trace on e.g. Core2 6600:
vmwrite error: reg 401e value 10 (err 12)
Call Trace:
[<ffffffff8116e2b9>] dump_stack+0x40/0x57
[<ffffffffa020b88d>] vmx_cpuid_update+0x5d/0x150 [kvm_intel]
[<ffffffffa01d8fdc>] kvm_vcpu_ioctl_set_cpuid2+0x4c/0x70 [kvm]
[<ffffffffa01b8363>] kvm_arch_vcpu_ioctl+0x903/0xfa0 [kvm]
Fixes: feda805fe7
Reported-by: Zdenek Kaspar <zkaspar82@gmail.com>
Signed-off-by: Huaitong Han <huaitong.han@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit aba2f06c070f604e388cf77b1dcc7f4cf4577eb0 upstream.
Poor #AC was so unimportant until a few days ago that we were
not even tracing its name correctly. But now it's all over
the place.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 9dbe6cf941a6fe82933aef565e4095fb10f65023 upstream.
If we do not do this, it is not properly saved and restored across
migration. Windows notices due to its self-protection mechanisms,
and is very upset about it (blue screen of death).
Cc: Radim Krcmar <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 6a1f513776b78c994045287073e55bae44ed9f8c upstream.
On a cancelled suspend the vcpu_info location does not change (it's
still in the per-cpu area registered by xen_vcpu_setup()). So do not
call xen_hvm_init_shared_info() which would make the kernel think its
back in the shared info. With the wrong vcpu_info, events cannot be
received and the domain will hang after a cancelled suspend.
Signed-off-by: Charles Ouyang <ouyangzhaowei@huawei.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
a patch found in your master branch but not yet in the kvm/next branch
that is destined for 4.5.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJWj+yRAAoJEL/70l94x66DulYH/0OGP+yIHDDFlBqtPRm6q0pr
r8pSVRPPd4GY2SOJDBsBvMmWphFSYKIoCTyMbFnikADHM2yh/pycwLU/uzCM5xQl
uABMsCUntwbGaKq+A4bOvsNO49ueRCkML4ToVuKNTeuEKRYfdnlj3XcAMMgsUfEF
QGz8W2cm9xPn69df91cfBuFLLFeQVv2XsjA5WpqzzvWy5HEs1F07aVh57TI4j8OF
eFdn3Lkes9Ync70KjEy2QKe2Su0EWjderE0oqAORKomwZFVCYv/Vg1wERJYsugg5
UyYCY2j1tKlycKYDnO47L1xoS9JgMHY05OsH08Sn/EXBjRjnEVwTyco5pGPmuNA=
=5Lst
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM fix from Paolo Bonzini:
"A simple fix. I'm sending it before the merge window, because it
refines a patch found in your master branch but not yet in the
kvm/next branch that is destined for 4.5"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
kvm: x86: only channel 0 of the i8254 is linked to the HPET
Pull x86 fixes from Ingo Molnar:
"A handful of x86 fixes:
- a syscall ABI fix, fixing an Android breakage
- a Xen PV guest fix relating to the RTC device, causing a
non-working console
- a Xen guest syscall stack frame fix
- an MCE hotplug CPU crash fix"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/numachip: Fix NumaConnect2 MMCFG PCI access
x86/entry: Restore traditional SYSENTER calling convention
x86/entry: Fix some comments
x86/paravirt: Prevent rtc_cmos platform device init on PV guests
x86/xen: Avoid fast syscall path for Xen PV guests
x86/mce: Ensure offline CPUs don't participate in rendezvous process
While setting the KVM PIT counters in 'kvm_pit_load_count', if
'hpet_legacy_start' is set, the function disables the timer on
channel[0], instead of the respective index 'channel'. This is
because channels 1-3 are not linked to the HPET. Fix the caller
to only activate the special HPET processing for channel 0.
Reported-by: P J P <pjp@fedoraproject.org>
Fixes: 0185604c2d
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The MMCFG PCI accessors weren't being setup for NumacConnect2
correctly due to over-early assignment; this would create the
potential for the wrong PCI domain to be accessed.
Fix this by using the correct arch-specific PCI init function.
Signed-off-by: Daniel J Blueman <daniel@numascale.com>
Acked-by: Steffen Persvold <sp@numascale.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1451498807-15920-1-git-send-email-daniel@numascale.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Fix the build warning:
arch/x86/xen/suspend.c: In function 'xen_arch_pre_suspend':
arch/x86/xen/suspend.c:70:9: error: implicit declaration of function 'xen_pv_domain' [-Werror=implicit-function-declaration]
if (xen_pv_domain())
^
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- A series of fixes to the MTRR emulation, tested in the BZ by several users
so they should be safe this late
- A fix for a division by zero
- Two very simple ARM and PPC fixes
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJWeV/6AAoJEL/70l94x66DqzsH/05YnLi2GsX5WeZHMfIUgzgT
S/GoIkA7A4E2eXVoGg824MWppSViUzZkWgYFQTG4+KY9WPXzm9z2ij7DIlUHCD6n
QfevgQx1kIu1obyhm6bYM2xUdM3f7NCsQgw9bXZObB0ay+b/+GjR9/RbCbx60EO5
K1P+kveK6PFlS9/hc0PLztu6WkPV9BCO1RJUbeAEdnrMbpuQfHC+coR7MHRCiv2V
iy8f1CqrGaO5YPm9/3GbdH1xMKew4OZShOxTXwtvUThdrLkks2c8sk6FoLzqkznH
LMHVIpkm4mrIgThZG7VqZMXOrWvBtsCt04Vr9MzCM6QetB02b/Uz0xKvMYx2kZQ=
=pmYz
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
- A series of fixes to the MTRR emulation, tested in the BZ by several
users so they should be safe this late
- A fix for a division by zero
- Two very simple ARM and PPC fixes
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: x86: Reload pit counters for all channels when restoring state
KVM: MTRR: treat memory as writeback if MTRR is disabled in guest CPUID
KVM: MTRR: observe maxphyaddr from guest CPUID, not host
KVM: MTRR: fix fixed MTRR segment look up
KVM: VMX: Fix host initiated access to guest MSR_TSC_AUX
KVM: arm/arm64: vgic: Fix kvm_vgic_map_is_active's dist check
kvm: x86: move tracepoints outside extended quiescent state
KVM: PPC: Book3S HV: Prohibit setting illegal transaction state in MSR
Fix a pointer cast typo introduced in v4.4-rc5 especially visible for
the i386 subarchitecture where it results in a kernel crash.
[ Also removed pointless cast as per Al Viro - Linus ]
Fixes: 8090bfd2bb ("um: Fix fpstate handling")
Signed-off-by: Mickaël Salaün <mic@digikod.net>
Cc: Jeff Dike <jdike@addtoit.com>
Acked-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently if userspace restores the pit counters with a count of 0
on channels 1 or 2 and the guest attempts to read the count on those
channels, then KVM will perform a mod of 0 and crash. This will ensure
that 0 values are converted to 65536 as per the spec.
This is CVE-2015-7513.
Signed-off-by: Andy Honig <ahonig@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Virtual machines can be run with CPUID such that there are no MTRRs.
In that case, the firmware will never enable MTRRs and it is obviously
undesirable to run the guest entirely with UC memory. Check out guest
CPUID, and use WB memory if MTRR do not exist.
Cc: qemu-stable@nongnu.org
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=107561
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Conversion of MTRRs to ranges used the maxphyaddr from the boot CPU.
This is wrong, because var_mtrr_range's mask variable then is discontiguous
(like FF00FFFF000, where the first run of 0s corresponds to the bits
between host and guest maxphyaddr). Instead always set up the masks
to be full 64-bit values---we know that the reserved bits at the top
are zero, and we can restore them when reading the MSR. This way
var_mtrr_range gets a mask that just works.
Fixes: a13842dc66
Cc: qemu-stable@nongnu.org
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=107561
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This fixes the slow-down of VM running with pci-passthrough, since some MTRR
range changed from MTRR_TYPE_WRBACK to MTRR_TYPE_UNCACHABLE. Memory in the
0K-640K range was incorrectly treated as uncacheable.
Fixes: f7bfb57b3e
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=107561
Cc: qemu-stable@nongnu.org
Signed-off-by: Alexis Dambricourt <alexis.dambricourt@gmail.com>
[Use correct BZ for "Fixes" annotation. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
It turns out that some Android versions hardcode the SYSENTER
calling convention. This is buggy and will cause problems no
matter what the kernel does. Nonetheless, we should try to
support it.
Credit goes to Linus for pointing out a clean way to handle
the SYSENTER/SYSCALL clobber differences while preserving
straightforward DWARF annotations.
I believe that the original offending Android commit was:
https://android.googlesource.com/platform%2Fbionic/+/7dc3684d7a2587e43e6d2a8e0e3f39bf759bd535
Reported-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-and-tested-by: Borislav Petkov <bp@alien8.de>
Cc: <mark.gross@intel.com>
Cc: Su Tao <tao.su@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: <frank.wang@intel.com>
Cc: <borun.fu@intel.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Mingwei Shi <mingwei.shi@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Adding the rtc platform device in non-privileged Xen PV guests causes
an IRQ conflict because these guests do not have legacy PIC and may
allocate irqs in the legacy range.
In a single VCPU Xen PV guest we should have:
/proc/interrupts:
CPU0
0: 4934 xen-percpu-virq timer0
1: 0 xen-percpu-ipi spinlock0
2: 0 xen-percpu-ipi resched0
3: 0 xen-percpu-ipi callfunc0
4: 0 xen-percpu-virq debug0
5: 0 xen-percpu-ipi callfuncsingle0
6: 0 xen-percpu-ipi irqwork0
7: 321 xen-dyn-event xenbus
8: 90 xen-dyn-event hvc_console
...
But hvc_console cannot get its interrupt because it is already in use
by rtc0 and the console does not work.
genirq: Flags mismatch irq 8. 00000000 (hvc_console) vs. 00000000 (rtc0)
We can avoid this problem by realizing that unprivileged PV guests (both
Xen and lguests) are not supposed to have rtc_cmos device and so
adding it is not necessary.
Privileged guests (i.e. Xen's dom0) do use it but they should not have
irq conflicts since they allocate irqs above legacy range (above
gsi_top, in fact).
Instead of explicitly testing whether the guest is privileged we can
extend pv_info structure to include information about guest's RTC
support.
Reported-and-tested-by: Sander Eikelenboom <linux@eikelenboom.it>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: vkuznets@redhat.com
Cc: xen-devel@lists.xenproject.org
Cc: konrad.wilk@oracle.com
Cc: stable@vger.kernel.org # 4.2+
Link: http://lkml.kernel.org/r/1449842873-2613-1-git-send-email-boris.ostrovsky@oracle.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
After 32-bit syscall rewrite, and specifically after commit:
5f310f739b ("x86/entry/32: Re-implement SYSENTER using the new C path")
... the stack frame that is passed to xen_sysexit is no longer a
"standard" one (i.e. it's not pt_regs).
Since we end up calling xen_iret from xen_sysexit we don't need
to fix up the stack and instead follow entry_SYSENTER_32's IRET
path directly to xen_iret.
We can do the same thing for compat mode even though stack does
not need to be fixed. This will allow us to drop usergs_sysret32
paravirt op (in the subsequent patch)
Suggested-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Acked-by: Andy Lutomirski <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: david.vrabel@citrix.com
Cc: konrad.wilk@oracle.com
Cc: virtualization@lists.linux-foundation.org
Cc: xen-devel@lists.xenproject.org
Link: http://lkml.kernel.org/r/1447970147-1733-2-git-send-email-boris.ostrovsky@oracle.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Intel's MCA implementation broadcasts MCEs to all CPUs on the
node. This poses a problem for offlined CPUs which cannot
participate in the rendezvous process:
Kernel panic - not syncing: Timeout: Not all CPUs entered broadcast exception handler
Kernel Offset: disabled
Rebooting in 100 seconds..
More specifically, Linux does a soft offline of a CPU when
writing a 0 to /sys/devices/system/cpu/cpuX/online, which
doesn't prevent the #MC exception from being broadcasted to that
CPU.
Ensure that offline CPUs don't participate in the MCE rendezvous
and clear the RIP valid status bit so that a second MCE won't
cause a shutdown.
Without the patch, mce_start() will increment mce_callin and
wait for all CPUs. Offlined CPUs should avoid participating in
the rendezvous process altogether.
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
[ Massage commit message. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Cc: <stable@vger.kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/1449742346-21470-2-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
- XSA-155 security fixes to backend drivers.
- XSA-157 security fixes to pciback.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJWdDrXAAoJEFxbo/MsZsTR3N0H/0Lvz6MWBARCje7livbz7nqE
PS0Bea+2yAfNhCDDiDlpV0lor8qlyfWDF6lGhLjItldAzahag3ZDKDf1Z/lcQvhf
3MwFOcOVZE8lLtvLT6LGnPuehi1Mfdi1Qk1/zQhPhsq6+FLPLT2y+whmBihp8mMh
C12f7KRg5r3U7eZXNB6MEtGA0RFrOp0lBdvsiZx3qyVLpezj9mIe0NueQqwY3QCS
xQ0fILp/x2EnZNZuzgghFTPRxMAx5ReOezgn9Rzvq4aThD+irz1y6ghkYN4rG2s2
tyYOTqBnjJEJEQ+wmYMhnfCwVvDffztG+uI9hqN31QFJiNB0xsjSWFCkDAWchiU=
=Argz
-----END PGP SIGNATURE-----
Merge tag 'for-linus-4.4-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen bug fixes from David Vrabel:
- XSA-155 security fixes to backend drivers.
- XSA-157 security fixes to pciback.
* tag 'for-linus-4.4-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen-pciback: fix up cleanup path when alloc fails
xen/pciback: Don't allow MSI-X ops if PCI_COMMAND_MEMORY is not set.
xen/pciback: For XEN_PCI_OP_disable_msi[|x] only disable if device has MSI(X) enabled.
xen/pciback: Do not install an IRQ handler for MSI interrupts.
xen/pciback: Return error on XEN_PCI_OP_enable_msix when device has MSI or MSI-X enabled
xen/pciback: Return error on XEN_PCI_OP_enable_msi when device has MSI or MSI-X enabled
xen/pciback: Save xen_pci_op commands before processing it
xen-scsiback: safely copy requests
xen-blkback: read from indirect descriptors only once
xen-blkback: only read request operation from shared ring once
xen-netback: use RING_COPY_REQUEST() throughout
xen-netback: don't use last request to determine minimum Tx credit
xen: Add RING_COPY_REQUEST()
xen/x86/pvh: Use HVM's flush_tlb_others op
xen: Resume PMU from non-atomic context
xen/events/fifo: Consume unprocessed events when a CPU dies
Pavel Machek reports a warning about W+X pages found in the "Persisent"
kmap area. After grepping for it (using the correct spelling), and not
finding it, I noticed how the debug printk was just misspelled. Fix it.
The actual mapping bug that Pavel reported is still open. It's
apparently a separate issue from the known EFI page tables, looks like
it's related to the HIGHMEM mappings.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The current handling of accesses to guest MSR_TSC_AUX returns error if
vcpu does not support rdtscp, though those accesses are initiated by
host. This can result in the reboot failure of some versions of
QEMU. This patch fixes this issue by passing those host initiated
accesses for further handling instead.
Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Using MMUEXT_TLB_FLUSH_MULTI doesn't buy us much since the hypervisor
will likely perform same IPIs as would have the guest.
More importantly, using MMUEXT_INVLPG_MULTI may not to invalidate the
guest's address on remote CPU (when, for example, VCPU from another guest
is running there).
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Suggested-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Pull uml fixes from Richard Weinberger:
"This contains various bug fixes, most of them are fall out from the
merge window"
* 'for-linus-4.4-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml:
um: fix returns without va_end
um: Fix fpstate handling
arch: um: fix error when linking vmlinux.
um: Fix get_signal() usage
Pull perf fixes from Ingo Molnar:
"This tree includes four core perf fixes for misc bugs, three fixes to
x86 PMU drivers, and two updates to old email addresses"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf: Do not send exit event twice
perf/x86/intel: Fix INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_NA macro
perf/x86/intel: Make L1D_PEND_MISS.FB_FULL not constrained on Haswell
perf: Fix PERF_EVENT_IOC_PERIOD deadlock
treewide: Remove old email address
perf/x86: Fix LBR call stack save/restore
perf: Update email address in MAINTAINERS
perf/core: Robustify the perf_cgroup_from_task() RCU checks
perf/core: Fix RCU problem with cgroup context switching code
Pull x86 fixes from Thoma Gleixner:
"Another round of fixes for x86:
- Move the initialization of the microcode driver to late_initcall to
make sure everything that init function needs is available.
- Make sure that lockdep knows about interrupts being off in the
entry code before calling into c-code.
- Undo the cpu hotplug init delay regression.
- Use the proper conditionals in the mpx instruction decoder.
- Fixup restart_syscall for x32 tasks.
- Fix the hugepage regression on PAE kernels which was introduced
with the latest PAT changes"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/signal: Fix restart_syscall number for x32 tasks
x86/mpx: Fix instruction decoder condition
x86/mm: Fix regression with huge pages on PAE
x86 smpboot: Re-enable init_udelay=0 by default on modern CPUs
x86/entry/64: Fix irqflag tracing wrt context tracking
x86/microcode: Initialize the driver late when facilities are up
We need to add rest of the flags to the constraint mask
instead of another INTEL_ARCH_EVENT_MASK, fixing a typo.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/1447061071-28085-1-git-send-email-jolsa@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When restarting a syscall with regs->ax == -ERESTART_RESTARTBLOCK,
regs->ax is assigned to a restart_syscall number. For x32 tasks, this
syscall number must have __X32_SYSCALL_BIT set, otherwise it will be
an x86_64 syscall number instead of a valid x32 syscall number. This
issue has been there since the introduction of x32.
Reported-by: strace/tests/restart_syscall.test
Reported-and-tested-by: Elvira Khabirova <lineprinter0@gmail.com>
Signed-off-by: Dmitry V. Levin <ldv@altlinux.org>
Cc: Elvira Khabirova <lineprinter0@gmail.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/20151130215436.GA25996@altlinux.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
MPX decodes instructions in order to tell which bounds register
was violated. Part of this decoding involves looking at the "REX
prefix" which is a special instrucion prefix used to retrofit
support for new registers in to old instructions.
The X86_REX_*() macros are defined to return actual bit values:
#define X86_REX_R(rex) ((rex) & 4)
*not* boolean values. However, the MPX code was checking for
them like they were booleans. This might have led to us
mis-decoding the "REX prefix" and giving false information out to
userspace about bounds violations. X86_REX_B() actually is bit 1,
so this is really only broken for the X86_REX_X() case.
Fix the conditionals up to tolerate the non-boolean values.
Fixes: fcc7ffd679 "x86, mpx: Decode MPX instruction to get bound violation information"
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: x86@kernel.org
Cc: Dave Hansen <dave@sr71.net>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/20151201003113.D800C1E0@viggo.jf.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Pull libnvdimm fixes from Dan Williams:
- NFIT parsing regression fixes from Linda. The nvdimm hot-add
implementation merged in 4.4-rc1 interpreted the specification in a
way that breaks actual HPE platforms. We are also closing the loop
with the ACPI Working Group to get this clarification added to the
spec.
- Andy pointed out that his laptop without nvdimm resources is loading
the e820-nvdimm module by default, fix that up to only load the
module when an e820-type-12 range is present.
* 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
nfit: Adjust for different _FIT and NFIT headers
nfit: Fix the check for a successful NFIT merge
nfit: Account for table size length variation
libnvdimm, e820: skip module loading when no type-12
Recent PAT patchset has caused issue on 32-bit PAE machines:
page:eea45000 count:0 mapcount:-128 mapping: (null) index:0x0 flags: 0x40000000()
page dumped because: VM_BUG_ON_PAGE(page_mapcount(page) < 0)
------------[ cut here ]------------
kernel BUG at /home/build/linux-boris/mm/huge_memory.c:1485!
invalid opcode: 0000 [#1] SMP
[...]
Call Trace:
unmap_single_vma
? __wake_up
unmap_vmas
unmap_region
do_munmap
vm_munmap
SyS_munmap
do_fast_syscall_32
? __do_page_fault
sysenter_past_esp
Code: ...
EIP: [<c11bde80>] zap_huge_pmd+0x240/0x260 SS:ESP 0068:f6459d98
The problem is in pmd_pfn_mask() and pmd_flags_mask(). These
helpers use PMD_PAGE_MASK to calculate resulting mask.
PMD_PAGE_MASK is 'unsigned long', not 'unsigned long long' as
phys_addr_t is on 32-bit PAE (ARCH_PHYS_ADDR_T_64BIT). As a
result, the upper bits of resulting mask get truncated.
pud_pfn_mask() and pud_flags_mask() aren't problematic since we
don't have PUD page table level on 32-bit systems, but it's
reasonable to keep them consistent with PMD counterpart.
Introduce PHYSICAL_PMD_PAGE_MASK and PHYSICAL_PUD_PAGE_MASK in
addition to existing PHYSICAL_PAGE_MASK and reworks helpers to
use them.
Reported-and-Tested-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
[ Fix -Woverflow warnings from the realmode code. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Toshi Kani <toshi.kani@hpe.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jürgen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: elliott@hpe.com
Cc: konrad.wilk@oracle.com
Cc: linux-mm <linux-mm@kvack.org>
Fixes: f70abb0fc3 ("x86/asm: Fix pud/pmd interfaces to handle large PAT bit")
Link: http://lkml.kernel.org/r/1448878233-11390-2-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Resuming PMU currently triggers a warning from ___might_sleep() (assuming
CONFIG_DEBUG_ATOMIC_SLEEP is set) when xen_pmu_init() allocates GFP_KERNEL
page because we are in state resembling atomic context.
Move resuming PMU to xen_arch_resume() which is called in regular context.
For symmetry move suspending PMU to xen_arch_suspend() as well.
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reported-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: <stable@vger.kernel.org> # 4.3
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Commit 4d6b4e69a2 ("x86/PCI/ACPI: Use common interface to support
PCI host bridge") converted x86 to use the common interface
acpi_pci_root_create, but the conversion missed on code piece in
arch/x86/pci/bus_numa.c, which causes regression on some legacy
AMD platforms as reported by Arthur Marsh <arthur.marsh@internode.on.net>.
The root causes is that acpi_pci_root_create() fails to insert
host bridge resources into iomem_resource/ioport_resource because
x86_pci_root_bus_resources() has already inserted those resources.
So change x86_pci_root_bus_resources() to not insert resources into
iomem_resource/ioport_resource.
Fixes: 4d6b4e69a2 ("x86/PCI/ACPI: Use common interface to support PCI host bridge")
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Reported-and-tested-by: Arthur Marsh <arthur.marsh@internode.on.net>
Reported-and-tested-by: Krzysztof Kolasa <kkolasa@winsoft.pl>
Reported-and-tested-by: Keith Busch <keith.busch@intel.com>
Reported-and-tested-by: Hans de Bruin <jmdebruin@xmsnet.nl>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
If there are no persistent memory ranges present then don't bother
creating the platform device. Otherwise, it loads the full libnvdimm
sub-system only to discover no resources present.
Reported-by: Andy Lutomirski <luto@amacapital.net>
Acked-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
commit f1ccd24931 allowed the cmdline "cpu_init_udelay=" to work
with all values, including the default of 10000.
But in setting the default of 10000, it over-rode the code that sets
the delay 0 on modern processors.
Also, tidy up use of INT/UINT.
Fixes: f1ccd24931 "x86/smpboot: Fix cpu_init_udelay=10000 corner case boot parameter misbehavior"
Reported-by: Shane <shrybman@teksavvy.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Cc: dparsons@brightdsl.net
Cc: stable@kernel.org
Link: http://lkml.kernel.org/r/9082eb809ef40dad02db714759c7aaf618c518d4.1448232494.git.len.brown@intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>