While perfmon2 is a sufficiently evil library (it pokes MSRs
directly) that breaking it is fair game, it's still useful, so we
might as well try to support it. This allows users to write 2 to
/sys/devices/cpu/rdpmc to disable all rdpmc protection so that hack
like perfmon2 can continue to work.
At some point, if perf_event becomes fast enough to replace
perfmon2, then this can go.
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Vince Weaver <vince@deater.net>
Cc: "hillf.zj" <hillf.zj@alibaba-inc.com>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/caac3c1c707dcca48ecbc35f4def21495856f479.1414190806.git.luto@amacapital.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We currently allow any process to use rdpmc. This significantly
weakens the protection offered by PR_TSC_DISABLED, and it could be
helpful to users attempting to exploit timing attacks.
Since we can't enable access to individual counters, use a very
coarse heuristic to limit access to rdpmc: allow access only when
a perf_event is mmapped. This protects seccomp sandboxes.
There is plenty of room to further tighen these restrictions. For
example, this allows rdpmc for any x86_pmu event, but it's only
useful for self-monitoring tasks.
As a side effect, cap_user_rdpmc will now be false for AMD uncore
events. This isn't a real regression, since .event_idx is disabled
for these events anyway for the time being. Whenever that gets
re-added, the cap_user_rdpmc code can be adjusted or refactored
accordingly.
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Vince Weaver <vince@deater.net>
Cc: "hillf.zj" <hillf.zj@alibaba-inc.com>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/a2bdb3cf3a1d70c26980d7c6dddfbaa69f3182bf.1414190806.git.luto@amacapital.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The code is correct, but only for a rather subtle reason. This
confused me for quite a while when I read switch_mm, so clarify the
code to avoid confusing other people, too.
TBH, I wouldn't be surprised if this code was only correct by
accident.
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Vince Weaver <vince@deater.net>
Cc: "hillf.zj" <hillf.zj@alibaba-inc.com>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/0db86397f968996fb772c443c251415b0b430ddd.1414190806.git.luto@amacapital.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Context switches and TLB flushes can change individual bits of CR4.
CR4 reads take several cycles, so store a shadow copy of CR4 in a
per-cpu variable.
To avoid wasting a cache line, I added the CR4 shadow to
cpu_tlbstate, which is already touched in switch_mm. The heaviest
users of the cr4 shadow will be switch_mm and __switch_to_xtra, and
__switch_to_xtra is called shortly after switch_mm during context
switch, so the cacheline is likely to be hot.
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Vince Weaver <vince@deater.net>
Cc: "hillf.zj" <hillf.zj@alibaba-inc.com>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/3a54dd3353fffbf84804398e00dfdc5b7c1afd7d.1414190806.git.luto@amacapital.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
CR4 manipulation was split, seemingly at random, between direct
(write_cr4) and using a helper (set/clear_in_cr4). Unfortunately,
the set_in_cr4 and clear_in_cr4 helpers also poke at the boot code,
which only a small subset of users actually wanted.
This patch replaces all cr4 access in functions that don't leave cr4
exactly the way they found it with new helpers cr4_set_bits,
cr4_clear_bits, and cr4_set_bits_and_update_boot.
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Vince Weaver <vince@deater.net>
Cc: "hillf.zj" <hillf.zj@alibaba-inc.com>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/495a10bdc9e67016b8fd3945700d46cfd5c12c2f.1414190806.git.luto@amacapital.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Rename CONFIG_LIVE_PATCHING to CONFIG_LIVEPATCH to make the naming of
the config and the code more consistent.
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reviewed-by: Jingoo Han <jg1.han@samsung.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Intel Moorestown platform support was removed few years ago. This is a follow
up which removes Moorestown specific code for the serial devices. It includes
mrst_max3110 and earlyprintk bits.
This was used on SFI (Medfield, Clovertrail) based platforms as well, though
new ones use normal serial interface for the console service.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: David Cohen <david.a.cohen@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Revert 7c6a98dfa1, given
that testing PIR is not necessary anymore.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This patch adds PML support in VMX. A new module parameter 'enable_pml' is added
to allow user to enable/disable it manually.
Signed-off-by: Kai Huang <kai.huang@linux.intel.com>
Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This patch adds new kvm_x86_ops dirty logging hooks to enable/disable dirty
logging for particular memory slot, and to flush potentially logged dirty GPAs
before reporting slot->dirty_bitmap to userspace.
kvm x86 common code calls these hooks when they are available so PML logic can
be hidden to VMX specific. SVM won't be impacted as these hooks remain NULL
there.
Signed-off-by: Kai Huang <kai.huang@linux.intel.com>
Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This patch changes the second parameter of kvm_mmu_slot_remove_write_access from
'slot id' to 'struct kvm_memory_slot *' to align with kvm_x86_ops dirty logging
hooks, which will be introduced in further patch.
Better way is to change second parameter of kvm_arch_commit_memory_region from
'struct kvm_userspace_memory_region *' to 'struct kvm_memory_slot * new', but it
requires changes on other non-x86 ARCH too, so avoid it now.
Signed-off-by: Kai Huang <kai.huang@linux.intel.com>
Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This patch adds new mmu layer functions to clear/set D-bit for memory slot, and
to write protect superpages for memory slot.
In case of PML, CPU logs the dirty GPA automatically to PML buffer when CPU
updates D-bit from 0 to 1, therefore we don't have to write protect 4K pages,
instead, we only need to clear D-bit in order to log that GPA.
For superpages, we still write protect it and let page fault code to handle
dirty page logging, as we still need to split superpage to 4K pages in PML.
As PML is always enabled during guest's lifetime, to eliminate unnecessary PML
GPA logging, we set D-bit manually for the slot with dirty logging disabled.
Signed-off-by: Kai Huang <kai.huang@linux.intel.com>
Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The new hw_breakpoint bits are now ready for v3.20, merge them
into the main branch, to avoid conflicts.
Conflicts:
tools/perf/Documentation/perf-record.txt
Signed-off-by: Ingo Molnar <mingo@kernel.org>
of this is an IST rework. When an IST exception interrupts user
space, we will handle it on the per-thread kernel stack instead of
on the IST stack. This sounds messy, but it actually simplifies the
IST entry/exit code, because it eliminates some ugly games we used
to play in order to handle rescheduling, signal delivery, etc on the
way out of an IST exception.
The IST rework introduces proper context tracking to IST exception
handlers. I haven't seen any bug reports, but the old code could
have incorrectly treated an IST exception handler as an RCU extended
quiescent state.
The memory failure change (included in this pull request with
Borislav and Tony's permission) eliminates a bunch of code that
is no longer needed now that user memory failure handlers are
called in process context.
Finally, this includes a few on Denys' uncontroversial and Obviously
Correct (tm) cleanups.
The IST and memory failure changes have been in -next for a while.
LKML references:
IST rework:
http://lkml.kernel.org/r/cover.1416604491.git.luto@amacapital.net
Memory failure change:
http://lkml.kernel.org/r/54ab2ffa301102cd6e@agluck-desk.sc.intel.com
Denys' cleanups:
http://lkml.kernel.org/r/1420927210-19738-1-git-send-email-dvlasenk@redhat.com
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJUtvkFAAoJEK9N98ZeDfrkcfsIAJxZ0UBUCEDvulbqgk/iPGOa
fIpKLMowS7CpKtw6Wdc/YvAIkeHXWm1vU44Hj0TrjSrXCgVF8yCngs/xlXtOjoa1
dosXQqgqVJJ+hyui7chAEWyalLW7bEO8raq/6snhiMrhiuEkVKpEr7Fer4FVVCZL
4VALmNQQsbV+Qq4pXIhuagZC0Nt/XKi/+/cKvhS4p//q1F/TbHTz0FpDUrh0jPMh
18WFy0jWgxdkMRnSp/wJhekvdXX6PwUy5BdES9fjw8LQJZxxFpqN3Fe1kgfyzV0k
yuvEHw1hPt2aBGj3q69wQvDVyyn4OqMpRDBhk4S+GJYmVh7mFyFMN4BDMEy/EY8=
=LXVl
-----END PGP SIGNATURE-----
Merge tag 'pr-20150114-x86-entry' of git://git.kernel.org/pub/scm/linux/kernel/git/luto/linux into x86/asm
Pull x86/entry enhancements from Andy Lutomirski:
" This is my accumulated x86 entry work, part 1, for 3.20. The meat
of this is an IST rework. When an IST exception interrupts user
space, we will handle it on the per-thread kernel stack instead of
on the IST stack. This sounds messy, but it actually simplifies the
IST entry/exit code, because it eliminates some ugly games we used
to play in order to handle rescheduling, signal delivery, etc on the
way out of an IST exception.
The IST rework introduces proper context tracking to IST exception
handlers. I haven't seen any bug reports, but the old code could
have incorrectly treated an IST exception handler as an RCU extended
quiescent state.
The memory failure change (included in this pull request with
Borislav and Tony's permission) eliminates a bunch of code that
is no longer needed now that user memory failure handlers are
called in process context.
Finally, this includes a few on Denys' uncontroversial and Obviously
Correct (tm) cleanups.
The IST and memory failure changes have been in -next for a while.
LKML references:
IST rework:
http://lkml.kernel.org/r/cover.1416604491.git.luto@amacapital.net
Memory failure change:
http://lkml.kernel.org/r/54ab2ffa301102cd6e@agluck-desk.sc.intel.com
Denys' cleanups:
http://lkml.kernel.org/r/1420927210-19738-1-git-send-email-dvlasenk@redhat.com
"
This tree semantically depends on and is based on the following RCU commit:
734d168013 ("rcu: Make rcu_nmi_enter() handle nesting")
... and for that reason won't be pushed upstream before the RCU bits hit Linus's tree.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The scratch frame mappings for ballooned pages and the m2p override
are broken. Remove them in preparation for replacing them with
simpler mechanisms that works.
The scratch pages did not ensure that the page was not in use. In
particular, the foreign page could still be in use by hardware. If
the guest reused the frame the hardware could read or write that
frame.
The m2p override did not handle the same frame being granted by two
different grant references. Trying an M2P override lookup in this
case is impossible.
With the m2p override removed, the grant map/unmap for the kernel
mappings (for x86 PV) can be easily batched in
set_foreign_p2m_mapping() and clear_foreign_p2m_mapping().
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
When unmapping grants, instead of converting the kernel map ops to
unmap ops on the fly, pre-populate the set of unmap ops.
This allows the grant unmap for the kernel mappings to be trivially
batched in the future.
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
The IRET instruction should clear NMI masking, but the current implementation
does not do so.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Implement a clockevent device based on the timer support available on
Hyper-V.
In this version of the patch I have addressed Jason's review comments.
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Reviewed-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The Witcher 2 did something like this to allocate a TLS segment index:
struct user_desc u_info;
bzero(&u_info, sizeof(u_info));
u_info.entry_number = (uint32_t)-1;
syscall(SYS_set_thread_area, &u_info);
Strictly speaking, this code was never correct. It should have set
read_exec_only and seg_not_present to 1 to indicate that it wanted
to find a free slot without putting anything there, or it should
have put something sensible in the TLS slot if it wanted to allocate
a TLS entry for real. The actual effect of this code was to
allocate a bogus segment that could be used to exploit espfix.
The set_thread_area hardening patches changed the behavior, causing
set_thread_area to return -EINVAL and crashing the game.
This changes set_thread_area to interpret this as a request to find
a free slot and to leave it empty, which isn't *quite* what the game
expects but should be close enough to keep it working. In
particular, using the code above to allocate two segments will
allocate the same segment both times.
According to FrostbittenKing on Github, this fixes The Witcher 2.
If this somehow still causes problems, we could instead allocate
a limit==0 32-bit data segment, but that seems rather ugly to me.
Fixes: 41bdc78544 x86/tls: Validate TLS entries to protect espfix
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Cc: stable@vger.kernel.org
Cc: torvalds@linux-foundation.org
Link: http://lkml.kernel.org/r/0cb251abe1ff0958b8e468a9a9a905b80ae3a746.1421954363.git.luto@amacapital.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The 3.19 merge window saw some TLB modifications merged which caused a
performance regression. They were fixed in commit 045bbb9fa.
Once that fix was applied, I also noticed that there was a small
but intermittent regression still present. It was not present
consistently enough to bisect reliably, but I'm fairly confident
that it came from (my own) MPX patches. The source was reading
a relatively unused field in the mm_struct via arch_unmap.
I also noted that this code was in the main instruction flow of
do_munmap() and probably had more icache impact than we want.
This patch does two things:
1. Adds a static (via Kconfig) and dynamic (via cpuid) check
for MPX with cpu_feature_enabled(). This keeps us from
reading that cacheline in the mm and trades it for a check
of the global CPUID variables at least on CPUs without MPX.
2. Adds an unlikely() to ensure that the MPX call ends up out
of the main instruction flow in do_munmap(). I've added
a detailed comment about why this was done and why we want
it even on systems where MPX is present.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: luto@amacapital.net
Cc: Dave Hansen <dave@sr71.net>
Link: http://lkml.kernel.org/r/20150108223021.AEEAB987@viggo.jf.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Extend apic_bsp_setup() so the same code flow can be used for
APIC_init_uniprocessor().
Folded Jiangs fix to provide proper ordering of the UP setup.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Link: http://lkml.kernel.org/r/20150115211704.084765674@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
We better provide proper functions which implement the required code
flow in the apic code rather than letting the smpboot code open code
it. That allows to make more functions static and confines the APIC
functionality to apic.c where it belongs.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Borislav Petkov <bp@alien8.de>
Cc: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Tony Luck <tony.luck@intel.com>
Link: http://lkml.kernel.org/r/20150115211703.907616730@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
To avoid lots of ifdeffery provide proper stubs for setup_IO_APIC(),
enable_IO_APIC() and setup_ioapic_dest().
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Borislav Petkov <bp@alien8.de>
Cc: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Tony Luck <tony.luck@intel.com>
Link: http://lkml.kernel.org/r/20150115211703.397170414@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
No point for a separate header file.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Link: http://lkml.kernel.org/r/20150115211703.304126687@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
enable_x2apic() is a convoluted unreadable mess because it is used for
both enablement in early boot and for setup in cpu_init().
Split the code into x2apic_enable() for enablement and x2apic_setup()
for setup of (secondary cpus). Make use of the new state tracking to
simplify the logic.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Link: http://lkml.kernel.org/r/20150115211703.129287153@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
There is no point in postponing the hardware disablement of x2apic. It
can be disabled right away in the nox2apic setup function.
Disable it right away and set the state to DISABLED . This allows to
remove all the nox2apic conditionals all over the place.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Link: http://lkml.kernel.org/r/20150115211703.051214090@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
No point in having try_to_enable_x2apic() outside of the
CONFIG_X86_X2APIC section and having inline functions and more ifdefs
to deal with it. Move the code into the existing ifdef section and
remove the inline cruft.
Fixup the printk about not enabling interrupt remapping as suggested
by Boris.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Link: http://lkml.kernel.org/r/20150115211702.795388613@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
No point in delaying the x2apic detection for the CONFIG_X86_X2APIC=n
case to enable_IR_x2apic(). We rather detect that before we try to
setup anything there.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Link: http://lkml.kernel.org/r/20150115211702.702479404@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The x2apic_preenabled flag is just a horrible hack and if X2APIC
support is disabled it does not reflect the actual hardware
state. Check the hardware instead.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Link: http://lkml.kernel.org/r/20150115211702.541280622@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
No point in having a static variable around which is always 0. Let the
compiler optimize code out if disabled.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Borislav Petkov <bp@alien8.de>
Cc: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Tony Luck <tony.luck@intel.com>
Link: http://lkml.kernel.org/r/20150115211702.363274310@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
enable_IR_x2apic() grew a open coded x2apic detection. Implement a
proper helper function which shares the code with the already existing
x2apic_enabled().
Made it use rdmsrl_safe as suggested by Boris.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Tony Luck <tony.luck@intel.com>
Link: http://lkml.kernel.org/r/20150115211702.285038186@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
arch/x86/kvm/emulate.c: In function ‘check_cr_write’:
arch/x86/kvm/emulate.c:3552:4: warning: left shift count >= width of type
rsvd = CR3_L_MODE_RESERVED_BITS & ~CR3_PCID_INVD;
happens because sizeof(UL) on 32-bit is 4 bytes but we shift it 63 bits
to the left.
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
SuSE's 2.6.16 kernel fails to boot if the delta between tsc_timestamp
and rdtsc is larger than a given threshold:
* If we get more than the below threshold into the future, we rerequest
* the real time from the host again which has only little offset then
* that we need to adjust using the TSC.
*
* For now that threshold is 1/5th of a jiffie. That should be good
* enough accuracy for completely broken systems, but also give us swing
* to not call out to the host all the time.
*/
#define PVCLOCK_DELTA_MAX ((1000000000ULL / HZ) / 5)
Disable masterclock support (which increases said delta) in case the
boot vcpu does not use MSR_KVM_SYSTEM_TIME_NEW.
Upstreams kernels which support pvclock vsyscalls (and therefore make
use of PVCLOCK_STABLE_BIT) use MSR_KVM_SYSTEM_TIME_NEW.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
math_state_restore() can race with kernel_fpu_begin() if irq comes
right after __thread_fpu_begin(), __save_init_fpu() will overwrite
fpu->state we are going to restore.
Add 2 simple helpers, kernel_fpu_disable() and kernel_fpu_enable()
which simply set/clear in_kernel_fpu, and change math_state_restore()
to exclude kernel_fpu_begin() in between.
Alternatively we could use local_irq_save/restore, but probably these
new helpers can have more users.
Perhaps they should disable/enable preemption themselves, in this case
we can remove preempt_disable() in __restore_xstate_sig().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: matt.fleming@intel.com
Cc: bp@suse.de
Cc: pbonzini@redhat.com
Cc: luto@amacapital.net
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Suresh Siddha <sbsiddha@gmail.com>
Link: http://lkml.kernel.org/r/20150115192028.GD27332@redhat.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
interrupted_kernel_fpu_idle() tries to detect if kernel_fpu_begin()
is safe or not. In particular it should obviously deny the nested
kernel_fpu_begin() and this logic looks very confusing.
If use_eager_fpu() == T we rely on a) __thread_has_fpu() check in
interrupted_kernel_fpu_idle(), and b) on the fact that _begin() does
__thread_clear_has_fpu().
Otherwise we demand that the interrupted task has no FPU if it is in
kernel mode, this works because __kernel_fpu_begin() does clts() and
interrupted_kernel_fpu_idle() checks X86_CR0_TS.
Add the per-cpu "bool in_kernel_fpu" variable, and change this code
to check/set/clear it. This allows to do more cleanups and fixes, see
the next changes.
The patch also moves WARN_ON_ONCE() under preempt_disable() just to
make this_cpu_read() look better, this is not really needed. And in
fact I think we should move it into __kernel_fpu_begin().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: matt.fleming@intel.com
Cc: bp@suse.de
Cc: pbonzini@redhat.com
Cc: luto@amacapital.net
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Suresh Siddha <sbsiddha@gmail.com>
Link: http://lkml.kernel.org/r/20150115191943.GB27332@redhat.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The PSS register reflects the power state of each island on SoC. It would be
useful to know which of the islands is on or off at the momemnt.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: Aubrey Li <aubrey.li@linux.intel.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Kumar P. Mahesh <mahesh.kumar.p@intel.com>
Link: http://lkml.kernel.org/r/1421253575-22509-6-git-send-email-andriy.shevchenko@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Xen overrides __acpi_register_gsi and leaves __acpi_unregister_gsi as is.
That means, an IRQ allocated by acpi_register_gsi_xen_hvm() or
acpi_register_gsi_xen() will be freed by acpi_unregister_gsi_ioapic(),
which may cause undesired effects. So override __acpi_unregister_gsi to
NULL for safety.
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Tested-by: Sander Eikelenboom <linux@eikelenboom.it>
Cc: Tony Luck <tony.luck@intel.com>
Cc: xen-devel@lists.xenproject.org
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Graeme Gregory <graeme.gregory@linaro.org>
Cc: Lv Zheng <lv.zheng@intel.com>
Link: http://lkml.kernel.org/r/1421720467-7709-4-git-send-email-jiang.liu@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
commit 78bff1c868 ("x86/ticketlock: Fix spin_unlock_wait() livelock")
introduced two additional ACCESS_ONCE cases in x86 spinlock.h.
Lets change those as well.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
We now have a generic function that does most of the work of
kvm_vm_ioctl_get_dirty_log, now use it.
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Mario Smarduch <m.smarduch@samsung.com>
Simplify irq_remapping code by killing irq_remapping_supported() and
related interfaces.
Joerg posted a similar patch at https://lkml.org/lkml/2014/12/15/490,
so assume an signed-off from Joerg.
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Tested-by: Joerg Roedel <joro@8bytes.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: iommu@lists.linux-foundation.org
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: David Rientjes <rientjes@google.com>
Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Cc: Jan Beulich <JBeulich@suse.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Oren Twaig <oren@scalemp.com>
Link: http://lkml.kernel.org/r/1420615903-28253-14-git-send-email-jiang.liu@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
enable_IR_x2apic() calls setup_irq_remapping_ops() which by default
installs the intel dmar remapping ops and then calls the amd iommu irq
remapping prepare callback to figure out whether we are running on an
AMD machine with irq remapping hardware.
Right after that it calls irq_remapping_prepare() which pointlessly
checks:
if (!remap_ops || !remap_ops->prepare)
return -ENODEV;
and then calls
remap_ops->prepare()
which is silly in the AMD case as it got called from
setup_irq_remapping_ops() already a few microseconds ago.
Simplify this and just collapse everything into
irq_remapping_prepare().
The irq_remapping_prepare() remains still silly as it assigns blindly
the intel ops, but that's not scope of this patch.
The scope here is to move the preperatory work, i.e. memory
allocations out of the atomic section which is required to enable irq
remapping.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Borislav Petkov <bp@alien8.de>
Acked-and-tested-by: Joerg Roedel <joro@8bytes.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: iommu@lists.linux-foundation.org
Cc: Joerg Roedel <jroedel@suse.de>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Cc: Jan Beulich <JBeulich@suse.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Oren Twaig <oren@scalemp.com>
Cc: x86@kernel.org
Link: http://lkml.kernel.org/r/20141205084147.232633738@linutronix.de
Link: http://lkml.kernel.org/r/1420615903-28253-2-git-send-email-jiang.liu@linux.intel.com
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
At the moment, if p and x are both tagged as bitwise types,
some of get_user(x, p), put_user(x, p), __get_user(x, p), __put_user(x, p)
might produce a sparse warning on many architectures.
This is a false positive: *p on these architectures is loaded into long
(typically using asm), then cast back to typeof(*p).
When typeof(*p) is a bitwise type (which is uncommon), such a cast needs
__force, otherwise sparse produces a warning.
Some architectures already have the __force tag, add it
where it's missing.
I verified that adding these __force casts does not supress any useful warnings.
Specifically, vhost wants to read/write bitwise types in userspace memory
using get_user/put_user.
At the moment this triggers sparse errors, since the value is passed through an
integer.
For example:
__le32 __user *p;
__u32 x;
both
put_user(x, p);
and
get_user(x, p);
should be safe, but produce warnings on some architectures.
While there, I noticed that a bunch of architectures violated
coding style rules within uaccess macros.
Included patches to fix them up.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJUtS+YAAoJECgfDbjSjVRpQ/QIAKXOc6tMXo+r/F32YC0Fv74G
W4VKIk7u9XQNjOzez9i+xce75YBDBKHk5R9kLCfAg6Zew+6NRgbBV+QjGVB8dpot
2GxajcVhOySgaR45sGK3Ldg5yVz5ficqZEyYWKNgYeyMWJdlpvUk+4W5q15TiPZe
u+C57/KzfRMDHyv3UkwAbqrkYGE0h7vXBi0BmOdCJlbKjG+6kFoVU/dAWsByDD5p
q54ji8UdIkh2oyH5qhSbAwQN4Cg5N37Agw86HwltjQFJAVvV3yPRUsv7MQnpRB1+
hKlPXPUarNozGVV7OlcvGa9Lvz8m3a2rNd9+1tgHY0Fpia1JYAY2UdubS99fl5E=
=LVcN
-----END PGP SIGNATURE-----
Merge tag 'uaccess_for_upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost into asm-generic
Merge "uaccess: fix sparse warning on get/put_user for bitwise types" from Michael S. Tsirkin:
At the moment, if p and x are both tagged as bitwise types,
some of get_user(x, p), put_user(x, p), __get_user(x, p), __put_user(x, p)
might produce a sparse warning on many architectures.
This is a false positive: *p on these architectures is loaded into long
(typically using asm), then cast back to typeof(*p).
When typeof(*p) is a bitwise type (which is uncommon), such a cast needs
__force, otherwise sparse produces a warning.
Some architectures already have the __force tag, add it
where it's missing.
I verified that adding these __force casts does not supress any useful warnings.
Specifically, vhost wants to read/write bitwise types in userspace memory
using get_user/put_user.
At the moment this triggers sparse errors, since the value is passed through an
integer.
For example:
__le32 __user *p;
__u32 x;
both
put_user(x, p);
and
get_user(x, p);
should be safe, but produce warnings on some architectures.
While there, I noticed that a bunch of architectures violated
coding style rules within uaccess macros.
Included patches to fix them up.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
* tag 'uaccess_for_upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (37 commits)
sparc32: nocheck uaccess coding style tweaks
sparc64: nocheck uaccess coding style tweaks
xtensa: macro whitespace fixes
sh: macro whitespace fixes
parisc: macro whitespace fixes
m68k: macro whitespace fixes
m32r: macro whitespace fixes
frv: macro whitespace fixes
cris: macro whitespace fixes
avr32: macro whitespace fixes
arm64: macro whitespace fixes
arm: macro whitespace fixes
alpha: macro whitespace fixes
blackfin: macro whitespace fixes
sparc64: uaccess_64 macro whitespace fixes
sparc32: uaccess_32 macro whitespace fixes
avr32: whitespace fix
sh: fix put_user sparse errors
metag: fix put_user sparse errors
ia64: fix put_user sparse errors
...
A define, two macros and an unreferenced bit of assembly are gone.
Acked-by: Borislav Petkov <bp@suse.de>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Oleg Nesterov <oleg@redhat.com>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: X86 ML <x86@kernel.org>
CC: Alexei Starovoitov <ast@plumgrid.com>
CC: Will Drewry <wad@chromium.org>
CC: Kees Cook <keescook@chromium.org>
CC: linux-kernel@vger.kernel.org
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
virtio wants to read bitwise types from userspace using get_user. At the
moment this triggers sparse errors, since the value is passed through an
integer.
Fix that up using __force.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
We are aborting a build in case when gcc doesn't support fentry on x86_64
(regs->ip modification can't really reliably work with mcount).
This however breaks allmodconfig for people with older gccs that don't
support -mfentry.
Turn the build-time failure into runtime failure, resulting in the whole
infrastructure not being initialized if CC_USING_FENTRY is unset.
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
When emulating an instruction that reads the destination memory operand (i.e.,
instructions without the Mov flag in the emulator), the operand is first read.
If a page-fault is detected in this phase, the error-code which would be
delivered to the VM does not indicate that the access that caused the exception
is a write one. This does not conform with real hardware, and may cause the VM
to enter the page-fault handler twice for no reason (once for read, once for
write).
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kvm_x86_ops->test_posted_interrupt() returns true/false depending
whether 'vector' is set.
Next patch makes use of this interface.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>