#include <linux/bug.h>
#include <linux/kernel.h>
#include <linux/errno.h>
#include <linux/string.h>
#include <linux/types.h>
#include <linux/bug.h>
#include <linux/init.h>
#include <linux/interrupt.h>
#include <linux/spinlock.h>
#include <linux/mm.h>
#include <linux/uaccess.h>
#include <linux/ftrace.h>

#undef pr_fmt
#define pr_fmt(fmt) "Kernel/User page tables isolation: " fmt

#include <asm/kaiser.h>
#include <asm/tlbflush.h> /* to verify its kaiser declarations */
#include <asm/pgtable.h>
#include <asm/pgalloc.h>
#include <asm/desc.h>
#include <asm/cmdline.h>

int kaiser_enabled __read_mostly = 1;
EXPORT_SYMBOL(kaiser_enabled);	/* for inlined TLB flush functions */

__visible
DEFINE_PER_CPU_USER_MAPPED(unsigned long, unsafe_stack_register_backup);

/*
 * These can have bit 63 set, so we can not just use a plain "or"
 * instruction to get their value or'd into CR3.  It would take
 * another register.  So, we use a memory reference to these instead.
 *
 * This is also handy because systems that do not support PCIDs
 * just end up or'ing a 0 into their CR3, which does no harm.
 */
DEFINE_PER_CPU(unsigned long, x86_cr3_pcid_user);

/*
 * At runtime, the only things we map are some things for CPU
 * hotplug, and stacks for new processes.  No two CPUs will ever
 * be populating the same addresses, so we only need to ensure
 * that we protect between two CPUs trying to allocate and
 * populate the same page table page.
 *
 * Only take this lock when doing a set_p[4um]d(), but it is not
 * needed for doing a set_pte().  We assume that only the *owner*
 * of a given allocation will be doing this for _their_
 * allocation.
 *
 * This ensures that once a system has been running for a while
 * and there have been stacks all over and these page tables
 * are fully populated, there will be no further acquisitions of
 * this lock.
 */
static DEFINE_SPINLOCK(shadow_table_allocation_lock);

/*
 * Returns -1 on error.
 */
static inline unsigned long get_pa_from_mapping(unsigned long vaddr)
{
	pgd_t *pgd;
	pud_t *pud;
	pmd_t *pmd;
	pte_t *pte;

	pgd = pgd_offset_k(vaddr);
	/*
	 * We made all the kernel PGDs present in kaiser_init().
	 * We expect them to stay that way.
	 */
	BUG_ON(pgd_none(*pgd));
	/*
	 * PGDs are either 512GB or 128TB on all x86_64
	 * configurations.  We don't handle these.
	 */
	BUG_ON(pgd_large(*pgd));

	pud = pud_offset(pgd, vaddr);
	if (pud_none(*pud)) {
		WARN_ON_ONCE(1);
		return -1;
	}

	if (pud_large(*pud))
		return (pud_pfn(*pud) << PAGE_SHIFT) | (vaddr & ~PUD_PAGE_MASK);

	pmd = pmd_offset(pud, vaddr);
	if (pmd_none(*pmd)) {
		WARN_ON_ONCE(1);
		return -1;
	}

	if (pmd_large(*pmd))
		return (pmd_pfn(*pmd) << PAGE_SHIFT) | (vaddr & ~PMD_PAGE_MASK);

	pte = pte_offset_kernel(pmd, vaddr);
	if (pte_none(*pte)) {
		WARN_ON_ONCE(1);
		return -1;
	}

	return (pte_pfn(*pte) << PAGE_SHIFT) | (vaddr & ~PAGE_MASK);
}

/*
 * This is a relatively normal page table walk, except that it
 * also tries to allocate page tables pages along the way.
 *
 * Returns a pointer to a PTE on success, or NULL on failure.
 */
static pte_t *kaiser_pagetable_walk(unsigned long address)
{
	pmd_t *pmd;
	pud_t *pud;
	pgd_t *pgd = native_get_shadow_pgd(pgd_offset_k(address));
	gfp_t gfp = (GFP_KERNEL | __GFP_NOTRACK | __GFP_ZERO);

	if (pgd_none(*pgd)) {
		WARN_ONCE(1, "All shadow pgds should have been populated");
		return NULL;
	}
	BUILD_BUG_ON(pgd_large(*pgd) != 0);

	pud = pud_offset(pgd, address);
	/* The shadow page tables do not use large mappings: */
	if (pud_large(*pud)) {
		WARN_ON(1);
		return NULL;
	}
	if (pud_none(*pud)) {
		unsigned long new_pmd_page = __get_free_page(gfp);
		if (!new_pmd_page)
			return NULL;
		spin_lock(&shadow_table_allocation_lock);
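		/*
		 * Re-check pud_none() under the lock: another CPU may
		 * have installed its own pmd page here while we were
		 * allocating ours, in which case ours is freed below.
		 */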
		if (pud_none(*pud)) {
			set_pud(pud, __pud(_KERNPG_TABLE | __pa(new_pmd_page)));
			__inc_zone_page_state(virt_to_page((void *)
						new_pmd_page), NR_KAISERTABLE);
		} else
			free_page(new_pmd_page);
		spin_unlock(&shadow_table_allocation_lock);
	}

	pmd = pmd_offset(pud, address);
	/* The shadow page tables do not use large mappings: */
	if (pmd_large(*pmd)) {
		WARN_ON(1);
		return NULL;
	}
	if (pmd_none(*pmd)) {
		unsigned long new_pte_page = __get_free_page(gfp);
		if (!new_pte_page)
			return NULL;
		spin_lock(&shadow_table_allocation_lock);
		if (pmd_none(*pmd)) {
			set_pmd(pmd, __pmd(_KERNPG_TABLE | __pa(new_pte_page)));
			__inc_zone_page_state(virt_to_page((void *)
						new_pte_page), NR_KAISERTABLE);
		} else
			free_page(new_pte_page);
		spin_unlock(&shadow_table_allocation_lock);
	}

	return pte_offset_kernel(pmd, address);
}
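
/*
 * Walk the kernel range [__start_addr, __start_addr + size) page by
 * page, look up each page's physical address in the kernel mapping,
 * and install a matching PTE in the shadow (user) page tables with
 * the given protection flags.
 */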
static int kaiser_add_user_map(const void *__start_addr, unsigned long size,
			       unsigned long flags)
{
	int ret = 0;
	pte_t *pte;
	unsigned long start_addr = (unsigned long)__start_addr;
	unsigned long address = start_addr & PAGE_MASK;
	unsigned long end_addr = PAGE_ALIGN(start_addr + size);
	unsigned long target_address;

	/*
	 * It is convenient for callers to pass in __PAGE_KERNEL etc,
	 * and there is no actual harm from setting _PAGE_GLOBAL, so
	 * long as CR4.PGE is not set.  But it is nonetheless troubling
	 * to see Kaiser itself setting _PAGE_GLOBAL (now that "nokaiser"
	 * requires that not to be #defined to 0): so mask it off here.
	 */
	flags &= ~_PAGE_GLOBAL;

	for (; address < end_addr; address += PAGE_SIZE) {
		target_address = get_pa_from_mapping(address);
		if (target_address == -1) {
			ret = -EIO;
			break;
		}
		pte = kaiser_pagetable_walk(address);
		if (!pte) {
			ret = -ENOMEM;
			break;
		}
		if (pte_none(*pte)) {
			set_pte(pte, __pte(flags | target_address));
		} else {
			pte_t tmp;
			set_pte(&tmp, __pte(flags | target_address));
			WARN_ON_ONCE(!pte_same(*pte, tmp));
		}
	}
	return ret;
}

static int kaiser_add_user_map_ptrs(const void *start, const void *end, unsigned long flags)
{
	unsigned long size = end - start;

	return kaiser_add_user_map(start, size, flags);
}

/*
 * Ensure that the top level of the (shadow) page tables are
 * entirely populated.  This ensures that all processes that get
 * forked have the same entries.  This way, we do not have to
 * ever go set up new entries in older processes.
 *
 * Note: we never free these, so there are no updates to them
 * after this.
 */
static void __init kaiser_init_all_pgds(void)
{
	pgd_t *pgd;
	int i = 0;

	pgd = native_get_shadow_pgd(pgd_offset_k((unsigned long)0));
	for (i = PTRS_PER_PGD / 2; i < PTRS_PER_PGD; i++) {
		pgd_t new_pgd;
		pud_t *pud = pud_alloc_one(&init_mm,
					   PAGE_OFFSET + i * PGDIR_SIZE);
		if (!pud) {
			WARN_ON(1);
			break;
		}
		inc_zone_page_state(virt_to_page(pud), NR_KAISERTABLE);
		new_pgd = __pgd(_KERNPG_TABLE | __pa(pud));
		/*
		 * Make sure not to stomp on some other pgd entry.
		 */
		if (!pgd_none(pgd[i])) {
			WARN_ON(1);
			continue;
		}
		set_pgd(pgd + i, new_pgd);
	}
}
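
/*
 * Boot-time helpers: these mappings are set up before the first user
 * process runs, so any failure is a kernel bug; warn loudly instead
 * of returning an error.
 */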
#define kaiser_add_user_map_early(start, size, flags) do {	\
	int __ret = kaiser_add_user_map(start, size, flags);	\
	WARN_ON(__ret);						\
} while (0)

#define kaiser_add_user_map_ptrs_early(start, end, flags) do {		\
	int __ret = kaiser_add_user_map_ptrs(start, end, flags);	\
	WARN_ON(__ret);							\
} while (0)
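
/*
 * Parse the boot command line: "pti=on/off/auto" and "nopti" control
 * whether kernel page-table isolation is used.  Xen PV guests disable
 * it silently, and in auto mode it is disabled on AMD CPUs, which are
 * not considered vulnerable to Meltdown.
 */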
void __init kaiser_check_boottime_disable(void)
{
	bool enable = true;
	char arg[5];
	int ret;

	if (boot_cpu_has(X86_FEATURE_XENPV))
		goto silent_disable;

	ret = cmdline_find_option(boot_command_line, "pti", arg, sizeof(arg));
	if (ret > 0) {
		if (!strncmp(arg, "on", 2))
			goto enable;

		if (!strncmp(arg, "off", 3))
			goto disable;

		if (!strncmp(arg, "auto", 4))
			goto skip;
	}

	if (cmdline_find_option_bool(boot_command_line, "nopti"))
		goto disable;

skip:
	if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD)
		goto disable;

enable:
	if (enable)
		setup_force_cpu_cap(X86_FEATURE_KAISER);

	return;

disable:
	pr_info("disabled\n");

silent_disable:
	kaiser_enabled = 0;
	setup_clear_cpu_cap(X86_FEATURE_KAISER);
}

/*
 * If anything in here fails, we will likely die on one of the
 * first kernel->user transitions and init will die.  But, we
 * will have most of the kernel up by then and should be able to
 * get a clean warning out of it.  If we BUG_ON() here, we run
 * the risk of being before we have good console output.
 */
void __init kaiser_init(void)
{
	int cpu;

	if (!kaiser_enabled)
		return;

	kaiser_init_all_pgds();
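
	/*
	 * Map every CPU's __per_cpu_user_mapped section into the shadow
	 * page tables, so the data it holds remains reachable while the
	 * entry/exit code is still running on the user CR3.
	 */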
	for_each_possible_cpu(cpu) {
		void *percpu_vaddr = __per_cpu_user_mapped_start +
				     per_cpu_offset(cpu);
		unsigned long percpu_sz = __per_cpu_user_mapped_end -
					  __per_cpu_user_mapped_start;
		kaiser_add_user_map_early(percpu_vaddr, percpu_sz,
					  __PAGE_KERNEL);
	}

	/*
	 * Map the entry/exit text section, which is needed at
	 * switches from user to and from kernel.
	 */
	kaiser_add_user_map_ptrs_early(__entry_text_start, __entry_text_end,
				       __PAGE_KERNEL_RX);

#if defined(CONFIG_FUNCTION_GRAPH_TRACER) || defined(CONFIG_KASAN)
	kaiser_add_user_map_ptrs_early(__irqentry_text_start,
				       __irqentry_text_end,
				       __PAGE_KERNEL_RX);
#endif
	kaiser_add_user_map_early((void *)idt_descr.address,
				  sizeof(gate_desc) * NR_VECTORS,
				  __PAGE_KERNEL_RO);
#ifdef CONFIG_TRACING
	kaiser_add_user_map_early(&trace_idt_descr,
				  sizeof(trace_idt_descr),
				  __PAGE_KERNEL);
	kaiser_add_user_map_early(&trace_idt_table,
				  sizeof(gate_desc) * NR_VECTORS,
				  __PAGE_KERNEL);
#endif
	kaiser_add_user_map_early(&debug_idt_descr, sizeof(debug_idt_descr),
				  __PAGE_KERNEL);
	kaiser_add_user_map_early(&debug_idt_table,
				  sizeof(gate_desc) * NR_VECTORS,
				  __PAGE_KERNEL);

	pr_info("enabled\n");
}

/* Add a mapping to the shadow mapping, and synchronize the mappings */
int kaiser_add_mapping(unsigned long addr, unsigned long size, unsigned long flags)
{
	if (!kaiser_enabled)
		return 0;
	return kaiser_add_user_map((const void *)addr, size, flags);
}
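
/*
 * Clear the shadow PTEs for [start, start + size).  The intermediate
 * shadow page-table pages themselves are never freed (hence the
 * "_nofree" unmap helper): they stay allocated for later reuse.
 */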
void kaiser_remove_mapping(unsigned long start, unsigned long size)
{
	extern void unmap_pud_range_nofree(pgd_t *pgd,
				unsigned long start, unsigned long end);
	unsigned long end = start + size;
	unsigned long addr, next;
	pgd_t *pgd;

	if (!kaiser_enabled)
		return;
	pgd = native_get_shadow_pgd(pgd_offset_k(start));
	for (addr = start; addr < end; pgd++, addr = next) {
		next = pgd_addr_end(addr, end);
		unmap_pud_range_nofree(pgd, addr, next);
	}
}

/*
 * Page table pages are page-aligned.  The lower half of the top
 * level is used for userspace and the top half for the kernel.
 * This returns true for user pages that need to get copied into
 * both the user and kernel copies of the page tables, and false
 * for kernel pages that should only be in the kernel copy.
 */
static inline bool is_userspace_pgd(pgd_t *pgdp)
{
	return ((unsigned long)pgdp % PAGE_SIZE) < (PAGE_SIZE / 2);
}
|
|
|
|
|
|
|
|
pgd_t kaiser_set_shadow_pgd(pgd_t *pgdp, pgd_t pgd)
|
|
|
|
{
|
kaiser: add "nokaiser" boot option, using ALTERNATIVE
Added "nokaiser" boot option: an early param like "noinvpcid".
Most places now check int kaiser_enabled (#defined 0 when not
CONFIG_KAISER) instead of #ifdef CONFIG_KAISER; but entry_64.S
and entry_64_compat.S are using the ALTERNATIVE technique, which
patches in the preferred instructions at runtime. That technique
is tied to x86 cpu features, so X86_FEATURE_KAISER is fabricated.
Prior to "nokaiser", Kaiser #defined _PAGE_GLOBAL 0: revert that,
but be careful with both _PAGE_GLOBAL and CR4.PGE: setting them when
nokaiser like when !CONFIG_KAISER, but not setting either when kaiser -
neither matters on its own, but it's hard to be sure that _PAGE_GLOBAL
won't get set in some obscure corner, or something add PGE into CR4.
By omitting _PAGE_GLOBAL from __supported_pte_mask when kaiser_enabled,
all page table setup which uses pte_pfn() masks it out of the ptes.
It's slightly shameful that the same declaration versus definition of
kaiser_enabled appears in not one, not two, but in three header files
(asm/kaiser.h, asm/pgtable.h, asm/tlbflush.h). I felt safer that way,
than with #including any of those in any of the others; and did not
feel it worth an asm/kaiser_enabled.h - kernel/cpu/common.c includes
them all, so we shall hear about it if they get out of synch.
Cleanups while in the area: removed the silly #ifdef CONFIG_KAISER
from kaiser.c; removed the unused native_get_normal_pgd(); removed
the spurious reg clutter from SWITCH_*_CR3 macro stubs; corrected some
comments. But more interestingly, set CR4.PSE in secondary_startup_64:
the manual is clear that it does not matter whether it's 0 or 1 when
4-level-pts are enabled, but I was distracted to find cr4 different on
BSP and auxiliaries - BSP alone was adding PSE, in probe_page_size_mask().
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-09-24 16:59:49 -07:00
|
|
|
if (!kaiser_enabled)
|
|
|
|
return pgd;
|
kaiser: do not set _PAGE_NX on pgd_none
native_pgd_clear() uses native_set_pgd(), so native_set_pgd() must
avoid setting the _PAGE_NX bit on an otherwise pgd_none() entry:
usually that just generated a warning on exit, but sometimes
more mysterious and damaging failures (our production machines
could not complete booting).
The original fix to this just avoided adding _PAGE_NX to
an empty entry; but eventually more problems surfaced with kexec,
and EFI mapping expected to be a problem too. So now instead
change native_set_pgd() to update shadow only if _PAGE_USER:
A few places (kernel/machine_kexec_64.c, platform/efi/efi_64.c for sure)
use set_pgd() to set up a temporary internal virtual address space, with
physical pages remapped at what Kaiser regards as userspace addresses:
Kaiser then assumes a shadow pgd follows, which it will try to corrupt.
This appears to be responsible for the recent kexec and kdump failures;
though it's unclear how those did not manifest as a problem before.
Ah, the shadow pgd will only be assumed to "follow" if the requested
pgd is on an even-numbered page: so I suppose it was going wrong 50%
of the time all along.
What we need is a flag to set_pgd(), to tell it we're dealing with
userspace. Er, isn't that what the pgd's _PAGE_USER bit is saying?
Add a test for that. But we cannot do the same for pgd_clear()
(which may be called to clear corrupted entries - set aside the
question of "corrupt in which pgd?" until later), so there just
rely on pgd_clear() not being called in the problematic cases -
with a WARN_ON_ONCE() which should fire half the time if it is.
But this is getting too big for an inline function: move it into
arch/x86/mm/kaiser.c (which then demands a boot/compressed mod);
and de-void and de-space native_get_shadow/normal_pgd() while here.
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-09-05 12:05:01 -07:00
|
|
|
/*
|
|
|
|
* Do we need to also populate the shadow pgd? Check _PAGE_USER to
|
|
|
|
* skip cases like kexec and EFI which make temporary low mappings.
|
|
|
|
*/
|
|
|
|
if (pgd.pgd & _PAGE_USER) {
|
|
|
|
if (is_userspace_pgd(pgdp)) {
|
|
|
|
native_get_shadow_pgd(pgdp)->pgd = pgd.pgd;
|
|
|
|
/*
|
|
|
|
* Even if the entry is *mapping* userspace, ensure
|
|
|
|
* that userspace can not use it. This way, if we
|
|
|
|
* get out to userspace running on the kernel CR3,
|
|
|
|
* userspace will crash instead of running.
|
|
|
|
*/
|
kaiser: Set _PAGE_NX only if supported
This resolves a crash if loaded under qemu + haxm under windows.
See https://www.spinics.net/lists/kernel/msg2689835.html for details.
Here is a boot log (the log is from chromeos-4.4, but Tao Wu says that
the same log is also seen with vanilla v4.4.110-rc1).
[ 0.712750] Freeing unused kernel memory: 552K
[ 0.721821] init: Corrupted page table at address 57b029b332e0
[ 0.722761] PGD 80000000bb238067 PUD bc36a067 PMD bc369067 PTE 45d2067
[ 0.722761] Bad pagetable: 000b [#1] PREEMPT SMP
[ 0.722761] Modules linked in:
[ 0.722761] CPU: 1 PID: 1 Comm: init Not tainted 4.4.96 #31
[ 0.722761] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.7.5.1-0-g8936dbb-20141113_115728-nilsson.home.kraxel.org 04/01/2014
[ 0.722761] task: ffff8800bc290000 ti: ffff8800bc28c000 task.ti: ffff8800bc28c000
[ 0.722761] RIP: 0010:[<ffffffff83f4129e>] [<ffffffff83f4129e>] __clear_user+0x42/0x67
[ 0.722761] RSP: 0000:ffff8800bc28fcf8 EFLAGS: 00010202
[ 0.722761] RAX: 0000000000000000 RBX: 00000000000001a4 RCX: 00000000000001a4
[ 0.722761] RDX: 0000000000000000 RSI: 0000000000000008 RDI: 000057b029b332e0
[ 0.722761] RBP: ffff8800bc28fd08 R08: ffff8800bc290000 R09: ffff8800bb2f4000
[ 0.722761] R10: ffff8800bc290000 R11: ffff8800bb2f4000 R12: 000057b029b332e0
[ 0.722761] R13: 0000000000000000 R14: 000057b029b33340 R15: ffff8800bb1e2a00
[ 0.722761] FS: 0000000000000000(0000) GS:ffff8800bfb00000(0000) knlGS:0000000000000000
[ 0.722761] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 0.722761] CR2: 000057b029b332e0 CR3: 00000000bb2f8000 CR4: 00000000000006e0
[ 0.722761] Stack:
[ 0.722761] 000057b029b332e0 ffff8800bb95fa80 ffff8800bc28fd18 ffffffff83f4120c
[ 0.722761] ffff8800bc28fe18 ffffffff83e9e7a1 ffff8800bc28fd68 0000000000000000
[ 0.722761] ffff8800bc290000 ffff8800bc290000 ffff8800bc290000 ffff8800bc290000
[ 0.722761] Call Trace:
[ 0.722761] [<ffffffff83f4120c>] clear_user+0x2e/0x30
[ 0.722761] [<ffffffff83e9e7a1>] load_elf_binary+0xa7f/0x18f7
[ 0.722761] [<ffffffff83de2088>] search_binary_handler+0x86/0x19c
[ 0.722761] [<ffffffff83de389e>] do_execveat_common.isra.26+0x909/0xf98
[ 0.722761] [<ffffffff844febe0>] ? rest_init+0x87/0x87
[ 0.722761] [<ffffffff83de40be>] do_execve+0x23/0x25
[ 0.722761] [<ffffffff83c002e3>] run_init_process+0x2b/0x2d
[ 0.722761] [<ffffffff844fec4d>] kernel_init+0x6d/0xda
[ 0.722761] [<ffffffff84505b2f>] ret_from_fork+0x3f/0x70
[ 0.722761] [<ffffffff844febe0>] ? rest_init+0x87/0x87
[ 0.722761] Code: 86 84 be 12 00 00 00 e8 87 0d e8 ff 66 66 90 48 89 d8 48 c1
eb 03 4c 89 e7 83 e0 07 48 89 d9 be 08 00 00 00 31 d2 48 85 c9 74 0a <48> 89 17
48 01 f7 ff c9 75 f6 48 89 c1 85 c9 74 09 88 17 48 ff
[ 0.722761] RIP [<ffffffff83f4129e>] __clear_user+0x42/0x67
[ 0.722761] RSP <ffff8800bc28fcf8>
[ 0.722761] ---[ end trace def703879b4ff090 ]---
[ 0.722761] BUG: sleeping function called from invalid context at /mnt/host/source/src/third_party/kernel/v4.4/kernel/locking/rwsem.c:21
[ 0.722761] in_atomic(): 0, irqs_disabled(): 1, pid: 1, name: init
[ 0.722761] CPU: 1 PID: 1 Comm: init Tainted: G D 4.4.96 #31
[ 0.722761] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5.1-0-g8936dbb-20141113_115728-nilsson.home.kraxel.org 04/01/2014
[ 0.722761] 0000000000000086 dcb5d76098c89836 ffff8800bc28fa30 ffffffff83f34004
[ 0.722761] ffffffff84839dc2 0000000000000015 ffff8800bc28fa40 ffffffff83d57dc9
[ 0.722761] ffff8800bc28fa68 ffffffff83d57e6a ffffffff84a53640 0000000000000000
[ 0.722761] Call Trace:
[ 0.722761] [<ffffffff83f34004>] dump_stack+0x4d/0x63
[ 0.722761] [<ffffffff83d57dc9>] ___might_sleep+0x13a/0x13c
[ 0.722761] [<ffffffff83d57e6a>] __might_sleep+0x9f/0xa6
[ 0.722761] [<ffffffff84502788>] down_read+0x20/0x31
[ 0.722761] [<ffffffff83cc5d9b>] __blocking_notifier_call_chain+0x35/0x63
[ 0.722761] [<ffffffff83cc5ddd>] blocking_notifier_call_chain+0x14/0x16
[ 0.800374] usb 1-1: new full-speed USB device number 2 using uhci_hcd
[ 0.722761] [<ffffffff83cefe97>] profile_task_exit+0x1a/0x1c
[ 0.802309] [<ffffffff83cac84e>] do_exit+0x39/0xe7f
[ 0.802309] [<ffffffff83ce5938>] ? vprintk_default+0x1d/0x1f
[ 0.802309] [<ffffffff83d7bb95>] ? printk+0x57/0x73
[ 0.802309] [<ffffffff83c46e25>] oops_end+0x80/0x85
[ 0.802309] [<ffffffff83c7b747>] pgtable_bad+0x8a/0x95
[ 0.802309] [<ffffffff83ca7f4a>] __do_page_fault+0x8c/0x352
[ 0.802309] [<ffffffff83eefba5>] ? file_has_perm+0xc4/0xe5
[ 0.802309] [<ffffffff83ca821c>] do_page_fault+0xc/0xe
[ 0.802309] [<ffffffff84507682>] page_fault+0x22/0x30
[ 0.802309] [<ffffffff83f4129e>] ? __clear_user+0x42/0x67
[ 0.802309] [<ffffffff83f4127f>] ? __clear_user+0x23/0x67
[ 0.802309] [<ffffffff83f4120c>] clear_user+0x2e/0x30
[ 0.802309] [<ffffffff83e9e7a1>] load_elf_binary+0xa7f/0x18f7
[ 0.802309] [<ffffffff83de2088>] search_binary_handler+0x86/0x19c
[ 0.802309] [<ffffffff83de389e>] do_execveat_common.isra.26+0x909/0xf98
[ 0.802309] [<ffffffff844febe0>] ? rest_init+0x87/0x87
[ 0.802309] [<ffffffff83de40be>] do_execve+0x23/0x25
[ 0.802309] [<ffffffff83c002e3>] run_init_process+0x2b/0x2d
[ 0.802309] [<ffffffff844fec4d>] kernel_init+0x6d/0xda
[ 0.802309] [<ffffffff84505b2f>] ret_from_fork+0x3f/0x70
[ 0.802309] [<ffffffff844febe0>] ? rest_init+0x87/0x87
[ 0.830559] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009
[ 0.830559]
[ 0.831305] Kernel Offset: 0x2c00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 0.831305] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009
The crash part of this problem may be solved with the following patch
(thanks to Hugh for the hint). There is still another problem, though -
with this patch applied, the qemu session aborts with "VCPU Shutdown
request", whatever that means.
Cc: lepton <ytht.net@gmail.com>
Signed-off-by: Guenter Roeck <groeck@chromium.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-04 13:41:55 -08:00
|
|
|
if (__supported_pte_mask & _PAGE_NX)
|
|
|
|
pgd.pgd |= _PAGE_NX;
|
kaiser: do not set _PAGE_NX on pgd_none
native_pgd_clear() uses native_set_pgd(), so native_set_pgd() must
avoid setting the _PAGE_NX bit on an otherwise pgd_none() entry:
usually that just generated a warning on exit, but sometimes
more mysterious and damaging failures (our production machines
could not complete booting).
The original fix to this just avoided adding _PAGE_NX to
an empty entry; but eventually more problems surfaced with kexec,
and EFI mapping expected to be a problem too. So now instead
change native_set_pgd() to update shadow only if _PAGE_USER:
A few places (kernel/machine_kexec_64.c, platform/efi/efi_64.c for sure)
use set_pgd() to set up a temporary internal virtual address space, with
physical pages remapped at what Kaiser regards as userspace addresses:
Kaiser then assumes a shadow pgd follows, which it will try to corrupt.
This appears to be responsible for the recent kexec and kdump failures;
though it's unclear how those did not manifest as a problem before.
Ah, the shadow pgd will only be assumed to "follow" if the requested
pgd is on an even-numbered page: so I suppose it was going wrong 50%
of the time all along.
What we need is a flag to set_pgd(), to tell it we're dealing with
userspace. Er, isn't that what the pgd's _PAGE_USER bit is saying?
Add a test for that. But we cannot do the same for pgd_clear()
(which may be called to clear corrupted entries - set aside the
question of "corrupt in which pgd?" until later), so there just
rely on pgd_clear() not being called in the problematic cases -
with a WARN_ON_ONCE() which should fire half the time if it is.
But this is getting too big for an inline function: move it into
arch/x86/mm/kaiser.c (which then demands a boot/compressed mod);
and de-void and de-space native_get_shadow/normal_pgd() while here.
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-09-05 12:05:01 -07:00
		}
	} else if (!pgd.pgd) {
		/*
		 * pgd_clear() cannot check _PAGE_USER, and is even used to
		 * clear corrupted pgd entries: so just rely on cases like
		 * kexec and EFI never to be using pgd_clear().
		 */
		if (!WARN_ON_ONCE((unsigned long)pgdp & PAGE_SIZE) &&
		    is_userspace_pgd(pgdp))
			native_get_shadow_pgd(pgdp)->pgd = pgd.pgd;
	}
	return pgd;
}
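For orientation, the code fragments above all belong to a single pgd-setting helper. A minimal sketch of how the _PAGE_USER test and the pgd_none() branch fit together follows; the function name kaiser_set_shadow_pgd and the body of is_userspace_pgd() are assumptions reconstructed from the commit text, not a verbatim copy of the file.

/* Sketch only: shadow-pgd update driven by the pgd entry's _PAGE_USER bit. */
static inline bool is_userspace_pgd(pgd_t *pgdp)
{
	/* Assumed helper: user pgd slots occupy the lower half of the pgd page. */
	return ((unsigned long)pgdp % PAGE_SIZE) < (PAGE_SIZE / 2);
}

pgd_t kaiser_set_shadow_pgd(pgd_t *pgdp, pgd_t pgd)
{
	if (pgd.pgd & _PAGE_USER) {
		if (is_userspace_pgd(pgdp)) {
			native_get_shadow_pgd(pgdp)->pgd = pgd.pgd;
			/* Keep the kernel-pgd copy unusable from userspace. */
			if (__supported_pte_mask & _PAGE_NX)
				pgd.pgd |= _PAGE_NX;
		}
	} else if (!pgd.pgd) {
		/* pgd_clear() case: see the comment in the fragment above. */
		if (!WARN_ON_ONCE((unsigned long)pgdp & PAGE_SIZE) &&
		    is_userspace_pgd(pgdp))
			native_get_shadow_pgd(pgdp)->pgd = pgd.pgd;
	}
	return pgd;
}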
kaiser: load_new_mm_cr3() let SWITCH_USER_CR3 flush user
We have many machines (Westmere, Sandybridge, Ivybridge) supporting
PCID but not INVPCID: on these load_new_mm_cr3() simply crashed.
Flushing user context inside load_new_mm_cr3() without the use of
invpcid is difficult: momentarily switch from kernel to user context
and back to do so? I'm not sure whether that can be safely done at
all, and would risk polluting user context with kernel internals,
and kernel context with stale user externals.
Instead, follow the hint in the comment that was there: change
X86_CR3_PCID_USER_VAR to be a per-cpu variable, then load_new_mm_cr3()
can leave a note in it, for SWITCH_USER_CR3 on return to userspace to
flush user context TLB, instead of default X86_CR3_PCID_USER_NOFLUSH.
Which works well enough that there's no need to do it this way only
when invpcid is unsupported: it's a good alternative to invpcid here.
But there's a couple of inlines in asm/tlbflush.h that need to do the
same trick, so it's best to localize all this per-cpu business in
mm/kaiser.c: moving that part of the initialization from setup_pcid()
to kaiser_setup_pcid(); with kaiser_flush_tlb_on_return_to_user() the
function for noting an X86_CR3_PCID_USER_FLUSH. And let's keep a
KAISER_SHADOW_PGD_OFFSET in there, to avoid the extra OR on exit.
I did try to make the feature tests in asm/tlbflush.h more consistent
with each other: there seem to be far too many ways of performing such
tests, and I don't have a good grasp of their differences. At first
I converted them all to be static_cpu_has(): but that proved to be a
mistake, as the comment in __native_flush_tlb_single() hints; so then
I reversed and made them all this_cpu_has(). Probably all gratuitous
change, but that's the way it's working at present.
I am slightly bothered by the way non-per-cpu X86_CR3_PCID_KERN_VAR
gets re-initialized by each cpu (before and after these changes):
no problem when (as usual) all cpus on a machine have the same
features, but in principle incorrect. However, my experiment
to per-cpu-ify that one did not end well...
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-08-17 15:00:37 -07:00
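To make the CR3 handling described above concrete: load_new_mm_cr3() only flushes the kernel PCID itself and leaves the user-side flush to the exit path. A rough sketch under the assumptions of this patch series follows (the X86_CR3_PCID_KERN_FLUSH constant and the CONFIG_KAISER guard are assumed names, not quoted from the file):

static void load_new_mm_cr3(pgd_t *pgdir)
{
	unsigned long new_mm_cr3 = __pa(pgdir);

#ifdef CONFIG_KAISER
	if (this_cpu_has(X86_FEATURE_PCID)) {
		/*
		 * Flush the kernel-PCID TLB entries now; user-PCID entries
		 * are flushed later, by SWITCH_USER_CR3 on return to user,
		 * via the note left in x86_cr3_pcid_user below.
		 */
		new_mm_cr3 |= X86_CR3_PCID_KERN_FLUSH;
		kaiser_flush_tlb_on_return_to_user();
	}
#endif
	write_cr3(new_mm_cr3);
}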
void kaiser_setup_pcid(void)
{
	unsigned long user_cr3 = KAISER_SHADOW_PGD_OFFSET;

2017-10-03 20:49:04 -07:00
	if (this_cpu_has(X86_FEATURE_PCID))
		user_cr3 |= X86_CR3_PCID_USER_NOFLUSH;

	/*
	 * These variables are used by the entry/exit
	 * code to change PCID and pgd and TLB flushing.
	 */
2017-08-27 16:24:27 -07:00
	this_cpu_write(x86_cr3_pcid_user, user_cr3);
}
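The user_cr3 value composed above is just the shadow-pgd offset plus a user PCID, with the CR3 bit-63 "no flush" hint set by default. A rough sketch of the constant layout assumed by this code (the names follow this patch series; the numeric values are illustrative, not taken from the headers):

/* Illustrative layout only: PCID occupies CR3 bits 0-11, bit 63 means "do not flush". */
#define X86_CR3_PCID_NOFLUSH		(1UL << 63)
#define X86_CR3_PCID_ASID_USER		1UL	/* assumed user ASID value */
#define X86_CR3_PCID_USER_FLUSH		(X86_CR3_PCID_ASID_USER)
#define X86_CR3_PCID_USER_NOFLUSH	(X86_CR3_PCID_NOFLUSH | X86_CR3_PCID_ASID_USER)
#define KAISER_SHADOW_PGD_OFFSET	0x1000	/* shadow pgd is the page after the kernel pgd */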
/*
 * Make a note that this cpu will need to flush USER tlb on return to user.
2017-11-04 18:43:06 -07:00
 * If cpu does not have PCID, then the NOFLUSH bit will never have been set.
 */
void kaiser_flush_tlb_on_return_to_user(void)
{
2017-11-04 18:43:06 -07:00
	if (this_cpu_has(X86_FEATURE_PCID))
		this_cpu_write(x86_cr3_pcid_user,
			X86_CR3_PCID_USER_FLUSH | KAISER_SHADOW_PGD_OFFSET);
}
EXPORT_SYMBOL(kaiser_flush_tlb_on_return_to_user);
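The consumer of x86_cr3_pcid_user is the SWITCH_USER_CR3 assembly on the kernel-exit path. Expressed as C pseudocode for clarity (a conceptual sketch only, not the real entry code; native_read_cr3()/native_write_cr3() stand in for the raw %cr3 moves):

/*
 * Conceptual C rendering of what the exit path does with the per-cpu
 * note left by kaiser_flush_tlb_on_return_to_user().  The real
 * implementation is an assembly macro in the entry code.
 */
static inline void switch_to_user_cr3_concept(void)
{
	unsigned long cr3 = native_read_cr3();	/* kernel pgd, kernel PCID */

	/*
	 * x86_cr3_pcid_user holds KAISER_SHADOW_PGD_OFFSET plus the user
	 * PCID, with the NOFLUSH bit set unless a flush was requested by
	 * kaiser_flush_tlb_on_return_to_user() since the last exit.
	 */
	cr3 |= this_cpu_read(x86_cr3_pcid_user);
	native_write_cr3(cr3);	/* now running on the shadow (user) pgd */
}

So that a requested flush is not repeated on every later exit, the real exit code also resets the note back to the NOFLUSH value after using it; that bookkeeping is omitted from this sketch.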