Commit graph

3906 commits

Author SHA1 Message Date
Al Viro
4f4202fe5a unify default ptrace_signal_deliver
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-11-29 00:01:23 -05:00
Al Viro
18c26c27ae death to idle_regs()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-11-28 23:43:42 -05:00
Al Viro
24465a40ba take sys_fork/sys_vfork/sys_clone prototypes to linux/syscalls.h
now it can be done...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-11-28 23:43:27 -05:00
Al Viro
1d4b4b2994 x86, um: switch to generic fork/vfork/clone
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-11-28 22:13:44 -05:00
Al Viro
6b94631f9e consolidate sys_execve() prototype
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-11-28 21:53:35 -05:00
Marcelo Tosatti
b48aa97e38 KVM: x86: require matched TSC offsets for master clock
With master clock, a pvclock clock read calculates:

ret = system_timestamp + [ (rdtsc + tsc_offset) - tsc_timestamp ]

Where 'rdtsc' is the host TSC.

system_timestamp and tsc_timestamp are unique, one tuple
per VM: the "master clock".

Given a host with synchronized TSCs, its obvious that
guest TSC must be matched for the above to guarantee monotonicity.

Allow master clock usage only if guest TSCs are synchronized.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-11-27 23:29:15 -02:00
Marcelo Tosatti
d828199e84 KVM: x86: implement PVCLOCK_TSC_STABLE_BIT pvclock flag
KVM added a global variable to guarantee monotonicity in the guest.
One of the reasons for that is that the time between

	1. ktime_get_ts(&timespec);
	2. rdtscll(tsc);

Is variable. That is, given a host with stable TSC, suppose that
two VCPUs read the same time via ktime_get_ts() above.

The time required to execute 2. is not the same on those two instances
executing in different VCPUS (cache misses, interrupts...).

If the TSC value that is used by the host to interpolate when
calculating the monotonic time is the same value used to calculate
the tsc_timestamp value stored in the pvclock data structure, and
a single <system_timestamp, tsc_timestamp> tuple is visible to all
vcpus simultaneously, this problem disappears. See comment on top
of pvclock_update_vm_gtod_copy for details.

Monotonicity is then guaranteed by synchronicity of the host TSCs
and guest TSCs.

Set TSC stable pvclock flag in that case, allowing the guest to read
clock from userspace.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-11-27 23:29:13 -02:00
Marcelo Tosatti
886b470cb1 KVM: x86: pass host_tsc to read_l1_tsc
Allow the caller to pass host tsc value to kvm_x86_ops->read_l1_tsc().

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-11-27 23:29:11 -02:00
Marcelo Tosatti
51c19b4f59 x86: vdso: pvclock gettime support
Improve performance of time system calls when using Linux pvclock,
by reading time info from fixmap visible copy of pvclock data.

Originally from Jeremy Fitzhardinge.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-11-27 23:29:11 -02:00
Marcelo Tosatti
3dc4f7cfb7 x86: kvm guest: pvclock vsyscall support
Hook into generic pvclock vsyscall code, with the aim to
allow userspace to have visibility into pvclock data.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-11-27 23:29:10 -02:00
Marcelo Tosatti
71056ae22d x86: pvclock: generic pvclock vsyscall initialization
Originally from Jeremy Fitzhardinge.

Introduce generic, non hypervisor specific, pvclock initialization
routines.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-11-27 23:29:09 -02:00
Marcelo Tosatti
189e11731a x86: pvclock: add note about rdtsc barriers
As noted by Gleb, not advertising SSE2 support implies
no RDTSC barriers.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-11-27 23:29:08 -02:00
Marcelo Tosatti
2697902be8 x86: pvclock: introduce helper to read flags
Acked-by: Glauber Costa <glommer@parallels.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-11-27 23:29:08 -02:00
Marcelo Tosatti
dce2db0a35 x86: pvclock: create helper for pvclock data retrieval
Originally from Jeremy Fitzhardinge.

So code can be reused.

Acked-by: Glauber Costa <glommer@parallels.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-11-27 23:29:07 -02:00
Linus Torvalds
2654ad44b5 Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 arch fixes from Peter Anvin:
 "Here is a collection of fixes for 3.7-rc7.  This is a superset of
  tglx' earlier pull request."

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86-64: Fix ordering of CFI directives and recent ASM_CLAC additions
  x86, microcode, AMD: Add support for family 16h processors
  x86-32: Export kernel_stack_pointer() for modules
  x86-32: Fix invalid stack address while in softirq
  x86, efi: Fix processor-specific memcpy() build error
  x86: remove dummy long from EFI stub
  x86, mm: Correct vmflag test for checking VM_HUGETLB
  x86, amd: Disable way access filter on Piledriver CPUs
  x86/mce: Do not change worker's running cpu in cmci_rediscover().
  x86/ce4100: Fix PCI configuration register access for devices without interrupts
  x86/ce4100: Fix reboot by forcing the reboot method to be KBD
  x86/ce4100: Fix pm_poweroff
  MAINTAINERS: Update email address for Robert Richter
  x86, microcode_amd: Change email addresses, MAINTAINERS entry
  MAINTAINERS: Change Boris' email address
  EDAC: Change Boris' email address
  x86, AMD: Change Boris' email address
2012-11-23 20:03:14 -10:00
Len Brown
3fc808aaa0 x86 power: define RAPL MSRs
The Run Time Average Power Limiting interface
is currently model specific, present on Sandy Bridge
and Ivy Bridge processors.

These #defines correspond to documentation in the latest
"Intel® 64 and IA-32 Architectures Software Developer Manual",
plus some typos in that document corrected.

Signed-off-by: Len Brown <len.brown@intel.com>
Cc: x86@kernel.org
2012-11-23 21:40:11 -05:00
Len Brown
9c63a650bb tools/power/x86/turbostat: share kernel MSR #defines
Now that turbostat is built in the kernel tree,
it can share MSR #defines with the kernel.

Signed-off-by: Len Brown <len.brown@intel.com>
Cc: x86@kernel.org
2012-11-23 21:40:04 -05:00
Robert Richter
1022623842 x86-32: Fix invalid stack address while in softirq
In 32 bit the stack address provided by kernel_stack_pointer() may
point to an invalid range causing NULL pointer access or page faults
while in NMI (see trace below). This happens if called in softirq
context and if the stack is empty. The address at &regs->sp is then
out of range.

Fixing this by checking if regs and &regs->sp are in the same stack
context. Otherwise return the previous stack pointer stored in struct
thread_info. If that address is invalid too, return address of regs.

 BUG: unable to handle kernel NULL pointer dereference at 0000000a
 IP: [<c1004237>] print_context_stack+0x6e/0x8d
 *pde = 00000000
 Oops: 0000 [#1] SMP
 Modules linked in:
 Pid: 4434, comm: perl Not tainted 3.6.0-rc3-oprofile-i386-standard-g4411a05 #4 Hewlett-Packard HP xw9400 Workstation/0A1Ch
 EIP: 0060:[<c1004237>] EFLAGS: 00010093 CPU: 0
 EIP is at print_context_stack+0x6e/0x8d
 EAX: ffffe000 EBX: 0000000a ECX: f4435f94 EDX: 0000000a
 ESI: f4435f94 EDI: f4435f94 EBP: f5409ec0 ESP: f5409ea0
  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
 CR0: 8005003b CR2: 0000000a CR3: 34ac9000 CR4: 000007d0
 DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
 DR6: ffff0ff0 DR7: 00000400
 Process perl (pid: 4434, ti=f5408000 task=f5637850 task.ti=f4434000)
 Stack:
  000003e8 ffffe000 00001ffc f4e39b00 00000000 0000000a f4435f94 c155198c
  f5409ef0 c1003723 c155198c f5409f04 00000000 f5409edc 00000000 00000000
  f5409ee8 f4435f94 f5409fc4 00000001 f5409f1c c12dce1c 00000000 c155198c
 Call Trace:
  [<c1003723>] dump_trace+0x7b/0xa1
  [<c12dce1c>] x86_backtrace+0x40/0x88
  [<c12db712>] ? oprofile_add_sample+0x56/0x84
  [<c12db731>] oprofile_add_sample+0x75/0x84
  [<c12ddb5b>] op_amd_check_ctrs+0x46/0x260
  [<c12dd40d>] profile_exceptions_notify+0x23/0x4c
  [<c1395034>] nmi_handle+0x31/0x4a
  [<c1029dc5>] ? ftrace_define_fields_irq_handler_entry+0x45/0x45
  [<c13950ed>] do_nmi+0xa0/0x2ff
  [<c1029dc5>] ? ftrace_define_fields_irq_handler_entry+0x45/0x45
  [<c13949e5>] nmi_stack_correct+0x28/0x2d
  [<c1029dc5>] ? ftrace_define_fields_irq_handler_entry+0x45/0x45
  [<c1003603>] ? do_softirq+0x4b/0x7f
  <IRQ>
  [<c102a06f>] irq_exit+0x35/0x5b
  [<c1018f56>] smp_apic_timer_interrupt+0x6c/0x7a
  [<c1394746>] apic_timer_interrupt+0x2a/0x30
 Code: 89 fe eb 08 31 c9 8b 45 0c ff 55 ec 83 c3 04 83 7d 10 00 74 0c 3b 5d 10 73 26 3b 5d e4 73 0c eb 1f 3b 5d f0 76 1a 3b 5d e8 73 15 <8b> 13 89 d0 89 55 e0 e8 ad 42 03 00 85 c0 8b 55 e0 75 a6 eb cc
 EIP: [<c1004237>] print_context_stack+0x6e/0x8d SS:ESP 0068:f5409ea0
 CR2: 000000000000000a
 ---[ end trace 62afee3481b00012 ]---
 Kernel panic - not syncing: Fatal exception in interrupt

V2:
* add comments to kernel_stack_pointer()
* always return a valid stack address by falling back to the address
  of regs

Reported-by: Yang Wei <wei.yang@windriver.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Robert Richter <robert.richter@amd.com>
Link: http://lkml.kernel.org/r/20120912135059.GZ8285@erda.amd.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: Jun Zhang <jun.zhang@intel.com>
2012-11-20 22:23:20 -08:00
David Howells
60606d4248 Merge branch 'x86-pre-uapi' into perf-uapi
David Howells (1):
      x86: Export asm/{svm.h,vmx.h,perf_regs.h}
2012-11-19 21:50:58 +00:00
Steven Rostedt
6c8d8b3c69 x86_32: Return actual stack when requesting sp from regs
As x86_32 traps do not save sp when taken in kernel mode, we need to
accommodate the sp when requesting to get the register.

This affects kprobes.

Before:

 # echo 'p:ftrace sys_read+4 s=%sp' > /debug/tracing/kprobe_events
 # echo 1 > /debug/tracing/events/kprobes/enable
 # cat trace
            sshd-1345  [000] d...   489.117168: ftrace: (sys_read+0x4/0x70) s=b7e96768
            sshd-1345  [000] d...   489.117191: ftrace: (sys_read+0x4/0x70) s=b7e96768
             cat-1447  [000] d...   489.117392: ftrace: (sys_read+0x4/0x70) s=5a7
             cat-1447  [001] d...   489.118023: ftrace: (sys_read+0x4/0x70) s=b77ad05f
            less-1448  [000] d...   489.118079: ftrace: (sys_read+0x4/0x70) s=b7762e06
            less-1448  [000] d...   489.118117: ftrace: (sys_read+0x4/0x70) s=b7764970

After:
            sshd-1352  [000] d...   362.348016: ftrace: (sys_read+0x4/0x70) s=f3febfa8
            sshd-1352  [000] d...   362.348048: ftrace: (sys_read+0x4/0x70) s=f3febfa8
            bash-1355  [001] d...   362.348081: ftrace: (sys_read+0x4/0x70) s=f5075fa8
            sshd-1352  [000] d...   362.348082: ftrace: (sys_read+0x4/0x70) s=f3febfa8
            sshd-1352  [000] d...   362.690950: ftrace: (sys_read+0x4/0x70) s=f3febfa8
            bash-1355  [001] d...   362.691033: ftrace: (sys_read+0x4/0x70) s=f5075fa8

Link: http://lkml.kernel.org/r/1342208654.30075.22.camel@gandalf.stny.rr.com

Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-11-19 05:31:37 -05:00
Yinghai Lu
c074eaac2a x86, mm: kill numa_64.h
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1353123563-3103-44-git-send-email-yinghai@kernel.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2012-11-17 11:59:47 -08:00
Yinghai Lu
94b43c3d86 x86, mm: kill numa_free_all_bootmem()
Now NO_BOOTMEM version free_all_bootmem_node() does not really
do free_bootmem at all, and it only call register_page_bootmem_info_node
instead.

That is confusing, try to kill that free_all_bootmem_node().

Before that, this patch will remove numa_free_all_bootmem().

That function could be replaced with register_page_bootmem_info() and
free_all_bootmem();

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1353123563-3103-43-git-send-email-yinghai@kernel.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2012-11-17 11:59:47 -08:00
Yinghai Lu
c8dcdb9ce4 x86, mm: Move function declaration into mm_internal.h
They are only for mm/init*.c.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1353123563-3103-34-git-send-email-yinghai@kernel.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2012-11-17 11:59:38 -08:00
Yinghai Lu
cf47065961 x86, mm: Move back pgt_buf_* to mm/init.c
Also change them to static.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1353123563-3103-31-git-send-email-yinghai@kernel.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2012-11-17 11:59:36 -08:00
Yinghai Lu
6f80b68e9e x86, mm, Xen: Remove mapping_pagetable_reserve()
Page table area are pre-mapped now after
	x86, mm: setup page table in top-down
	x86, mm: Remove early_memremap workaround for page table accessing on 64bit

mapping_pagetable_reserve is not used anymore, so remove it.

Also remove operation in mask_rw_pte(), as modified allow_low_page
always return pages that are already mapped, moreover
xen_alloc_pte_init, xen_alloc_pmd_init, etc, will mark the page RO
before hooking it into the pagetable automatically.

-v2: add changelog about mask_rw_pte() from Stefano.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1353123563-3103-27-git-send-email-yinghai@kernel.org
Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2012-11-17 11:59:26 -08:00
Yinghai Lu
9985b4c6fa x86, mm: Move min_pfn_mapped back to mm/init.c
Also change it to static.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1353123563-3103-26-git-send-email-yinghai@kernel.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2012-11-17 11:59:24 -08:00
Yinghai Lu
8d57470d8f x86, mm: setup page table in top-down
Get pgt_buf early from BRK, and use it to map PMD_SIZE from top at first.
Then use mapped pages to map more ranges below, and keep looping until
all pages get mapped.

alloc_low_page will use page from BRK at first, after that buffer is used
up, will use memblock to find and reserve pages for page table usage.

Introduce min_pfn_mapped to make sure find new pages from mapped ranges,
that will be updated when lower pages get mapped.

Also add step_size to make sure that don't try to map too big range with
limited mapped pages initially, and increase the step_size when we have
more mapped pages on hand.

We don't need to call pagetable_reserve anymore, reserve work is done
in alloc_low_page() directly.

At last we can get rid of calculation and find early pgt related code.

-v2: update to after fix_xen change,
     also use MACRO for initial pgt_buf size and add comments with it.
-v3: skip big reserved range in memblock.reserved near end.
-v4: don't need fix_xen change now.
-v5: add changelog about moving about reserving pagetable to alloc_low_page.

Suggested-by: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1353123563-3103-22-git-send-email-yinghai@kernel.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2012-11-17 11:59:19 -08:00
Jacob Shin
66520ebc2d x86, mm: Only direct map addresses that are marked as E820_RAM
Currently direct mappings are created for [ 0 to max_low_pfn<<PAGE_SHIFT )
and [ 4GB to max_pfn<<PAGE_SHIFT ), which may include regions that are not
backed by actual DRAM. This is fine for holes under 4GB which are covered
by fixed and variable range MTRRs to be UC. However, we run into trouble
on higher memory addresses which cannot be covered by MTRRs.

Our system with 1TB of RAM has an e820 that looks like this:

 BIOS-e820: [mem 0x0000000000000000-0x00000000000983ff] usable
 BIOS-e820: [mem 0x0000000000098400-0x000000000009ffff] reserved
 BIOS-e820: [mem 0x00000000000d0000-0x00000000000fffff] reserved
 BIOS-e820: [mem 0x0000000000100000-0x00000000c7ebffff] usable
 BIOS-e820: [mem 0x00000000c7ec0000-0x00000000c7ed7fff] ACPI data
 BIOS-e820: [mem 0x00000000c7ed8000-0x00000000c7ed9fff] ACPI NVS
 BIOS-e820: [mem 0x00000000c7eda000-0x00000000c7ffffff] reserved
 BIOS-e820: [mem 0x00000000fec00000-0x00000000fec0ffff] reserved
 BIOS-e820: [mem 0x00000000fee00000-0x00000000fee00fff] reserved
 BIOS-e820: [mem 0x00000000fff00000-0x00000000ffffffff] reserved
 BIOS-e820: [mem 0x0000000100000000-0x000000e037ffffff] usable
 BIOS-e820: [mem 0x000000e038000000-0x000000fcffffffff] reserved
 BIOS-e820: [mem 0x0000010000000000-0x0000011ffeffffff] usable

and so direct mappings are created for huge memory hole between
0x000000e038000000 to 0x0000010000000000. Even though the kernel never
generates memory accesses in that region, since the page tables mark
them incorrectly as being WB, our (AMD) processor ends up causing a MCE
while doing some memory bookkeeping/optimizations around that area.

This patch iterates through e820 and only direct maps ranges that are
marked as E820_RAM, and keeps track of those pfn ranges. Depending on
the alignment of E820 ranges, this may possibly result in using smaller
size (i.e. 4K instead of 2M or 1G) page tables.

-v2: move changes from setup.c to mm/init.c, also use for_each_mem_pfn_range
	instead.  - Yinghai Lu
-v3: add calculate_all_table_space_size() to get correct needed page table
	size. - Yinghai Lu
-v4: fix add_pfn_range_mapped() to get correct max_low_pfn_mapped when
     mem map does have hole under 4g that is found by Konard on xen
     domU with 8g ram. - Yinghai

Signed-off-by: Jacob Shin <jacob.shin@amd.com>
Link: http://lkml.kernel.org/r/1353123563-3103-16-git-send-email-yinghai@kernel.org
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2012-11-17 11:59:14 -08:00
Jacob Shin
dda56e1340 x86, mm: Fixup code testing if a pfn is direct mapped
Update code that previously assumed pfns [ 0 - max_low_pfn_mapped ) and
[ 4GB - max_pfn_mapped ) were always direct mapped, to now look up
pfn_mapped ranges instead.

-v2: change applying sequence to keep git bisecting working.
     so add dummy pfn_range_is_mapped(). - Yinghai Lu

Signed-off-by: Jacob Shin <jacob.shin@amd.com>
Link: http://lkml.kernel.org/r/1353123563-3103-12-git-send-email-yinghai@kernel.org
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2012-11-17 11:59:09 -08:00
Yinghai Lu
22ddfcaa0d x86, mm: Move init_memory_mapping calling out of setup.c
Now init_memory_mapping is called two times, later will be called for every
ram ranges.

Could put all related init_mem calling together and out of setup.c.

Actually, it reverts commit 1bbbbe7
    x86: Exclude E820_RESERVED regions and memory holes above 4 GB from direct mapping.
will address that later with complete solution include handling hole under 4g.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1353123563-3103-5-git-send-email-yinghai@kernel.org
Reviewed-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2012-11-17 11:59:03 -08:00
Yinghai Lu
fa62aafea9 x86, mm: Add global page_size_mask and probe one time only
Now we pass around use_gbpages and use_pse for calculating page table size,
Later we will need to call init_memory_mapping for every ram range one by one,
that mean those calculation will be done several times.

Those information are the same for all ram range and could be stored in
page_size_mask and could be probed it one time only.

Move that probing code out of init_memory_mapping into separated function
probe_page_size_mask(), and call it before all init_memory_mapping.

Suggested-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1353123563-3103-2-git-send-email-yinghai@kernel.org
Reviewed-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2012-11-17 11:59:00 -08:00
Alexander Duyck
7d74275d39 x86: Make it so that __pa_symbol can only process kernel symbols on x86_64
I submitted an earlier patch that make __phys_addr an inline.  This obviously
results in an increase in the code size.  One step I can take to reduce that
is to make it so that the __pa_symbol call does a direct translation for
kernel addresses instead of covering all of virtual memory.

On my system this reduced the size for __pa_symbol from 5 instructions
totalling 30 bytes to 3 instructions totalling 16 bytes.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Link: http://lkml.kernel.org/r/20121116215356.8521.92472.stgit@ahduyck-cp1.jf.intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2012-11-16 16:42:09 -08:00
Alexander Duyck
0bdf525f04 x86: Improve __phys_addr performance by making use of carry flags and inlining
This patch is meant to improve overall system performance when making use of
the __phys_addr call.  To do this I have implemented several changes.

First if CONFIG_DEBUG_VIRTUAL is not defined __phys_addr is made an inline,
similar to how this is currently handled in 32 bit.  However in order to do
this it is required to export phys_base so that it is available if __phys_addr
is used in kernel modules.

The second change was to streamline the code by making use of the carry flag
on an add operation instead of performing a compare on a 64 bit value.  The
advantage to this is that it allows us to significantly reduce the overall
size of the call.  On my Xeon E5 system the entire __phys_addr inline call
consumes a little less than 32 bytes and 5 instructions.  I also applied
similar logic to the debug version of the function.  My testing shows that the
debug version of the function with this patch applied is slightly faster than
the non-debug version without the patch.

Finally I also applied the same logic changes to __virt_addr_valid since it
used the same general code flow as __phys_addr and could achieve similar gains
though these changes.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Link: http://lkml.kernel.org/r/20121116215315.8521.46270.stgit@ahduyck-cp1.jf.intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2012-11-16 16:42:08 -08:00
Alexander Duyck
fb50b020c5 x86: Move some contents of page_64_types.h into pgtable_64.h and page_64.h
This patch is meant to clean-up the fact that we have several functions in
page_64_types.h which really don't belong there.  I found this issue when I
had tried to replace __phys_addr with an inline function.  It resulted in the
realmode bits generating compile warnings about types.  In order to resolve
that I am relocating the address translation to page_64.h since this is in
keeping with where these functions are located in 32 bit.

In addtion I have relocated several functions defined in init_64.c to
pgtable_64.h as this seems to be where most of the functions related to
memory initialization were already located.

[ hpa: added missing #include <asm/pgtable.h> to apic_numachip.c,
  as reported by Yinghai Lu. ]

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Link: http://lkml.kernel.org/r/20121116215244.8521.31505.stgit@ahduyck-cp1.jf.intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Daniel J Blueman <daniel@numascale-asia.com>
2012-11-16 16:40:34 -08:00
Fenghua Yu
a71c8bc5df x86, topology: Debug CPU0 hotplug
CONFIG_DEBUG_HOTPLUG_CPU0 is for debugging the CPU0 hotplug feature. The switch
offlines CPU0 as soon as possible and boots userspace up with CPU0 offlined.
User can online CPU0 back after boot time. The default value of the switch is
off.

To debug CPU0 hotplug, you need to enable CPU0 offline/online feature by either
turning on CONFIG_BOOTPARAM_HOTPLUG_CPU0 during compilation or giving
cpu0_hotplug kernel parameter at boot.

It's safe and early place to take down CPU0 after all hotplug notifiers
are installed and SMP is booted.

Please note that some applications or drivers, e.g. some versions of udevd,
during boot time may put CPU0 online again in this CPU0 hotplug debug mode.

In this debug mode, setup_local_APIC() may report a warning on max_loops<=0
when CPU0 is onlined back after boot time. This is because pending interrupt in
IRR can not move to ISR. The warning is not CPU0 specfic and it can happen on
other CPUs as well. It is harmless except the first CPU0 online takes a bit
longer time. And so this debug mode is useful to expose this issue. I'll send
a seperate patch to fix this generic warning issue.

Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Link: http://lkml.kernel.org/r/1352835171-3958-15-git-send-email-fenghua.yu@intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2012-11-14 15:28:11 -08:00
Fenghua Yu
e1c467e690 x86, hotplug: Wake up CPU0 via NMI instead of INIT, SIPI, SIPI
Instead of waiting for STARTUP after INITs, BSP will execute the BIOS boot-strap
code which is not a desired behavior for waking up BSP. To avoid the boot-strap
code, wake up CPU0 by NMI instead.

This works to wake up soft offlined CPU0 only. If CPU0 is hard offlined (i.e.
physically hot removed and then hot added), NMI won't wake it up. We'll change
this code in the future to wake up hard offlined CPU0 if real platform and
request are available.

AP is still waken up as before by INIT, SIPI, SIPI sequence.

Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Link: http://lkml.kernel.org/r/1352896613-25957-1-git-send-email-fenghua.yu@intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2012-11-14 15:28:03 -08:00
Mika Westerberg
06f64c8f23 driver core / ACPI: Move ACPI support to core device and driver types
With ACPI 5 we are starting to see devices that don't natively support
discovery but can be enumerated with the help of the ACPI namespace.
Typically, these devices can be represented in the Linux device driver
model as platform devices or some serial bus devices, like SPI or I2C
devices.

Since we want to re-use existing drivers for those devices, we need a
way for drivers to specify the ACPI IDs of supported devices, so that
they can be matched against device nodes in the ACPI namespace.  To
this end, it is sufficient to add a pointer to an array of supported
ACPI device IDs, that can be provided by the driver, to struct device.

Moreover, things like ACPI power management need to have access to
the ACPI handle of each supported device, because that handle is used
to invoke AML methods associated with the corresponding ACPI device
node.  The ACPI handles of devices are now stored in the archdata
member structure of struct device whose definition depends on the
architecture and includes the ACPI handle only on x86 and ia64. Since
the pointer to an array of supported ACPI IDs is added to struct
device_driver in an architecture-independent way, it is logical to
move the ACPI handle from archdata to struct device itself at the same
time.  This also makes code more straightforward in some places and
follows the example of Device Trees that have a poiter to struct
device_node in there too.

This changeset is based on Mika Westerberg's work.

Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Acked-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2012-11-15 00:28:00 +01:00
Fenghua Yu
30106c1743 x86, hotplug: Support functions for CPU0 online/offline
Add smp_store_boot_cpu_info() to store cpu info for BSP during boot time.

Now smp_store_cpu_info() stores cpu info for bringing up BSP or AP after
it's offline.

Continue to online CPU0 in native_cpu_up().

Continue to offline CPU0 in native_cpu_disable().

Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Link: http://lkml.kernel.org/r/1352835171-3958-5-git-send-email-fenghua.yu@intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2012-11-14 09:39:48 -08:00
David Sharp
8be0709f10 tracing: Format non-nanosec times from tsc clock without a decimal point.
With the addition of the "tsc" clock, formatting timestamps to look like
fractional seconds is misleading. Mark clocks as either in nanoseconds or
not, and format non-nanosecond timestamps as decimal integers.

Tested:
$ cd /sys/kernel/debug/tracing/
$ cat trace_clock
[local] global tsc
$ echo sched_switch > set_event
$ echo 1 > tracing_on ; sleep 0.0005 ; echo 0 > tracing_on
$ cat trace
          <idle>-0     [000]  6330.555552: sched_switch: prev_comm=swapper prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=bash next_pid=29964 next_prio=120
           sleep-29964 [000]  6330.555628: sched_switch: prev_comm=bash prev_pid=29964 prev_prio=120 prev_state=S ==> next_comm=swapper next_pid=0 next_prio=120
  ...
$ echo 1 > options/latency-format
$ cat trace
  <idle>-0       0 4104553247us+: sched_switch: prev_comm=swapper prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=bash next_pid=29964 next_prio=120
   sleep-29964   0 4104553322us+: sched_switch: prev_comm=bash prev_pid=29964 prev_prio=120 prev_state=S ==> next_comm=swapper next_pid=0 next_prio=120
  ...
$ echo tsc > trace_clock
$ cat trace
$ echo 1 > tracing_on ; sleep 0.0005 ; echo 0 > tracing_on
$ echo 0 > options/latency-format
$ cat trace
          <idle>-0     [000] 16490053398357: sched_switch: prev_comm=swapper prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=bash next_pid=31128 next_prio=120
           sleep-31128 [000] 16490053588518: sched_switch: prev_comm=bash prev_pid=31128 prev_prio=120 prev_state=S ==> next_comm=swapper next_pid=0 next_prio=120
  ...
echo 1 > options/latency-format
$ cat trace
  <idle>-0       0 91557653238+: sched_switch: prev_comm=swapper prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=bash next_pid=31128 next_prio=120
   sleep-31128   0 91557843399+: sched_switch: prev_comm=bash prev_pid=31128 prev_prio=120 prev_state=S ==> next_comm=swapper next_pid=0 next_prio=120
  ...

v2:
Move arch-specific bits out of generic code.
v4:
Fix x86_32 build due to 64-bit division.

Google-Bug-Id: 6980623
Link: http://lkml.kernel.org/r/1352837903-32191-2-git-send-email-dhsharp@google.com

Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: David Sharp <dhsharp@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-11-13 15:48:40 -05:00
David Sharp
8cbd9cc625 tracing,x86: Add a TSC trace_clock
In order to promote interoperability between userspace tracers and ftrace,
add a trace_clock that reports raw TSC values which will then be recorded
in the ring buffer. Userspace tracers that also record TSCs are then on
exactly the same time base as the kernel and events can be unambiguously
interlaced.

Tested: Enabled a tracepoint and the "tsc" trace_clock and saw very large
timestamp values.

v2:
Move arch-specific bits out of generic code.
v3:
Rename "x86-tsc", cleanups
v7:
Generic arch bits in Kbuild.

Google-Bug-Id: 6980623
Link: http://lkml.kernel.org/r/1352837903-32191-1-git-send-email-dhsharp@google.com

Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@linux.intel.com>
Signed-off-by: David Sharp <dhsharp@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-11-13 15:48:27 -05:00
Andreas Herrmann
04a1541828 x86, cacheinfo: Determine number of cache leafs using CPUID 0x8000001d on AMD
CPUID 0x8000001d works quite similar to Intels' CPUID function 4.
Use it to determine number of cache leafs.

Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Link: http://lkml.kernel.org/r/20121019085933.GE26718@alberich
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2012-11-13 11:22:29 -08:00
Andreas Herrmann
193f3fcb3a x86: Add cpu_has_topoext
Introduce cpu_has_topoext to check for AMD's CPUID topology extensions
support. It indicates support for
CPUID Fn8000_001D_EAX_x[N:0]-CPUID Fn8000_001E_EDX

See AMD's CPUID Specification, Publication # 25481
(as of Rev. 2.34 September 2010)

Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Link: http://lkml.kernel.org/r/20121019085813.GD26718@alberich
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2012-11-13 11:22:28 -08:00
Linus Torvalds
0020dd0b8c Bug-fixes:
* Fix compile issues on ARM.
  * Fix hypercall fallback code for old hypervisors.
  * Print out which HVM parameter failed if it fails.
  * Fix idle notifier call after irq_enter.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQEcBAABAgAGBQJQnQdGAAoJEFjIrFwIi8fJPBAIAMX1HRx3udqhv7fziynZvFTb
 hj47XYIJHOK7P4fK7vZoSNgMHjL6LW5cUqC8VN67G3zUSkX9JYFsPBj6v4bWn+rG
 b9CS+MW7hS80LGbbqkh1F+YSEfZ863RlF9PPX2acaHTw49MlIgIqwhxIo6hy+Nm6
 thu6SlbEIJkSUdhbYMOAmy5aH/3+UuuQg+oq3P7mzV8fZjEihnrrF0NlT4wOZK1o
 gsfrKYKJLVT526W9PF/L23/A/MCHMpvjNStpaDLOGNjV9sBMpJI8JRax6+657+q1
 0kXvN5mAwTKWOaXBl4LEC9R8n1IKB91TgOY6HJAcXkb1eoP5KAeNSmU8RbsZ2T0=
 =XZ+0
 -----END PGP SIGNATURE-----

Merge tag 'stable/for-linus-3.7-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen

Pull Xen fixes from Konrad Rzeszutek Wilk:
 "There are three ARM compile fixes (we forgot to export certain
  functions and if the drivers are built as an module - we go belly-up).

  There is also an mismatch of irq_enter() / exit_idle() calls sequence
  which were fixed some time ago in other piece of codes, but failed to
  appear in the Xen code.

  Lastly a fix for to help in the field with troubleshooting in case we
  cannot get the appropriate parameter and also fallback code when
  working with very old hypervisors."

Bug-fixes:
 - Fix compile issues on ARM.
 - Fix hypercall fallback code for old hypervisors.
 - Print out which HVM parameter failed if it fails.
 - Fix idle notifier call after irq_enter.

* tag 'stable/for-linus-3.7-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
  xen/arm: Fix compile errors when drivers are compiled as modules (export more).
  xen/arm: Fix compile errors when drivers are compiled as modules.
  xen/generic: Disable fallback build on ARM.
  xen/events: fix RCU warning, or Call idle notifier after irq_enter()
  xen/hvm: If we fail to fetch an HVM parameter print out which flag it is.
  xen/hypercall: fix hypercall fallback code for very old hypervisors
2012-11-10 06:56:21 +01:00
Jussi Kivilinna
cf582cceda crypto: camellia-x86_64 - share common functions and move structures and function definitions to header file
Prepare camellia-x86_64 functions to be reused from AVX/AESNI implementation
module.

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2012-11-09 17:32:31 +08:00
David Howells
6d369a09cc x86: Export asm/{svm.h,vmx.h,perf_regs.h}
Export asm/{svm.h,vmx.h,perf_regs.h} so that they can be disintegrated.

It looks from previous commits that the first two should have been exported,
but the header-y lines weren't added to the Kbuild.

I'm guessing that asm/perf_regs.h should be exported too.

Signed-off-by: David Howells <dhowells@redhat.com>
2012-11-08 11:38:44 +00:00
Jan Beulich
cf47a83fb0 xen/hypercall: fix hypercall fallback code for very old hypervisors
While copying the argument structures in HYPERVISOR_event_channel_op()
and HYPERVISOR_physdev_op() into the local variable is sufficiently
safe even if the actual structure is smaller than the container one,
copying back eventual output values the same way isn't: This may
collide with on-stack variables (particularly "rc") which may change
between the first and second memcpy() (i.e. the second memcpy() could
discard that change).

Move the fallback code into out-of-line functions, and handle all of
the operations known by this old a hypervisor individually: Some don't
require copying back anything at all, and for the rest use the
individual argument structures' sizes rather than the container's.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
[v2: Reduce #define/#undef usage in HYPERVISOR_physdev_op_compat().]
[v3: Fix compile errors when modules use said hypercalls]
[v4: Add xen_ prefix to the HYPERCALL_..]
[v5: Alter the name and only EXPORT_SYMBOL_GPL one of them]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-11-04 10:40:42 -05:00
Linus Torvalds
66b6a0c979 Bug-fixes:
* Use appropriate macros instead of hand-rolling our own (ARM).
  * Fixes if FB/KBD closed unexpectedly.
  * Fix memory leak in /dev/gntdev ioctl calls.
  * Fix overflow check in xenbus_file_write.
  * Document cleanup.
  * Performance optimization when migrating guests.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQEcBAABAgAGBQJQk9ngAAoJEFjIrFwIi8fJXOcH/jEmTaV2rbUCCnnivlQGj5B2
 AAXt03MM2F7Ohifo8IEHDhvJlUqQnglQq4wcku/8X/bqSkxtqJMfa/UAStmS2e6r
 605msiMws/GKiDPgKywWHjMPk7JJow/T7du9mpT2Swla12+DXc7e0P6Sqm6qGtB5
 tCBFYe3CS+j8Xi/siPhveAoLoDVmC8RpNzV8EWBdUKhNeD6U4s5M3+ChVexOrB/6
 43YkzurkY/FOsP+8YhNnKFSFrpYleRB1GdFcr8PN5mv85sNKts7vHCb4qJFzZdbk
 BMImdLrTUnKArE4y4FS0iqabOTGXaUplEXfyxDw5hweESGa1qzrd29ocyMQ5p/U=
 =LQxc
 -----END PGP SIGNATURE-----

Merge tag 'stable/for-linus-3.7-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen

Pull Xen bugfixes from Konrad Rzeszutek Wilk:
 - Use appropriate macros instead of hand-rolling our own (ARM).
 - Fixes if FB/KBD closed unexpectedly.
 - Fix memory leak in /dev/gntdev ioctl calls.
 - Fix overflow check in xenbus_file_write.
 - Document cleanup.
 - Performance optimization when migrating guests.

* tag 'stable/for-linus-3.7-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
  xen/mmu: Use Xen specific TLB flush instead of the generic one.
  xen/arm: use the __HVC macro
  xen/xenbus: fix overflow check in xenbus_file_write()
  xen-kbdfront: handle backend CLOSED without CLOSING
  xen-fbfront: handle backend CLOSED without CLOSING
  xen/gntdev: don't leak memory from IOCTL_GNTDEV_MAP_GRANT_REF
  x86: remove obsolete comment from asm/xen/hypervisor.h
2012-11-02 13:26:11 -07:00
Suresh Siddha
279f146143 x86: apic: Use tsc deadline for oneshot when available
If the TSC deadline mode is supported, LAPIC timer one-shot mode can be
implemented using IA32_TSC_DEADLINE MSR. An interrupt will be generated
when the TSC value equals or exceeds the value in the IA32_TSC_DEADLINE
MSR.

This enables us to skip the APIC calibration during boot. Also, in
xapic mode, this enables us to skip the uncached apic access to re-arm
the APIC timer.

As this timer ticks at the high frequency TSC rate, we use the
TSC_DIVISOR (32) to work with the 32-bit restrictions in the
clockevent API's to avoid 64-bit divides etc (frequency is u32 and
"unsigned long" in the set_next_event(), max_delta limits the next
event to 32-bit for 32-bit kernel).

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: venki@google.com
Cc: len.brown@intel.com
Link: http://lkml.kernel.org/r/1350941878.6017.31.camel@sbsiddha-desk.sc.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-11-02 11:23:37 +01:00
Olaf Hering
b6514633bd x86: remove obsolete comment from asm/xen/hypervisor.h
Signed-off-by: Olaf Hering <olaf@aepfle.de>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-10-30 09:27:32 -04:00
Matt Fleming
185034e72d x86, efi: 1:1 pagetable mapping for virtual EFI calls
Some firmware still needs a 1:1 (virt->phys) mapping even after we've
called SetVirtualAddressMap(). So install the mapping alongside our
existing kernel mapping whenever we make EFI calls in virtual mode.

This bug was discovered on ASUS machines where the firmware
implementation of GetTime() accesses the RTC device via physical
addresses, even though that's bogus per the UEFI spec since we've
informed the firmware via SetVirtualAddressMap() that the boottime
memory map is no longer valid.

This bug seems to be present in a lot of consumer devices, so there's
not a lot we can do about this spec violation apart from workaround
it.

Cc: JérômeCarretero <cJ-ko@zougloub.eu>
Cc: Vasco Dias <rafa.vasco@gmail.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
2012-10-30 10:39:19 +00:00