Commit graph

448536 commits

Author SHA1 Message Date
Al Viro
046b961b45 shrink_dentry_list(): take parent's ->d_lock earlier
The cause of livelocks there is that we are taking ->d_lock on
dentry and its parent in the wrong order, forcing us to use
trylock on the parent's one.  d_walk() takes them in the right
order, and unfortunately it's not hard to create a situation
when shrink_dentry_list() can't make progress since trylock
keeps failing, and shrink_dcache_parent() or check_submounts_and_drop()
keeps calling d_walk() disrupting the very shrink_dentry_list() it's
waiting for.

Solution is straightforward - if that trylock fails, let's unlock
the dentry itself and take locks in the right order.  We need to
stabilize ->d_parent without holding ->d_lock, but that's doable
using RCU.  And we'd better do that in the very beginning of the
loop in shrink_dentry_list(), since the checks on refcount, etc.
would need to be redone anyway.

That deals with a half of the problem - killing dentries on the
shrink list itself.  Another one (dropping their parents) is
in the next commit.

locking parent is interesting - it would be easy to do rcu_read_lock(),
lock whatever we think is a parent, lock dentry itself and check
if the parent is still the right one.  Except that we need to check
that *before* locking the dentry, or we are risking taking ->d_lock
out of order.  Fortunately, once the D1 is locked, we can check if
D2->d_parent is equal to D1 without the need to lock D2; D2->d_parent
can start or stop pointing to D1 only under D1->d_lock, so taking
D1->d_lock is enough.  In other words, the right solution is
rcu_read_lock/lock what looks like parent right now/check if it's
still our parent/rcu_read_unlock/lock the child.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-30 11:03:21 -04:00
Jens Axboe
67aec14ce8 blk-mq: make the sysfs mq/ layout reflect current mappings
Currently blk-mq registers all the hardware queues in sysfs,
regardless of whether it uses them (e.g. they have CPU mappings)
or not. The unused hardware queues lack the cpux/ directories,
and the other sysfs entries (like active, pending, etc) are all
zeroes.

Change this so that sysfs correctly reflects the current mappings
of the hardware queues.

Signed-off-by: Jens Axboe <axboe@fb.com>
2014-05-30 08:25:36 -06:00
Ming Lei
e8edca6f7f block: virtio_blk: don't hold spin lock during world switch
Firstly, it isn't necessary to hold lock of vblk->vq_lock
when notifying hypervisor about queued I/O.

Secondly, virtqueue_notify() will cause world switch and
it may take long time on some hypervisors(such as, qemu-arm),
so it isn't good to hold the lock and block other vCPUs.

On arm64 quad core VM(qemu-kvm), the patch can increase I/O
performance a lot with VIRTIO_RING_F_EVENT_IDX enabled:
	- without the patch: 14K IOPS
	- with the patch: 34K IOPS

fio script:
	[global]
	direct=1
	bsrange=4k-4k
	timeout=10
	numjobs=4
	ioengine=libaio
	iodepth=64

	filename=/dev/vdc
	group_reporting=1

	[f1]
	rw=randread

Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: virtualization@lists.linux-foundation.org
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: stable@kernel.org # 3.13+
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-05-30 08:19:39 -06:00
Jens Axboe
f89ca16646 Merge branch 'for-3.16/core' into for-3.16/drivers
Pulled in for the blk_mq_tag_to_rq() change, which impacts
mtip32xx.

Signed-off-by: Jens Axboe <axboe@fb.com>
2014-05-30 08:11:50 -06:00
Shaohua Li
2230237500 blk-mq: blk_mq_tag_to_rq should handle flush request
flush request is special, which borrows the tag from the parent
request. Hence blk_mq_tag_to_rq needs special handling to return
the flush request from the tag.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-05-30 08:06:42 -06:00
Zhang Rui
48459340b9 ACPI / scan: use platform bus type by default for _HID enumeration
Because of the growing demand for enumerating ACPI devices to
platform bus, change the code to enumerate ACPI device objects to
platform bus by default.  Namely, create platform devices for the
ACPI device objects that
 1. Have pnp.type.platform_id set (device objects with _HID currently).
 2. Do not have a scan handler attached.
 3. Are not SPI/I2C slave devices (that should be enumerated to the
    appropriate buses bus by their parent).

Signed-off-by: Zhang Rui <rui.zhang@intel.com>
[rjw: Subject and changelog, rebase and code cleanup]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
2014-05-30 16:04:37 +02:00
Rafael J. Wysocki
d6ddaaac8f ACPI / scan: always register ACPI LPSS scan handler
Prevent platform devices from being created for ACPI LPSS devices
if CONFIG_X86_INTEL_LPSS is unset by compiling out the LPSS scan
handler's callbacks only in that case and still compiling its device
ID list in and registering the scan handler in either case.

This change is based on a prototype from Zhang Rui.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
2014-05-30 16:04:36 +02:00
Rafael J. Wysocki
cccd420859 ACPI / scan: always register memory hotplug scan handler
Prevent platform devices from being created for ACPI memory device
objects if CONFIG_ACPI_HOTPLUG_MEMORY is unset by compiling out the
memory hotplug scan handler's callbacks only in that case and still
compiling its device ID list in and registering the scan handler in
either case.

Also unset the memory hotplug scan handler's .attach() callback
if acpi_no_memhotplug is set, but still register the scan handler to
avoid creating platform devices for ACPI memory devices in that case
too.

This change is based on a prototype from Zhang Rui.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
2014-05-30 16:04:36 +02:00
Rafael J. Wysocki
a1ec657213 ACPI / scan: always register container scan handler
Prevent platform devices from being created for ACPI containers
if CONFIG_ACPI_CONTAINER is unset by compiling out the container
scan handler's callbacks only in that case and still compiling
its device ID list in and registering the scan handler in either
case.

This change is based on a prototype from Zhang Rui.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
2014-05-30 16:04:36 +02:00
Rafael J. Wysocki
d34afa9de4 ACPI / scan: Change the meaning of missing .attach() in scan handlers
Currently, some scan handlers can be compiled out entirely, which
leaves the device objects they normally attach to without a scan
handler.  This isn't a problem as long as we don't have any default
enumeration mechanism that applies to all devices without a scan
handler.  However, if such a default enumeration is added, it still
should not be applied to devices that are normally attached to by
scan handlers, because that may result in creating "physical" device
objects of a wrong type for them.

Since we are going to create platform device objects for all ACPI
device objects with pnp.type.platform_id set by default, clear
pnp.type.platform_id where there is a matching scan handler without
an .attach() callback and otherwise simply treat that scan handler
as though the .attach() callback was present but always returned 0.

This will allow us to compile out scan handler callbacks and leave
the device ID lists used by them so as to prevent creating platform
device objects for the matching ACPI devices.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
2014-05-30 16:04:36 +02:00
Rafael J. Wysocki
549e68455c ACPI / scan: introduce platform_id device PNP type flag
Only certain types of ACPI device objects can be enumerated as
platform devices, so in order to distinguish them from the others
introduce a new ACPI device PNP type flag, platform_id, and set it
for devices with a valid _HID to start with.

This change is based on a Zhang Rui's prototype.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
2014-05-30 16:04:36 +02:00
Zhang Rui
01cd4bac8c ACPI / scan: drop unsupported serial IDs from PNP ACPI scan handler ID list
The "serial" PNP driver supports some "unknown" PNP modems
(PNPCXXX/PNPDXXX) by matching magic strings in the PNP device name
or the PNP device card name.

ACPI enumerated PNP devices neither are PNP cards, nor have those
magic strings in device names, so this mechamism never actually works
for ACPI enumerated PNPCXXX/PNPDXXX devices.

Consequently, it is safe to remove those two IDs from the PNP ACPI scan
handler's device ID list.

Signed-off-by: Zhang Rui <rui.zhang@intel.com>
[rjw: Subject and changelog]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
2014-05-30 16:04:36 +02:00
Zhang Rui
e1d2c4cf0b ACPI / scan: drop IDs that do not comply with the ACPI PNP ID rule
The PNP ACPI scan handler device ID list includes all the IDs from
all of the struct pnp_device_id instances in the tree, but some of
them do not follow the ACPI PNP ID rule (3 letters + 4 hex digits).

For those IDs, the coressponding devices will never be enumerated
via ACPI, so it is safe to remove them from the PNP ACPI ID list.

Signed-off-by: Zhang Rui <rui.zhang@intel.com>
[rjw: Subject and changelog]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
2014-05-30 16:04:36 +02:00
Zhang Rui
eec15edbb0 ACPI / PNP: use device ID list for PNPACPI device enumeration
ACPI can be used to enumerate PNP devices, but the code does not
handle this in the right way currently.  Namely, if an ACPI device
object
 1. Has a _CRS method,
 2. Has an identification of
    "three capital characters followed by four hex digits",
 3. Is not in the excluded IDs list,
it will be enumerated to PNP bus (that is, a PNP device object will
be create for it).  This means that, actually, the PNP bus type is
used as the default bus type for enumerating _HID devices in ACPI.

However, more and more _HID devices need to be enumerated to the
platform bus instead (that is, platform device objects need to be
created for them).  As a result, the device ID list in acpi_platform.c
is used to enforce creating platform device objects rather than PNP
device objects for matching devices.  That list has been continuously
growing recently, unfortunately, and it is pretty much guaranteed to
grow even more in the future.

To address that problem it is better to enumerate _HID devices
as platform devices by default.  To this end, change the way of
enumerating PNP devices by adding a PNP ACPI scan handler that
will use a device ID list to create PNP devices for the ACPI
device objects whose device IDs are present in that list.

The initial device ID list in the PNP ACPI scan handler contains
all of the pnp_device_id strings from all the existing PNP drivers,
so this change should be transparent to the PNP core and all of the
PNP drivers.  Still, in the future it should be possible to reduce
its size by converting PNP drivers that need not be PNP for any
technical reasons into platform drivers.

Signed-off-by: Zhang Rui <rui.zhang@intel.com>
[rjw: Rewrote the changelog, modified the PNP ACPI scan handler code]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
2014-05-30 16:04:35 +02:00
Rafael J. Wysocki
aca0a4eb4e ACPI / scan: .match() callback for ACPI scan handlers
Introduce a .match() callback for ACPI scan handlers to allow them to
use more elaborate matching algorithms if necessary.  That is needed
for the upcoming PNP scan handler in particular.

This change is based on a Zhang Rui's prototype.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
2014-05-30 16:04:35 +02:00
Rafael J. Wysocki
52d1d0b12b Merge branch 'acpi-lpss' into acpi-enumeration 2014-05-30 15:55:19 +02:00
Takashi Iwai
112cddcada ALSA: firewire: Fix dependency on PCM and rawmidi
Now snd-firewire-lib supports rawmidi in addition to PCM, thus we need
to give a proper dependency.  For fixing and simplification, move the
selections of SND_PCM and SND_RAWMIDI into SND_FIREWIRE_LIB section.
Then each driver doesn't have to select them but only
SND_FIREWIRE_LIB.

Reported-by: Jim Davis <jim.epost@gmail.com>
Reviewed-by: Takashi Sakamoto <o-takashi@sakamocchi.jp>
Tested-by: Takashi Sakamoto <o-takashi@sakamocchi.jp>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
2014-05-30 15:22:06 +02:00
Paolo Bonzini
53ea2e462e Patch queue for ppc - 2014-05-30
In this round we have a few nice gems. PR KVM gains initial POWER8 support
 as well as LE host awareness, ihe e500 targets can now properly run u-boot,
 LE guests now work with PR KVM including KVM hypercalls and HV KVM guests
 can now use huge pages.
 
 On top of this there are some bug fixes.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.19 (GNU/Linux)
 
 iQIcBAABAgAGBQJTiHxMAAoJECszeR4D/txgGggQAL5rYpFkuU8+KiLd7uX4DasT
 DYrKxgOX/G8WRI0OJZpZHGdcw4fefBxHjY2ydnCSDk+EH2IPrlF6+aYnCwoarj3I
 L2Ji/OEfFZVgQtqNkqbuG5Z8onmrgZ86y5TFZKiMcDZe3zLnuGIP4fCLWEeWalor
 JQ9fE0z4WFkCtUD18/S/Cf6bOniGLdphUrmgECo+4SbSxUbixrckBEwuA7TjWure
 VrA0UX35l+VoVtYntKrQqYzsf+QSNWPzcDQnhlso+rz5ap6sB6rlC0PO8SHTPjk4
 BXvHVvripVss7mGv7vZPizp52WXAFsNnVPj1tNSCu3Tn+gRBB8ChCV8xHIWYex8a
 nz8AvdUq1TdWmSy7NwJODjJfclauTVtrhPvrbLNSN3ksSzx4eLh8i+Lf8qe0SW7i
 4oTfxG+PSDlF/ub8KX8BVfM0FpyvHdcR1h1ex3szPAQfiaAy+/ce9c3bB0xzXmLx
 sN3GX0BocdWejMo1U8+/+CVUngtol81BAb+k4GB5mPjW08u5jFVZu+XwxhK2ACx3
 y3hAuaGd1d0YuTztYav7+wuu5gPwhYSA3/tmTR1+0jx9uCQbPpOyPJmprt/9A2DP
 TY0T0I3TUJqWZAtHOsY0P0hCGl+gZQJb8KhjVKHC6t6/ab3grMkpHgp9XDVvKpHJ
 u9lPDCuQRyJBSy9WI2Qa
 =t4YN
 -----END PGP SIGNATURE-----

Merge tag 'signed-kvm-ppc-next' of git://github.com/agraf/linux-2.6 into kvm-next

Patch queue for ppc - 2014-05-30

In this round we have a few nice gems. PR KVM gains initial POWER8 support
as well as LE host awareness, ihe e500 targets can now properly run u-boot,
LE guests now work with PR KVM including KVM hypercalls and HV KVM guests
can now use huge pages.

On top of this there are some bug fixes.

Conflicts:
	include/uapi/linux/kvm.h
2014-05-30 14:51:40 +02:00
Alexander Graf
d8d164a985 KVM: PPC: Book3S PR: Rework SLB switching code
On LPAR guest systems Linux enables the shadow SLB to indicate to the
hypervisor a number of SLB entries that always have to be available.

Today we go through this shadow SLB and disable all ESID's valid bits.
However, pHyp doesn't like this approach very much and honors us with
fancy machine checks.

Fortunately the shadow SLB descriptor also has an entry that indicates
the number of valid entries following. During the lifetime of a guest
we can just swap that value to 0 and don't have to worry about the
SLB restoration magic.

While we're touching the code, let's also make it more readable (get
rid of rldicl), allow it to deal with a dynamic number of bolted
SLB entries and only do shadow SLB swizzling on LPAR systems.

Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30 14:26:30 +02:00
Alexander Graf
207438d4e2 KVM: PPC: Book3S PR: Use SLB entry 0
We didn't make use of SLB entry 0 because ... of no good reason. SLB entry 0
will always be used by the Linux linear SLB entry, so the fact that slbia
does not invalidate it doesn't matter as we overwrite SLB 0 on exit anyway.

Just enable use of SLB entry 0 for our shadow SLB code.

Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30 14:26:30 +02:00
Paul Mackerras
000a25ddb7 KVM: PPC: Book3S HV: Fix machine check delivery to guest
The code that delivered a machine check to the guest after handling
it in real mode failed to load up r11 before calling kvmppc_msr_interrupt,
which needs the old MSR value in r11 so it can see the transactional
state there.  This adds the missing load.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30 14:26:29 +02:00
Paul Mackerras
9bc01a9bc7 KVM: PPC: Book3S HV: Work around POWER8 performance monitor bugs
This adds workarounds for two hardware bugs in the POWER8 performance
monitor unit (PMU), both related to interrupt generation.  The effect
of these bugs is that PMU interrupts can get lost, leading to tools
such as perf reporting fewer counts and samples than they should.

The first bug relates to the PMAO (perf. mon. alert occurred) bit in
MMCR0; setting it should cause an interrupt, but doesn't.  The other
bug relates to the PMAE (perf. mon. alert enable) bit in MMCR0.
Setting PMAE when a counter is negative and counter negative
conditions are enabled to cause alerts should cause an alert, but
doesn't.

The workaround for the first bug is to create conditions where a
counter will overflow, whenever we are about to restore a MMCR0
value that has PMAO set (and PMAO_SYNC clear).  The workaround for
the second bug is to freeze all counters using MMCR2 before reading
MMCR0.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30 14:26:29 +02:00
Paul Mackerras
6c576e74fd KVM: PPC: Book3S HV: Make sure we don't miss dirty pages
Current, when testing whether a page is dirty (when constructing the
bitmap for the KVM_GET_DIRTY_LOG ioctl), we test the C (changed) bit
in the HPT entries mapping the page, and if it is 0, we consider the
page to be clean.  However, the Power ISA doesn't require processors
to set the C bit to 1 immediately when writing to a page, and in fact
allows them to delay the writeback of the C bit until they receive a
TLB invalidation for the page.  Thus it is possible that the page
could be dirty and we miss it.

Now, if there are vcpus running, this is not serious since the
collection of the dirty log is racy already - some vcpu could dirty
the page just after we check it.  But if there are no vcpus running we
should return definitive results, in case we are in the final phase of
migrating the guest.

Also, if the permission bits in the HPTE don't allow writing, then we
know that no CPU can set C.  If the HPTE was previously writable and
the page was modified, any C bit writeback would have been flushed out
by the tlbie that we did when changing the HPTE to read-only.

Otherwise we need to do a TLB invalidation even if the C bit is 0, and
then check the C bit.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30 14:26:29 +02:00
Alexey Kardashevskiy
687414bebe KVM: PPC: Book3S HV: Fix dirty map for hugepages
The dirty map that we construct for the KVM_GET_DIRTY_LOG ioctl has
one bit per system page (4K/64K).  Currently, we only set one bit in
the map for each HPT entry with the Change bit set, even if the HPT is
for a large page (e.g., 16MB).  Userspace then considers only the
first system page dirty, though in fact the guest may have modified
anywhere in the large page.

To fix this, we make kvm_test_clear_dirty() return the actual number
of pages that are dirty (and rename it to kvm_test_clear_dirty_npages()
to emphasize that that's what it returns).  In kvmppc_hv_get_dirty_log()
we then set that many bits in the dirty map.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30 14:26:29 +02:00
Paul Mackerras
1066f7724c KVM: PPC: Book3S HV: Put huge-page HPTEs in rmap chain for base address
Currently, when a huge page is faulted in for a guest, we select the
rmap chain to insert the HPTE into based on the guest physical address
that the guest tried to access.  Since there is an rmap chain for each
system page, there are many rmap chains for the area covered by a huge
page (e.g. 256 for 16MB pages when PAGE_SIZE = 64kB), and the huge-page
HPTE could end up in any one of them.

For consistency, and to make the huge-page HPTEs easier to find, we now
put huge-page HPTEs in the rmap chain corresponding to the base address
of the huge page.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30 14:26:28 +02:00
Paul Mackerras
55765483e1 KVM: PPC: Book3S HV: Fix check for running inside guest in global_invalidates()
The global_invalidates() function contains a check that is intended
to tell whether we are currently executing in the context of a hypercall
issued by the guest.  The reason is that the optimization of using a
local TLB invalidate instruction is only valid in that context.  The
check was testing local_paca->kvm_hstate.kvm_vcore, which gets set
when entering the guest but no longer gets cleared when exiting the
guest.  To fix this, we use the kvm_vcpu field instead, which does
get cleared when exiting the guest, by the kvmppc_release_hwthread()
calls inside kvmppc_run_core().

The effect of having the check wrong was that when kvmppc_do_h_remove()
got called from htab_write() on the destination machine during a
migration, it cleared the current cpu's bit in kvm->arch.need_tlb_flush.
This meant that when the guest started running in the destination VM,
it may miss out on doing a complete TLB flush, and therefore may end
up using stale TLB entries from a previous guest that used the same
LPID value.

This should make migration more reliable.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30 14:26:28 +02:00
Paul Mackerras
e1d8a96daf KVM: PPC: Book3S: Move KVM_REG_PPC_WORT to an unused register number
Commit b005255e12 ("KVM: PPC: Book3S HV: Context-switch new POWER8
SPRs") added a definition of KVM_REG_PPC_WORT with the same register
number as the existing KVM_REG_PPC_VRSAVE (though in fact the
definitions are not identical because of the different register sizes.)

For clarity, this moves KVM_REG_PPC_WORT to the next unused number,
and also adds it to api.txt.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30 14:26:28 +02:00
Paul Mackerras
2f9c6943c5 KVM: PPC: Book3S: Add ONE_REG register names that were missed
Commit 3b7834743f ("KVM: PPC: Book3S HV: Reserve POWER8 space in get/set_one_reg") added definitions for several KVM_REG_PPC_* symbols
but missed adding some to api.txt.  This adds them.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30 14:26:27 +02:00
Alexander Graf
f2e91042a8 KVM: PPC: Add CAP to indicate hcall fixes
We worked around some nasty KVM magic page hcall breakages:

  1) NX bit not honored, so ignore NX when we detect it
  2) LE guests swizzle hypercall instruction

Without these fixes in place, there's no way it would make sense to expose kvm
hypercalls to a guest. Chances are immensely high it would trip over and break.

So add a new CAP that gives user space a hint that we have workarounds for the
bugs above in place. It can use those as hint to disable PV hypercalls when
the guest CPU is anything POWER7 or higher and the host does not have fixes
in place.

Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30 14:26:27 +02:00
Alexander Graf
aae6559651 KVM: PPC: MPIC: Reset IRQ source private members
When we reset the in-kernel MPIC controller, we forget to reset some hidden
state such as destmask and output. This state is usually set when the guest
writes to the IDR register for a specific IRQ line.

To make sure we stay in sync and don't forget hidden state, treat reset of
the IDR register as a simple write of the IDR register. That automatically
updates all the hidden state as well.

Reported-by: Paul Janzen <pcj@pauljanzen.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30 14:26:26 +02:00
Alexander Graf
42188365f9 KVM: PPC: Graciously fail broken LE hypercalls
There are LE Linux guests out there that don't handle hypercalls correctly.
Instead of interpreting the instruction stream from device tree as big endian
they assume it's a little endian instruction stream and fail.

When we see an illegal instruction from such a byte reversed instruction stream,
bail out graciously and just declare every hcall as error.

Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30 14:26:26 +02:00
Alexander Graf
235959be9a PPC: ePAPR: Fix hypercall on LE guest
We get an array of instructions from the hypervisor via device tree that
we write into a buffer that gets executed whenever we want to make an
ePAPR compliant hypercall.

However, the hypervisor passes us these instructions in BE order which
we have to manually convert to LE when we want to run them in LE mode.

With this fixup in place, I can successfully run LE kernels with KVM
PV enabled on PR KVM.

Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30 14:26:26 +02:00
Aneesh Kumar K.V
ddca156ae6 KVM: PPC: BOOK3S: Remove open coded make_dsisr in alignment handler
Use make_dsisr instead of open coding it. This also have
the added benefit of handling alignment interrupt on additional
instructions.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30 14:26:25 +02:00
Aneesh Kumar K.V
7310f3a5b0 KVM: PPC: BOOK3S: Always use the saved DAR value
Although it's optional, IBM POWER cpus always had DAR value set on
alignment interrupt. So don't try to compute these values.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30 14:26:25 +02:00
Alexander Graf
5c165aeca3 PPC: KVM: Make NX bit available with magic page
Because old kernels enable the magic page and then choke on NXed trampoline
code we have to disable NX by default in KVM when we use the magic page.

However, since commit b18db0b8 we have successfully fixed that and can now
leave NX enabled, so tell the hypervisor about this.

Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30 14:26:25 +02:00
Alexander Graf
f3383cf80e KVM: PPC: Disable NX for old magic page using guests
Old guests try to use the magic page, but map their trampoline code inside
of an NX region.

Since we can't fix those old kernels, try to detect whether the guest is sane
or not. If not, just disable NX functionality in KVM so that old guests at
least work at all. For newer guests, add a bit that we can set to keep NX
functionality available.

Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30 14:26:24 +02:00
Aneesh Kumar K.V
1f365bb0de KVM: PPC: BOOK3S: HV: Add mixed page-size support for guest
On recent IBM Power CPUs, while the hashed page table is looked up using
the page size from the segmentation hardware (i.e. the SLB), it is
possible to have the HPT entry indicate a larger page size.  Thus for
example it is possible to put a 16MB page in a 64kB segment, but since
the hash lookup is done using a 64kB page size, it may be necessary to
put multiple entries in the HPT for a single 16MB page.  This
capability is called mixed page-size segment (MPSS).  With MPSS,
there are two relevant page sizes: the base page size, which is the
size used in searching the HPT, and the actual page size, which is the
size indicated in the HPT entry. [ Note that the actual page size is
always >= base page size ].

We use "ibm,segment-page-sizes" device tree node to advertise
the MPSS support to PAPR guest. The penc encoding indicates whether
we support a specific combination of base page size and actual
page size in the same segment. We also use the penc value in the
LP encoding of HPTE entry.

This patch exposes MPSS support to KVM guest by advertising the
feature via "ibm,segment-page-sizes". It also adds the necessary changes
to decode the base page size and the actual page size correctly from the
HPTE entry.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30 14:26:24 +02:00
Aneesh Kumar K.V
792fc49787 KVM: PPC: BOOK3S: HV: Prefer CMA region for hash page table allocation
Today when KVM tries to reserve memory for the hash page table it
allocates from the normal page allocator first. If that fails it
falls back to CMA's reserved region. One of the side effects of
this is that we could end up exhausting the page allocator and
get linux into OOM conditions while we still have plenty of space
available in CMA.

This patch addresses this issue by first trying hash page table
allocation from CMA's reserved region before falling back to the normal
page allocator. So if we run out of memory, we really are out of memory.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30 14:26:24 +02:00
Alexander Graf
9916d57e64 KVM: PPC: Book3S PR: Expose TM registers
POWER8 introduces transactional memory which brings along a number of new
registers and MSR bits.

Implementing all of those is a pretty big headache, so for now let's at least
emulate enough to make Linux's context switching code happy.

Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30 14:26:23 +02:00
Alexander Graf
2e23f54413 KVM: PPC: Book3S PR: Expose EBB registers
POWER8 introduces a new facility called the "Event Based Branch" facility.
It contains of a few registers that indicate where a guest should branch to
when a defined event occurs and it's in PR mode.

We don't want to really enable EBB as it will create a big mess with !PR guest
mode while hardware is in PR and we don't really emulate the PMU anyway.

So instead, let's just leave it at emulation of all its registers.

Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30 14:26:23 +02:00
Alexander Graf
e14e7a1e53 KVM: PPC: Book3S PR: Expose TAR facility to guest
POWER8 implements a new register called TAR. This register has to be
enabled in FSCR and then from KVM's point of view is mere storage.

This patch enables the guest to use TAR.

Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30 14:26:23 +02:00
Alexander Graf
616dff8602 KVM: PPC: Book3S PR: Handle Facility interrupt and FSCR
POWER8 introduced a new interrupt type called "Facility unavailable interrupt"
which contains its status message in a new register called FSCR.

Handle these exits and try to emulate instructions for unhandled facilities.
Follow-on patches enable KVM to expose specific facilities into the guest.

Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30 14:26:22 +02:00
Alexander Graf
a5948fa092 KVM: PPC: Book3S PR: Emulate TIR register
In parallel to the Processor ID Register (PIR) threaded POWER8 also adds a
Thread ID Register (TIR). Since PR KVM doesn't emulate more than one thread
per core, we can just always expose 0 here.

Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30 14:26:22 +02:00
Alexander Graf
f8f6eb0d18 KVM: PPC: Book3S PR: Ignore PMU SPRs
When we expose a POWER8 CPU into the guest, it will start accessing PMU SPRs
that we don't emulate. Just ignore accesses to them.

Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30 14:26:22 +02:00
Alexander Graf
f24bc1ed45 KVM: PPC: Book3S: Move little endian conflict to HV KVM
With the previous patches applied, we can now successfully use PR KVM on
little endian hosts which means we can now allow users to select it.

However, HV KVM still needs some work, so let's keep the kconfig conflict
on that one.

Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30 14:26:21 +02:00
Alexander Graf
cd087eefe6 KVM: PPC: Book3S PR: Do dcbz32 patching with big endian instructions
When the host CPU we're running on doesn't support dcbz32 itself, but the
guest wants to have dcbz only clear 32 bytes of data, we loop through every
executable mapped page to search for dcbz instructions and patch them with
a special privileged instruction that we emulate as dcbz32.

The only guests that want to see dcbz act as 32byte are book3s_32 guests, so
we don't have to worry about little endian instruction ordering. So let's
just always search for big endian dcbz instructions, also when we're on a
little endian host.

Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30 14:26:21 +02:00
Alexander Graf
5deb8e7ad8 KVM: PPC: Make shared struct aka magic page guest endian
The shared (magic) page is a data structure that contains often used
supervisor privileged SPRs accessible via memory to the user to reduce
the number of exits we have to take to read/write them.

When we actually share this structure with the guest we have to maintain
it in guest endianness, because some of the patch tricks only work with
native endian load/store operations.

Since we only share the structure with either host or guest in little
endian on book3s_64 pr mode, we don't have to worry about booke or book3s hv.

For booke, the shared struct stays big endian. For book3s_64 hv we maintain
the struct in host native endian, since it never gets shared with the guest.

For book3s_64 pr we introduce a variable that tells us which endianness the
shared struct is in and route every access to it through helper inline
functions that evaluate this variable.

Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30 14:26:21 +02:00
Alexander Graf
2743103f91 KVM: PPC: PR: Fill pvinfo hcall instructions in big endian
We expose a blob of hypercall instructions to user space that it gives to
the guest via device tree again. That blob should contain a stream of
instructions necessary to do a hypercall in big endian, as it just gets
passed into the guest and old guests use them straight away.

Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30 14:26:20 +02:00
Alexander Graf
b59d9d26be KVM: PPC: Book3S PR: PAPR: Access RTAS in big endian
When the guest does an RTAS hypercall it keeps all RTAS variables inside a
big endian data structure.

To make sure we don't have to bother about endianness inside the actual RTAS
handlers, let's just convert the whole structure to host endian before we
call our RTAS handlers and back to big endian when we return to the guest.

Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30 14:26:20 +02:00
Alexander Graf
1692aa3faa KVM: PPC: Book3S PR: PAPR: Access HTAB in big endian
The HTAB on PPC is always in big endian. When we access it via hypercalls
on behalf of the guest and we're running on a little endian host, we need
to make sure we swap the bits accordingly.

Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30 14:26:20 +02:00