2011-04-04 11:46:58 +08:00
/*
 * Copyright 2011 IBM Corporation.
 *
 * This program is free software; you can redistribute it and/or
 * modify it under the terms of the GNU General Public License
 * as published by the Free Software Foundation; either version
 * 2 of the License, or (at your option) any later version.
 *
 */
2011-04-15 06:32:06 +08:00
2011-04-04 11:46:58 +08:00
#include <linux/types.h>
#include <linux/kernel.h>
#include <linux/irq.h>
#include <linux/smp.h>
#include <linux/interrupt.h>
#include <linux/init.h>
#include <linux/cpu.h>
#include <linux/of.h>
#include <linux/spinlock.h>
KVM: PPC: Allow book3s_hv guests to use SMT processor modes
This lifts the restriction that book3s_hv guests can only run one
hardware thread per core, and allows them to use up to 4 threads
per core on POWER7. The host still has to run single-threaded.
This capability is advertised to qemu through a new KVM_CAP_PPC_SMT
capability. The return value of the ioctl querying this capability
is the number of vcpus per virtual CPU core (vcore), currently 4.
To use this, the host kernel should be booted with all threads
active, and then all the secondary threads should be offlined.
This will put the secondary threads into nap mode. KVM will then
wake them from nap mode and use them for running guest code (while
they are still offline). To wake the secondary threads, we send
them an IPI using a new xics_wake_cpu() function, implemented in
arch/powerpc/sysdev/xics/icp-native.c. In other words, at this stage
we assume that the platform has a XICS interrupt controller and
we are using icp-native.c to drive it. Since the woken thread will
need to acknowledge and clear the IPI, we also export the base
physical address of the XICS registers using kvmppc_set_xics_phys()
for use in the low-level KVM book3s code.
When a vcpu is created, it is assigned to a virtual CPU core.
The vcore number is obtained by dividing the vcpu number by the
number of threads per core in the host. This number is exported
to userspace via the KVM_CAP_PPC_SMT capability. If qemu wishes
to run the guest in single-threaded mode, it should make all vcpu
numbers be multiples of the number of threads per core.
We distinguish three states of a vcpu: runnable (i.e., ready to execute
the guest), blocked (that is, idle), and busy in host. We currently
implement a policy that the vcore can run only when all its threads
are runnable or blocked. This way, if a vcpu needs to execute elsewhere
in the kernel or in qemu, it can do so without being starved of CPU
by the other vcpus.
When a vcore starts to run, it executes in the context of one of the
vcpu threads. The other vcpu threads all go to sleep and stay asleep
until something happens requiring the vcpu thread to return to qemu,
or to wake up to run the vcore (this can happen when another vcpu
thread goes from busy in host state to blocked).
It can happen that a vcpu goes from blocked to runnable state (e.g.
because of an interrupt), and the vcore it belongs to is already
running. In that case it can start to run immediately as long as
none of the vcpus in the vcore have started to exit the guest.
We send the next free thread in the vcore an IPI to get it to start
to execute the guest. It synchronizes with the other threads via
the vcore->entry_exit_count field to make sure that it doesn't go
into the guest if the other vcpus are exiting by the time that it
is ready to actually enter the guest.
Note that there is no fixed relationship between the hardware thread
number and the vcpu number. Hardware threads are assigned to vcpus
as they become runnable, so we will always use the lower-numbered
hardware threads in preference to higher-numbered threads if not all
the vcpus in the vcore are runnable, regardless of which vcpus are
runnable.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
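
The vcpu-to-vcore mapping and the run policy described in the commit message above can be sketched in plain C. This is a minimal userspace illustration, not the KVM code: the helper names (`vcore_of_vcpu`, `single_threaded_vcpu_id`, `vcore_may_run`) and the `vcpu_state` enum are hypothetical, and `threads_per_core` stands for the value reported through KVM_CAP_PPC_SMT.

```c
#include <stdbool.h>

/* Hypothetical names; only the arithmetic comes from the commit message. */
enum vcpu_state { VCPU_RUNNABLE, VCPU_BLOCKED, VCPU_BUSY_IN_HOST };

/* The vcore number is the vcpu number divided by the threads per core. */
static inline int vcore_of_vcpu(int vcpu_id, int threads_per_core)
{
	return vcpu_id / threads_per_core;
}

/*
 * For a single-threaded guest, every vcpu number is a multiple of the
 * threads per core, so the n-th guest CPU gets a vcore to itself.
 */
static inline int single_threaded_vcpu_id(int n, int threads_per_core)
{
	return n * threads_per_core;
}

/* A vcore may run only when each of its vcpus is runnable or blocked. */
static inline bool vcore_may_run(const enum vcpu_state *states, int nr_vcpus)
{
	for (int i = 0; i < nr_vcpus; i++)
		if (states[i] == VCPU_BUSY_IN_HOST)
			return false;
	return true;
}
```

With four threads per core, vcpus 4 to 7 share vcore 1, while a single-threaded guest would number its vcpus 0, 4, 8 and so on.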
2011-06-29 08:23:08 +08:00
#include <linux/module.h>
2011-04-04 11:46:58 +08:00

#include <asm/prom.h>
#include <asm/io.h>
#include <asm/smp.h>
#include <asm/irq.h>
#include <asm/errno.h>
#include <asm/xics.h>
2011-06-29 08:23:08 +08:00
#include <asm/kvm_ppc.h>
2014-06-11 13:59:28 +08:00
#include <asm/dbell.h>
2011-04-04 11:46:58 +08:00

struct icp_ipl {
	union {
		u32 word;
		u8 bytes[4];
	} xirr_poll;
	union {
		u32 word;
		u8 bytes[4];
	} xirr;
	u32 dummy;
	union {
		u32 word;
		u8 bytes[4];
	} qirr;
	u32 link_a;
	u32 link_b;
	u32 link_c;
};
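
The byte offsets implied by struct icp_ipl can be checked with a small userspace sketch. It assumes the kernel's u32 and u8 correspond to uint32_t and uint8_t; the name `icp_ipl_sketch` is hypothetical, and the offsets follow only from the member order above, not from any hardware manual.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace mirror of struct icp_ipl, using fixed-width stdint types. */
struct icp_ipl_sketch {
	union { uint32_t word; uint8_t bytes[4]; } xirr_poll;
	union { uint32_t word; uint8_t bytes[4]; } xirr;
	uint32_t dummy;
	union { uint32_t word; uint8_t bytes[4]; } qirr;
	uint32_t link_a;
	uint32_t link_b;
	uint32_t link_c;
};

/* Every member is 4 bytes wide, so no padding is expected. */
static_assert(offsetof(struct icp_ipl_sketch, xirr_poll) == 0x00, "xirr_poll");
static_assert(offsetof(struct icp_ipl_sketch, xirr) == 0x04, "xirr");
static_assert(offsetof(struct icp_ipl_sketch, dummy) == 0x08, "dummy");
static_assert(offsetof(struct icp_ipl_sketch, qirr) == 0x0c, "qirr");
static_assert(offsetof(struct icp_ipl_sketch, link_a) == 0x10, "link_a");
static_assert(sizeof(struct icp_ipl_sketch) == 0x1c, "total size");
```

In the driver, the CPU priority is written through xirr.bytes[0] and the IPI priority through qirr.bytes[0], so these offsets are the ones the accessors below depend on.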

static struct icp_ipl __iomem *icp_native_regs[NR_CPUS];

static inline unsigned int icp_native_get_xirr(void)
{
	int cpu = smp_processor_id();
2013-04-18 04:30:50 +08:00
	unsigned int xirr;

	/* Handle an interrupt latched by KVM */
	xirr = kvmppc_get_xics_latch();
	if (xirr)
		return xirr;
2011-04-04 11:46:58 +08:00

	return in_be32(&icp_native_regs[cpu]->xirr.word);
}

static inline void icp_native_set_xirr(unsigned int value)
{
	int cpu = smp_processor_id();

	out_be32(&icp_native_regs[cpu]->xirr.word, value);
}

static inline void icp_native_set_cppr(u8 value)
{
	int cpu = smp_processor_id();

	out_8(&icp_native_regs[cpu]->xirr.bytes[0], value);
}

static inline void icp_native_set_qirr(int n_cpu, u8 value)
{
	out_8(&icp_native_regs[n_cpu]->qirr.bytes[0], value);
}

static void icp_native_set_cpu_priority(unsigned char cppr)
{
	xics_set_base_cppr(cppr);
	icp_native_set_cppr(cppr);
	iosync();
}

2013-04-26 03:20:59 +08:00
void icp_native_eoi(struct irq_data *d)
2011-04-04 11:46:58 +08:00
{
2011-05-04 13:02:15 +08:00
	unsigned int hw_irq = (unsigned int)irqd_to_hwirq(d);
2011-04-04 11:46:58 +08:00

	iosync();
	icp_native_set_xirr((xics_pop_cppr() << 24) | hw_irq);
}

static void icp_native_teardown_cpu(void)
{
	int cpu = smp_processor_id();

	/* Clear any pending IPI */
	icp_native_set_qirr(cpu, 0xff);
}

static void icp_native_flush_ipi(void)
{
	/* We take the IPI irq but never return, so we
	 * need to EOI the IPI but want to leave our priority at 0.
	 *
	 * should we check all the other interrupts too?
	 * should we be flagging idle loop instead?
	 * or creating some task to be scheduled?
	 */

	icp_native_set_xirr((0x00 << 24) | XICS_IPI);
}

static unsigned int icp_native_get_irq(void)
{
	unsigned int xirr = icp_native_get_xirr();
	unsigned int vec = xirr & 0x00ffffff;
	unsigned int irq;

	if (vec == XICS_IRQ_SPURIOUS)
		return NO_IRQ;

2012-06-04 13:04:37 +08:00
	irq = irq_find_mapping(xics_host, vec);
2011-04-04 11:46:58 +08:00
	if (likely(irq != NO_IRQ)) {
		xics_push_cppr(vec);
		return irq;
	}

	/* We don't have a linux mapping, so have rtas mask it. */
	xics_mask_unknown_vec(vec);

	/* We might learn about it later, so EOI it */
	icp_native_set_xirr(xirr);

	return NO_IRQ;
}
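
The XIRR handling in icp_native_get_irq() and icp_native_eoi() relies on one 32-bit layout: the low 24 bits carry the interrupt vector and the top byte carries the priority. A minimal userspace sketch of that packing follows; the helper names and the `XIRR_VEC_MASK` macro are hypothetical, and unlike the driver's EOI path this sketch masks the vector before packing.

```c
#include <stdint.h>

/* Hypothetical mask name; the value is the 0x00ffffff used by the driver. */
#define XIRR_VEC_MASK 0x00ffffffu

/* Low 24 bits: the interrupt source number (vector). */
static inline uint32_t xirr_vector(uint32_t xirr)
{
	return xirr & XIRR_VEC_MASK;
}

/* Top byte: the priority stored alongside the vector. */
static inline uint8_t xirr_cppr(uint32_t xirr)
{
	return (uint8_t)(xirr >> 24);
}

/* Build the word written back on EOI: (cppr << 24) | vector. */
static inline uint32_t xirr_make(uint8_t cppr, uint32_t vec)
{
	return ((uint32_t)cppr << 24) | (vec & XIRR_VEC_MASK);
}
```

For instance, the word 0x05000010 decodes to vector 0x10 with priority 5, and xirr_make(5, 0x10) rebuilds the same word.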

#ifdef CONFIG_SMP

powerpc: Consolidate ipi message mux and demux
Consolidate the mux and demux of ipi messages into smp.c and call
a new smp_ops callback to actually trigger the ipi.
The powerpc architecture code is optimised for having 4 distinct
ipi triggers, which are mapped to 4 distinct messages (ipi many, ipi
single, scheduler ipi, and enter debugger). However, several interrupt
controllers only provide a single software triggered interrupt that
can be delivered to each cpu. To resolve this limitation, each smp_ops
implementation created a per-cpu variable that is manipulated with atomic
bitops. Since these lines will be contended they are optimally marked as
shared_aligned and take a full cache line for each cpu. Distro kernels
may have 2 or 3 of these in their config, each taking per-cpu space
even though at most one will be in use.
This consolidation removes smp_message_recv and replaces the single call
actions cases with direct calls from the common message recognition loop.
The complicated debugger ipi case with its muxed crash handling code is
moved to debug_ipi_action which is now called from the demux code (instead
of the multi-message action calling smp_message_recv).
I put a call to reschedule_action to increase the likelihood of correctly
merging the anticipated scheduler_ipi() hook coming from the scheduler
tree; that single required call can be inlined later.
The actual message decode is a copy of the old pseries xics code with its
memory barriers and cache line spacing, augmented with a per-cpu unsigned
long based on the book-e doorbell code. The optional data is set via a
callback from the implementation and is passed to the new cause-ipi hook
along with the logical cpu number. While currently only the doorbell
implementation uses this data it should be almost zero cost to retrieve and
pass it -- it adds a single register load for the argument from the same
cache line to which we just completed a store and the register is dead
on return from the call. I extended the data element from unsigned int
to unsigned long in case some other code wanted to associate a pointer.
The doorbell check_self is replaced by a call to smp_muxed_ipi_resend,
conditioned on the CPU_DBELL feature. The ifdef guard could be relaxed
to CONFIG_SMP but I left it with BOOKE for now.
Also, the doorbell interrupt vector for book-e was not calling irq_enter
and irq_exit, which throws off cpu accounting and causes code to not
realize it is running in interrupt context. Add the missing calls.
Signed-off-by: Milton Miller <miltonm@bga.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
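
The muxing idea in the commit message above, several logical messages sharing one hardware IPI through a per-cpu word of atomic bits, can be sketched in userspace with C11 atomics. This is an illustration under stated assumptions, not the code in smp.c: `SKETCH_NR_CPUS`, the `ipi_msg` names, `ipi_mux_post()` and `ipi_demux_take()` are all hypothetical, and the cache-line alignment and explicit barriers the commit mentions are left out.

```c
#include <stdatomic.h>

#define SKETCH_NR_CPUS 4	/* hypothetical CPU count for the sketch */

/* Four logical messages multiplexed onto one hardware IPI. */
enum ipi_msg {
	MSG_CALL_FUNCTION,
	MSG_RESCHEDULE,
	MSG_CALL_FUNC_SINGLE,
	MSG_DEBUGGER_BREAK,
};

/* One word of pending-message bits per cpu (static storage starts at zero). */
static atomic_ulong pending[SKETCH_NR_CPUS];

/* Mux: set the message bit; the caller would then raise the single IPI. */
static void ipi_mux_post(int cpu, enum ipi_msg msg)
{
	atomic_fetch_or(&pending[cpu], 1UL << msg);
}

/* Demux: atomically take every pending bit so no message is lost. */
static unsigned long ipi_demux_take(int cpu)
{
	return atomic_exchange(&pending[cpu], 0UL);
}
```

A receiver would call ipi_demux_take() from its IPI handler and act on each set bit; posting two messages before the handler runs still costs only one interrupt.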
2011-05-11 03:29:39 +08:00
static void icp_native_cause_ipi(int cpu, unsigned long data)
2011-04-04 11:46:58 +08:00
{
2013-04-18 04:30:50 +08:00
	kvmppc_set_host_ipi(cpu, 1);
2014-06-11 13:59:28 +08:00
#ifdef CONFIG_PPC_DOORBELL
	if (cpu_has_feature(CPU_FTR_DBELL) &&
	    (cpumask_test_cpu(cpu, cpu_sibling_mask(smp_processor_id()))))
		doorbell_cause_ipi(cpu, data);
	else
#endif
		icp_native_set_qirr(cpu, IPI_PRIORITY);
2011-04-04 11:46:58 +08:00
}

powerpc/powernv: Don't call generic code on offline cpus
On PowerNV platforms, when a CPU is offline, we put it into nap mode.
It's possible that the CPU wakes up from nap mode while it is still
offline due to a stray IPI. A misdirected device interrupt could also
potentially cause it to wake up. In that circumstance, we need to clear
the interrupt so that the CPU can go back to nap mode.
In the past the clearing of the interrupt was accomplished by briefly
enabling interrupts and allowing the normal interrupt handling code
(do_IRQ() etc.) to handle the interrupt. This has the problem that
this code calls irq_enter() and irq_exit(), which call functions such
as account_system_vtime() which use RCU internally. Use of RCU is not
permitted on offline CPUs and will trigger errors if RCU checking is
enabled.
To avoid calling into any generic code which might use RCU, we adopt
a different method of clearing interrupts on offline CPUs. Since we
are on the PowerNV platform, we know that the system interrupt
controller is a XICS being driven directly (i.e. not via hcalls) by
the kernel. Hence this adds a new icp_native_flush_interrupt()
function to the native-mode XICS driver and arranges to call that
when an offline CPU is woken from nap. This new function reads the
interrupt from the XICS. If it is an IPI, it clears the IPI; if it
is a device interrupt, it prints a warning and disables the source.
Then it does the end-of-interrupt processing for the interrupt.
The other thing that briefly enabling interrupts did was to check and
clear the irq_happened flag in this CPU's PACA. Therefore, after
flushing the interrupt from the XICS, we also clear all bits except
the PACA_IRQ_HARD_DIS (interrupts are hard disabled) bit from the
irq_happened flag. The PACA_IRQ_HARD_DIS flag is set by power7_nap()
and is left set to indicate that interrupts are hard disabled. This
means we then have to ignore that flag in power7_nap(), which is
reasonable since it doesn't indicate that any interrupt event needs
servicing.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2014-09-02 12:23:16 +08:00
/*
 * Called when an interrupt is received on an off-line CPU to
 * clear the interrupt, so that the CPU can go back to nap mode.
 */
void icp_native_flush_interrupt(void)
{
	unsigned int xirr = icp_native_get_xirr();
	unsigned int vec = xirr & 0x00ffffff;

	if (vec == XICS_IRQ_SPURIOUS)
		return;
	if (vec == XICS_IPI) {
		/* Clear pending IPI */
		int cpu = smp_processor_id();
		kvmppc_set_host_ipi(cpu, 0);
		icp_native_set_qirr(cpu, 0xff);
	} else {
		pr_err("XICS: hw interrupt 0x%x to offline cpu, disabling\n",
		       vec);
		xics_mask_unknown_vec(vec);
	}
	/* EOI the interrupt */
	icp_native_set_xirr(xirr);
}

2011-06-29 08:23:08 +08:00
void xics_wake_cpu(int cpu)
{
	icp_native_set_qirr(cpu, IPI_PRIORITY);
}
EXPORT_SYMBOL_GPL(xics_wake_cpu);

2011-04-04 11:46:58 +08:00
static irqreturn_t icp_native_ipi_action(int irq, void *dev_id)
{
	int cpu = smp_processor_id();

2013-04-18 04:30:50 +08:00
	kvmppc_set_host_ipi(cpu, 0);
2011-04-04 11:46:58 +08:00
	icp_native_set_qirr(cpu, 0xff);

2011-05-11 03:29:39 +08:00
	return smp_ipi_demux();
2011-04-04 11:46:58 +08:00
}

#endif /* CONFIG_SMP */

static int __init icp_native_map_one_cpu(int hw_id, unsigned long addr,
					 unsigned long size)
{
	char *rname;
	int i, cpu = -1;

	/* This may look gross but it's good enough for now, we don't quite
	 * have a hard -> linux processor id matching.
	 */
	for_each_possible_cpu(i) {
		if (!cpu_present(i))
			continue;
		if (hw_id == get_hard_smp_processor_id(i)) {
			cpu = i;
			break;
		}
	}

	/* Fail, skip that CPU. Don't print, it's normal, some XICS come up
	 * with way more entries in there than you have CPUs
	 */
	if (cpu == -1)
		return 0;

	rname = kasprintf(GFP_KERNEL, "CPU %d [0x%x] Interrupt Presentation",
			  cpu, hw_id);

	if (!request_mem_region(addr, size, rname)) {
		pr_warning("icp_native: Could not reserve ICP MMIO"
			   " for CPU %d, interrupt server #0x%x\n",
			   cpu, hw_id);
		return -EBUSY;
	}

	icp_native_regs[cpu] = ioremap(addr, size);
2011-06-29 08:23:08 +08:00
	kvmppc_set_xics_phys(cpu, addr);
2011-04-04 11:46:58 +08:00
	if (!icp_native_regs[cpu]) {
		pr_warning("icp_native: Failed ioremap for CPU %d, "
			   "interrupt server #0x%x, addr %#lx\n",
			   cpu, hw_id, addr);
		release_mem_region(addr, size);
		return -ENOMEM;
	}
	return 0;
}

static int __init icp_native_init_one_node(struct device_node *np,
					   unsigned int *indx)
{
	unsigned int ilen;
2013-08-07 00:01:34 +08:00
	const __be32 *ireg;
2011-04-04 11:46:58 +08:00
	int i;
	int reg_tuple_size;
	int num_servers = 0;

	/* This code makes the theoretically broken assumption that the interrupt
	 * server numbers are the same as the hard CPU numbers.
	 * This happens to be the case so far but we are playing with fire...
	 * should be fixed one of these days. -BenH.
	 */
	ireg = of_get_property(np, "ibm,interrupt-server-ranges", &ilen);

	/* Does that ever happen? We'll know soon enough... but even good old
	 * f80 does have that property ..
	 */
	WARN_ON((ireg == NULL) || (ilen != 2*sizeof(u32)));

	if (ireg) {
		*indx = of_read_number(ireg, 1);
		if (ilen >= 2*sizeof(u32))
			num_servers = of_read_number(ireg + 1, 1);
	}

	ireg = of_get_property(np, "reg", &ilen);
	if (!ireg) {
		pr_err("icp_native: Can't find interrupt reg property\n");
		return -1;
	}

	reg_tuple_size = (of_n_addr_cells(np) + of_n_size_cells(np)) * 4;
	if (((ilen % reg_tuple_size) != 0)
	    || (num_servers && (num_servers != (ilen / reg_tuple_size)))) {
		pr_err("icp_native: ICP reg len (%d) != num servers (%d)\n",
		       ilen / reg_tuple_size, num_servers);
		return -1;
	}

	for (i = 0; i < (ilen / reg_tuple_size); i++) {
		struct resource r;
		int err;

		err = of_address_to_resource(np, i, &r);
		if (err) {
			pr_err("icp_native: Could not translate ICP MMIO"
			       " for interrupt server 0x%x (%d)\n", *indx, err);
			return -1;
		}

2011-06-10 00:13:32 +08:00
		if (icp_native_map_one_cpu(*indx, r.start, resource_size(&r)))
2011-04-04 11:46:58 +08:00
			return -1;

		(*indx)++;
	}
	return 0;
}
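
The consistency check that icp_native_init_one_node() applies to the "reg" property can be isolated as plain arithmetic. The sketch below assumes 4-byte device-tree cells, as the function does; the helper names `reg_tuple_bytes()` and `count_reg_tuples()` are hypothetical, and returning the tuple count is a choice of this sketch rather than something the driver does.

```c
/* Bytes per "reg" tuple: address cells plus size cells, 4 bytes each. */
static int reg_tuple_bytes(int addr_cells, int size_cells)
{
	return (addr_cells + size_cells) * 4;
}

/*
 * Return the number of tuples in a property of ilen bytes, or -1 when
 * the length is not a whole number of tuples or disagrees with a
 * non-zero num_servers (zero means the server count is unknown).
 */
static int count_reg_tuples(int ilen, int tuple_bytes, int num_servers)
{
	int n;

	if (tuple_bytes <= 0 || ilen % tuple_bytes != 0)
		return -1;
	n = ilen / tuple_bytes;
	if (num_servers && num_servers != n)
		return -1;
	return n;
}
```

With two address cells and two size cells a tuple is 16 bytes, so a 64-byte property describes four interrupt servers, and a 60-byte property is rejected.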

static const struct icp_ops icp_native_ops = {
	.get_irq	= icp_native_get_irq,
	.eoi		= icp_native_eoi,
	.set_priority	= icp_native_set_cpu_priority,
	.teardown_cpu	= icp_native_teardown_cpu,
	.flush_ipi	= icp_native_flush_ipi,
#ifdef CONFIG_SMP
	.ipi_action	= icp_native_ipi_action,
2011-05-11 03:29:39 +08:00
	.cause_ipi	= icp_native_cause_ipi,
2011-04-04 11:46:58 +08:00
#endif
};

2011-08-25 14:07:13 +08:00
int __init icp_native_init(void)
2011-04-04 11:46:58 +08:00
{
	struct device_node *np;
	u32 indx = 0;
	int found = 0;

	for_each_compatible_node(np, NULL, "ibm,ppc-xicp")
		if (icp_native_init_one_node(np, &indx) == 0)
			found = 1;
	if (!found) {
		for_each_node_by_type(np,
			"PowerPC-External-Interrupt-Presentation") {
			if (icp_native_init_one_node(np, &indx) == 0)
				found = 1;
		}
	}

	if (found == 0)
		return -ENODEV;

	icp_ops = &icp_native_ops;

	return 0;
}