VMX hardware can cache the contents of a vcpu's vmcs. This cache needs
to be flushed when migrating a vcpu to another cpu, or (which is the case
that interests us here) when disabling hardware virtualization on a cpu.
The current implementation of decaching iterates over the list of all vcpus,
picks the ones that are potentially cached on the cpu that is being offlined,
and flushes the cache. The problem is that it uses mutex_trylock() to gain
exclusive access to the vcpu, which fires off a (benign) warning about using
the mutex in an interrupt context.
To avoid this, and to make things generally nicer, add a new per-cpu list
of potentially cached vcus. This makes the decaching code much simpler. The
list is vmx-specific since other hardware doesn't have this issue.
[andrea: fix crash on suspend/resume]
Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
KVM turns off hardware virtualization extensions during reboot, in order
to disassociate the memory used by the virtualization extensions from the
processor, and in order to have the system in a consistent state.
Unfortunately virtual machines may still be running while this goes on,
and once virtualization extensions are turned off, any virtulization
instruction will #UD on execution.
Fix by adding an exception handler to virtualization instructions; if we get
an exception during reboot, we simply spin waiting for the reset to complete.
If it's a true exception, BUG() so we can have our stack trace.
Signed-off-by: Avi Kivity <avi@qumranet.com>
The KVM MMU tries to detect when a speculative pte update is not actually
used by demand fault, by checking the accessed bit of the shadow pte. If
the shadow pte has not been accessed, we deem that page table flooded and
remove the shadow page table, allowing further pte updates to proceed
without emulation.
However, if the pte itself points at a page table and only used for write
operations, the accessed bit will never be set since all access will happen
through the emulator.
This is exactly what happens with kscand on old (2.4.x) HIGHMEM kernels.
The kernel points a kmap_atomic() pte at a page table, and then
proceeds with read-modify-write operations to look at the dirty and accessed
bits. We get a false flood trigger on the kmap ptes, which results in the
mmu spending all its time setting up and tearing down shadows.
Fix by setting the shadow accessed bit on emulated accesses.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Attached is a patch that fixes a guest crash when booting older Linux kernels.
The problem stems from the fact that we are currently emulating
MSR_K7_EVNTSEL[0-3], but not emulating MSR_K7_PERFCTR[0-3]. Because of this,
setup_k7_watchdog() in the Linux kernel receives a GPF when it attempts to
write into MSR_K7_PERFCTR, which causes an OOPs.
The patch fixes it by just "fake" emulating the appropriate MSRs, throwing
away the data in the process. This causes the NMI watchdog to not actually
work, but it's not such a big deal in a virtualized environment.
When we get a write to one of these counters, we printk_ratelimit() a warning.
I decided to print it out for all writes, even if the data is 0; it doesn't
seem to make sense to me to special case when data == 0.
Tested by myself on a RHEL-4 guest, and Joerg Roedel on a Windows XP 64-bit
guest.
Signed-off-by: Chris Lalancette <clalance@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
The in-kernel PIT emulation ignores pending timers if operating
under mode 3, which for example Hurd uses.
This mode should output a square wave, high for (N+1)/2 counts and low
for (N-1)/2 counts. As we only care about the resulting interrupts, the
period is N, and mode 3 is the same as mode 2 with regard to
interrupts.
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
Signed-off-by: Avi Kivity <avi@qumranet.com>
This patch allows VMAs that contain no backing page to be used for guest
memory. This is useful for assigning mmio regions to a guest.
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
To distinguish between real page faults and nested page faults they should be
traced as different events. This is implemented by this patch.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
This patch adds the missing kvmtrace markers to the svm
module of kvm.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
This patch adds some kvmtrace bits to the generic x86 code
where it is instrumented from SVM.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
With an exit handler for INTR intercepts its possible to account them using
kvmtrace.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
With an exit handler for NMI intercepts its possible to account them using
kvmtrace.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
This patch moves the trace entry for APIC accesses from the VMX code to the
generic lapic code. This way APIC accesses from SVM will also be traced.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Noticed by sparse:
arch/x86/kvm/vmx.c:1583:6: warning: symbol 'vmx_disable_intercept_for_msr' was not declared. Should it be static?
arch/x86/kvm/x86.c:3406:5: warning: symbol 'kvm_task_switch_16' was not declared. Should it be static?
arch/x86/kvm/x86.c:3429:5: warning: symbol 'kvm_task_switch_32' was not declared. Should it be static?
arch/x86/kvm/mmu.c:1968:6: warning: symbol 'kvm_mmu_remove_one_alloc_mmu_page' was not declared. Should it be static?
arch/x86/kvm/mmu.c:2014:6: warning: symbol 'mmu_destroy_caches' was not declared. Should it be static?
arch/x86/kvm/lapic.c:862:5: warning: symbol 'kvm_lapic_get_base' was not declared. Should it be static?
arch/x86/kvm/i8254.c:94:5: warning: symbol 'pit_get_gate' was not declared. Should it be static?
arch/x86/kvm/i8254.c:196:5: warning: symbol '__pit_timer_fn' was not declared. Should it be static?
arch/x86/kvm/i8254.c:561:6: warning: symbol '__inject_pit_timer_intr' was not declared. Should it be static?
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
kvm_dev_ioctl casts the arg value to void __user *, just to recast it
again to long. This seems unnecessary.
According to objdump the binary code on x86 is unchanged by this patch.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
random uvesafb failures were reported against Gentoo:
http://bugs.gentoo.org/show_bug.cgi?id=222799
and Mihai Moldovan bisected it back to:
> 8f4d37ec07 is first bad commit
> commit 8f4d37ec07
> Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Date: Fri Jan 25 21:08:29 2008 +0100
>
> sched: high-res preemption tick
Linus suspected it to be hrtick + vm86 interaction and observed:
> Btw, Peter, Ingo: I think that commit is doing bad things. They aren't
> _incorrect_ per se, but they are definitely bad.
>
> Why?
>
> Using random _TIF_WORK_MASK flags is really impolite for doing
> "scheduling" work. There's a reason that arch/x86/kernel/entry_32.S
> special-cases the _TIF_NEED_RESCHED flag: we don't want to exit out of
> vm86 mode unnecessarily.
>
> See the "work_notifysig_v86" label, and how it does that
> "save_v86_state()" thing etc etc.
Right, I never liked having to fiddle with those TIF flags. Initially I
needed it because the hrtimer base lock could not nest in the rq lock.
That however is fixed these days.
Currently the only reason left to fiddle with the TIF flags is remote
wakeups. We cannot program a remote cpu's hrtimer. I've been thinking
about using the new and improved IPI function call stuff to implement
hrtimer_start_on().
However that does require that smp_call_function_single(.wait=0) works
from interrupt context - /me looks at the latest series from Jens - Yes
that does seem to be supported, good.
Here's a stab at cleaning this stuff up ...
Mihai reported test success as well.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Tested-by: Mihai Moldovan <ionic@ionic.de>
Cc: Michal Januszewski <spock@gentoo.org>
Cc: Antonino Daplas <adaplas@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Some cleanups in speedstep-centrino.c for NR_CPUS=4096.
* Use new CPUMASK_PTR (instead of old CPUMASK_VAR).
* Replace arrays sized by NR_CPUS with percpu variables.
* Cleanup some formatting problems (>80 chars per line)
and other checkpatch complaints.
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* Rename CPUMASK_VAR --> CPUMASK_PTR (and simplify)
* Fix a semantic error in CPUMASK_ALLOC
* Add a bit of commentry to cpumask.h
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* nr_cpu_ids should be used to determine if a percpu area is
available for a given cpu.
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* Replace NR_CPUS loop with for_each_possible_cpu().
* nr_cpu_ids should be used to determine if a percpu area is
available for a given cpu.
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* Use nr_cpu_ids instead of NR_CPUS to limit traversal of cpu online map.
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* nr_cpu_ids should be used to allocate arrays based on the number of
cpu's present.
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
It's not possible to enable the unknown_nmi_panic sysctl option
until init is run. It's useful to be able to panic the kernel
during boot too, this adds a parameter to enable this option.
Signed-off-by: Simon Arlott <simon@fire.lp0.eu>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
so NUMAQ can use that to call numaq_pre_time_init()
This allows us to remove a NUMAQ special from arch/x86/kernel/setup.c.
(and paves the way to remove the NUMAQ subarch)
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
add these new x86_quirks methods:
int *mpc_record;
int (*mpc_apic_id)(struct mpc_config_processor *m);
void (*mpc_oem_bus_info)(struct mpc_config_bus *m, char *name);
void (*mpc_oem_pci_bus)(struct mpc_config_bus *m);
void (*smp_read_mpc_oem)(struct mp_config_oemtable *oemtable,
unsigned short oemsize);
... and move NUMAQ related mps table handling to numaq_32.c.
also move the call to smp_read_mpc_oem() to smp_read_mpc() directly.
Should not change functionality, albeit it would be nice to get it
tested on real NUMAQ as well ...
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
introduce x86_quirks array of boot-time quirk methods.
No change in functionality intended.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add size table functions for qdiscs and calculate packet size in
qdisc_enqueue().
Based on patch by Patrick McHardy
http://marc.info/?l=linux-netdev&m=115201979221729&w=2
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
From: Arjan van de Ven <arjan@infradead.org>
Date: Sat, 19 Jul 2008 15:47:17 -0700
CONFIG_NONPROMISC_DEVMEM was a rather confusing name - but renaming it
to CONFIG_PROMISC_DEVMEM causes problems on architectures that do not
support this feature; this patch renames it to CONFIG_STRICT_DEVMEM,
so that architectures can opt-in into it.
( the polarity of the option is still the same as it was originally; it
needs to be for now to not break architectures that don't have the
infastructure yet to support this feature)
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: "V.Radhakrishnan" <rk@atr-labs.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
Hash et al. sizing code in SCTP wants to make the
calculation totalram_pages - totalhigh_pages, just
like TCP. But this requires an export for the
CONFIG_HIGHMEM case to work.
Signed-off-by: David S. Miller <davem@davemloft.net>
The caller has alredy checked for them.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use sockaddr_storage{} for generic socket address storage
and ensures proper alignment.
Use sockaddr{} for pointers to omit several casts.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Without CONFIG_NET_NS, namespace is always &init_net.
Compiler will be able to omit namespace comparisons with this patch.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The limit of 128 bytes is too small when debugging slab corruption of the skb
cache, for example. So increase the limit to PAGE_SIZE to make debugging
corruptions easier.
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>