In the sync-page path, if spte.writable changes, page dirty tracking can be
lost. For example:
assume spte.writable = 0 in an unsync page. When the page is synced, the
spte is mapped writable (spte.writable = 1). Later the guest writes
spte.gfn, so spte.gfn is dirty. Then the guest changes the mapping to
read-only, and after it is synced again, spte.writable = 0.
So, when the host releases the spte, it sees spte.writable = 0 and does not
mark the page dirty, and the guest's write is lost.
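A minimal sketch of the idea (is_writable_pte(), kvm_set_pfn_dirty() and
spte_to_pfn() are existing KVM helpers; the placement is illustrative, not
the actual patch):

    /*
     * Hedged sketch: before the sync path clears the writable bit,
     * propagate the dirty state so it is not lost when the spte is
     * finally released with spte.writable == 0.
     */
    if (is_writable_pte(*sptep) && !is_writable_pte(new_spte))
            kvm_set_pfn_dirty(spte_to_pfn(*sptep));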
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
In the current code, if EPT is enabled (shadow_accessed_mask = 0), page
accessed tracking is lost.
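The shape of the handling might look like this (a sketch built from
existing helpers, not the patch itself): with shadow_accessed_mask == 0
there is no accessed bit to test, so assume the page was accessed whenever
a present spte is torn down.

    /* hedged sketch */
    if (!shadow_accessed_mask || (old_spte & shadow_accessed_mask))
            kvm_set_pfn_accessed(spte_to_pfn(old_spte));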
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
In the speculative path, we should check the guest pte's reserved bits just
as the real processor does.
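Illustratively (is_rsvd_bits_set() is the existing checker; the call site
shown is a sketch):

    /* bail out of the speculative update on reserved-bit violations */
    if (is_rsvd_bits_set(vcpu, gpte, PT_PAGE_TABLE_LEVEL))
            return;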
Reported-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
The index wasn't calculated correctly (off by one) for the huge spte, so
the KVM guest was unstable with transparent hugepages.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
If the destination is a memory operand and the memory cannot map to a valid
page, the xchg instruction emulation and locked instruction emulation will
not work on I/O regions and will get stuck in an endless loop. We should
emulate exchange as a write to fix it.
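A hedged sketch of the fallback (gfn_to_page(), is_error_page() and
emulator_write_emulated() are existing functions; the exact control flow is
illustrative):

    /* cannot map the target page: emulate exchange as a plain write */
    page = gfn_to_page(vcpu->kvm, gpa >> PAGE_SHIFT);
    if (is_error_page(page))
            return emulator_write_emulated(addr, new, bytes, vcpu);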
Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Acked-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
If the PIT delivers an interrupt while the PIC is masking it, the OS will
never do EOI and the ack notifier will not be called, so when the PIT is
unmasked no PIT interrupts will be delivered any more. Calling mask
notifiers solves this issue.
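Illustratively (kvm_fire_mask_notifiers() is the existing hook; the
surrounding condition is a sketch):

    /* tell mask notifiers (e.g. the PIT's) about PIC mask changes */
    if (masked != was_masked)
            kvm_fire_mask_notifiers(kvm, irq, masked);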
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
With TDP enabled we should get into the emulator only when emulating I/O,
so reexecution will always bring us back into the emulator.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Mark inc (0xfe/0 0xff/0) and dec (0xfe/1 0xff/1) as lock prefix capable.
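In the emulator's decode tables this is a flags change; a sketch of the
idea (the real 0xfe/0xff entries dispatch through group tables):

    /* hedged sketch: mark inc/dec as Lock-capable */
    [0xFE] = ByteOp | DstMem | ModRM | Lock,    /* Grp4: inc/dec r/m8 */
    [0xFF] = DstMem | ModRM | Lock,             /* Grp5: inc/dec r/m */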
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
'level' and 'sptep' are aliases for 'iterator.level' and 'iterator.sptep';
no need for them.
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Currently, when we fetch an spte, we only verify that gptes match those that
the walker saw if we build new shadow pages for them.
However, this misses the following race:
  vcpu1                 vcpu2

  walk
                        change gpte
                        walk
                        instantiate sp

  fetch existing sp
Fix by validating every gpte, regardless of whether it is used for building
a new sp or not.
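The revalidation might look like this (kvm_read_guest_atomic() is the
existing accessor; the surrounding names follow the walker code and are
illustrative):

    /* re-read the gpte and bail if it no longer matches the walk */
    if (kvm_read_guest_atomic(vcpu->kvm, pte_gpa, &curr_pte,
                              sizeof(curr_pte)) ||
        curr_pte != gw->ptes[level - 1])
            return NULL;    /* gpte changed under us; retry the fault */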
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Partition the function into three sections:
- fetching indirect shadow pages (host_level > guest_level)
- fetching direct shadow pages (page_level < host_level <= guest_level)
- the final spte (page_level == host_level)
Instead of the current spaghetti.
A slight change from the original code is that we call validate_direct_spte()
more often: previously we called it only for gw->level, now we also call it for
lower levels. The change should have no effect.
[xiao: fix regression caused by validate_direct_spte() called too late]
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Move the code to check whether a gpte has changed since we fetched it into
a helper.
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Add a helper to verify that a direct shadow page is valid wrt the required
access permissions; drop the page if it is not valid.
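A sketch of the helper (built from existing mmu primitives; treat it as
illustrative rather than the exact patch):

    static void validate_direct_spte(struct kvm_vcpu *vcpu, u64 *sptep,
                                     unsigned direct_access)
    {
            struct kvm_mmu_page *child;

            if (!is_shadow_present_pte(*sptep) || is_large_pte(*sptep))
                    return;
            child = page_header(*sptep & PT64_BASE_ADDR_MASK);
            if (child->role.access == direct_access)
                    return;
            /* wrong permissions: unlink so a correct sp can be fetched */
            mmu_page_remove_parent_pte(child, sptep);
            __set_spte(sptep, shadow_trap_nonpresent_pte);
    }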
Reviewed-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
To clarify spte fetching code, move large spte handling into a helper.
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
To avoid split accesses to 64 bit sptes on i386, use __set_spte() to link
shadow pages together.
(not technically required since shadow pages are __GFP_KERNEL, so upper 32
bits are always clear)
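For reference, __set_spte() boils down to (a sketch of the era's helper):

    static void __set_spte(u64 *sptep, u64 spte)
    {
    #ifdef CONFIG_X86_64
            *sptep = spte;
    #else
            /* single 64-bit store on i386, no split halves */
            set_64bit((unsigned long long *)sptep, spte);
    #endif
    }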
Reviewed-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
To simplify the process of fetching an spte, add a helper that links
a shadow page to an spte.
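A sketch of the helper (the flag composition mirrors the existing inline
code; illustrative):

    static void link_shadow_page(u64 *sptep, struct kvm_mmu_page *sp)
    {
            u64 spte = __pa(sp->spt) | PT_PRESENT_MASK | PT_ACCESSED_MASK |
                       PT_WRITABLE_MASK | PT_USER_MASK;

            __set_spte(sptep, spte);
    }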
Reviewed-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Devices register mask notifiers using a gsi, but the irqchip knows about
irqchip/pin pairs, so a conversion from irqchip/pin to gsi must be done
before looking up the mask notifier to call.
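Illustratively (kvm_fire_mask_notifiers() is the existing hook;
pin_to_gsi() is a hypothetical name for the routing-table lookup):

    /* hedged sketch: translate (irqchip, pin) to gsi before notifying */
    gsi = pin_to_gsi(kvm, irqchip, pin);    /* hypothetical lookup */
    if (gsi >= 0)
            kvm_fire_mask_notifiers(kvm, gsi, masked);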
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Userspace needs to reset and save/restore these MSRs.
The MCE banks are not exposed since their number varies from vcpu to vcpu.
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
When shadow pages are in use, KVM sometimes tries to emulate an instruction
when it accesses a shadowed page. If emulation fails, KVM un-shadows the
page and reenters the guest to allow the vcpu to execute the instruction.
If the page is not in the shadow page hash, KVM assumes that this was an
attempt to do MMIO and reports an emulation failure to userspace, since
there is no way to fix the situation. This logic has a race, though. If two
vcpus try to write to the same shadowed page simultaneously, both will
enter the emulator, but only one of them will find the page in the shadow
page hash, since the one that finds it also removes it from there, so the
other vcpu will report failure to userspace and abort the guest.
Fix this by checking (in addition to checking the shadow page hash) that
the page that caused the emulation belongs to a valid memory slot. If it
does, reenter the guest to allow the vcpu to reexecute the instruction.
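The extra check might look like this (gfn_to_pfn(), is_error_pfn() and
kvm_release_pfn_clean() exist; gva_to_gpa() stands in for the real
translation and is hypothetical):

    /* does the faulting address map to real guest RAM? */
    gpa = gva_to_gpa(vcpu, gva);            /* hypothetical translation */
    pfn = gfn_to_pfn(vcpu->kvm, gpa >> PAGE_SHIFT);
    if (!is_error_pfn(pfn)) {
            kvm_release_pfn_clean(pfn);
            return true;    /* backed by a memslot: reexecute in guest */
    }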
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Currently, if the guest accesses an address that belongs to a memory slot
but is not backed by a page, or the page is read-only, KVM treats it like
an MMIO access. Remove that capability. It was never part of the interface
and should not be relied upon.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Stanse found an omitted unlock in kvm_create_pit() in one failure path. Add
the proper unlock there.
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Cc: Gleb Natapov <gleb@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Gregory Haskins <ghaskins@novell.com>
Cc: kvm@vger.kernel.org
Signed-off-by: Avi Kivity <avi@redhat.com>
Real hardware disregards permission errors when computing page fault error
code bit 0 (page present). Do the same.
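Illustratively (PFERR_PRESENT_MASK is the real bit; the snippet is a
sketch):

    /* bit 0 reflects presence only, even on a permission fault */
    if (present)
            error_code |= PFERR_PRESENT_MASK;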
Reviewed-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Bit 4 of the page fault error code is set only if EFER.NX is set.
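Illustratively (is_nx() and PFERR_FETCH_MASK exist; the guard is the point
of the change):

    if (fetch_fault && is_nx(vcpu))
            error_code |= PFERR_FETCH_MASK;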
Reviewed-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
This patch changes 'mov AL, moffs' to be decoded with DstAcc and introduces
SrcAcc for decoding 'mov moffs, AL'.
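A sketch of the resulting decode-table entries (the flags are the
emulator's existing ones; exact entries illustrative):

    [0xA0] = ByteOp | DstAcc | SrcMem | Mov | MemAbs,  /* mov AL, moffs8 */
    [0xA2] = ByteOp | DstMem | SrcAcc | Mov | MemAbs,  /* mov moffs8, AL */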
Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
If the IOPL check fails, cli/sti emulation raises #GP, and then we should
skip writeback, since the default write op is OP_REG.
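A sketch close in shape to the emulator code (emulator_bad_iopl() and
emulate_gp() exist; the local names are illustrative):

    case 0xfa: /* cli */
            if (emulator_bad_iopl(ctxt, ops)) {
                    emulate_gp(ctxt, 0);
                    goto done;
            } else {
                    ctxt->eflags &= ~X86_EFLAGS_IF;
                    c->dst.type = OP_NONE;  /* skip default writeback */
            }
            break;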
Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
The source operand of 'mov rm,sreg' is a segment register, not a
general-purpose register, so remove SrcReg from the decoding.
Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
'and AL,imm8' should be marked ByteOp, otherwise the destination operand
length will not be correct and we may fill the full EAX on writeback.
Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Fix the comment for the out instruction, using the same style as the other
instructions.
Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
__set_spte() will happily replace an spte that has the accessed bit set
with one that has the accessed bit clear. Add a helper, update_spte(),
which checks for this condition and updates the page flag if needed.
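A sketch of the helper (nonatomic, as the later patches in this series
note; names are the mmu's own):

    static void update_spte(u64 *sptep, u64 new_spte)
    {
            u64 old_spte = *sptep;

            __set_spte(sptep, new_spte);
            if ((old_spte & shadow_accessed_mask) &&
                !(new_spte & shadow_accessed_mask))
                    kvm_set_pfn_accessed(spte_to_pfn(old_spte));
    }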
Signed-off-by: Avi Kivity <avi@redhat.com>
Currently, in the window between the check for the accessed bit and
actually dropping the spte, a vcpu can access the page through the spte and
set the bit, which will be ignored by the mmu.
Fix by using an exchange operation to atomically fetch the spte and drop it.
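The core of the fix, sketched (64-bit host shown; i386 needs an 8-byte
cmpxchg loop instead of a plain xchg):

    /* atomically fetch and clear, closing the race window */
    old_spte = xchg(sptep, shadow_trap_nonpresent_pte);
    if (old_spte & shadow_accessed_mask)
            kvm_set_pfn_accessed(spte_to_pfn(old_spte));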
Signed-off-by: Avi Kivity <avi@redhat.com>
When we call rmap_remove(), we (almost) always immediately follow it by
an __set_spte() to a nonpresent pte. Since we need to perform the two
operations atomically, to avoid losing the dirty and accessed bits, introduce
a helper drop_spte() and convert all call sites.
The operation is still nonatomic at this point.
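The helper, sketched (this is the straightforward combination the text
describes):

    static void drop_spte(struct kvm *kvm, u64 *sptep, u64 new_spte)
    {
            rmap_remove(kvm, sptep);
            __set_spte(sptep, new_spte);    /* still nonatomic here */
    }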
Signed-off-by: Avi Kivity <avi@redhat.com>
Commit 341d9b535b6c simplified the reload logic on guest entry; it avoids
an unnecessary sync-root when KVM_REQ_MMU_RELOAD and KVM_REQ_MMU_SYNC are
both set.
But it causes an issue: when we handle KVM_REQ_TLB_FLUSH, the root may be
invalid. This was triggered during my testing:
Kernel BUG at ffffffffa00212b8 [verbose debug info unavailable]
......
Fix by returning directly if the root is not ready.
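The shape of the fix (VALID_PAGE() is the existing macro):

    if (!VALID_PAGE(vcpu->arch.mmu.root_hpa))
            return;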
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
For 32-bit machines where the physical address width is larger than the
virtual address width, the frame number types in KVM may overflow. Fix this
by changing them to u64.
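Sketch of the type change in kvm_types.h (illustrative):

    /* hedged sketch: widen frame-number types to u64 */
    typedef u64 gfn_t;
    typedef u64 hfn_t;      /* was unsigned long on 32-bit */
    typedef hfn_t pfn_t;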
[sfr: fix build on 32-bit ppc]
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Allow mount options to be stored in the superblock. Also add default
mount option bits for nobarrier, block_validity, discard, and nodelalloc.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Some BIOSes will claim a large chunk of stolen space. Unless we
reclaim it, our aperture for remapping buffer objects will be
constrained. So clamp the stolen space to 32M and ignore the rest.
Fixes https://bugzilla.kernel.org/show_bug.cgi?id=15469 among others.
Adding the ignored stolen memory back into the general pool using the
memory hotplug code is left as an exercise for the reader.
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
Reviewed-by: Simon Farnsworth <simon.farnsworth@onelan.com>
Tested-by: Artem S. Tashkinov <t.artem@mailcity.com>
Signed-off-by: Eric Anholt <eric@anholt.net>
The docs warn that positioning the cursor such that no part of it is
visible on the pipe is an undefined operation. Avoid such circumstances
upon changing the mode, or at any other time, by unsetting the cursor if it
moves out of bounds.
"For normal high resolution display modes, the cursor must have at least a
single pixel positioned over the active screen.” (p143, p148 of the hardware
registers docs).
Fixes:
Bug 24748 - [965G] Graphics crashes when resolution is changed with KMS
enabled
https://bugs.freedesktop.org/show_bug.cgi?id=24748
v2: Only update the cursor registers if they change.
v3: Fix the unsigned comparison of x,y against width,height.
v4: Always set CUR.BASE or else the cursor may become corrupt.
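A sketch of the bounds check (signed arithmetic on purpose, per v3; field
and variable names are illustrative):

    /* disable the cursor when no pixel overlaps the active screen */
    if (x >= (int)mode->hdisplay || y >= (int)mode->vdisplay ||
        x <= -(int)cursor_width || y <= -(int)cursor_height)
            visible = false;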
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reported-by: Christian Eggers <ceggers@gmx.de>
Cc: Christopher James Halse Rogers <chalserogers@gmail.com>
Cc: stable@kernel.org
Signed-off-by: Eric Anholt <eric@anholt.net>
When creating an object, we create the handle by which it is known to the
process and which owns the reference to the object. That reference to the
new handle is what we want to transfer to the process, not the lost
reference to the object; so free the local object reference, *not* the
process's handle reference.
This brings i915_gem_object_create_ioctl() into line with
drm_gem_open_ioctl().
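The resulting pattern, sketched with the era's drm helpers:

    ret = drm_gem_handle_create(file_priv, obj, &handle);
    /* the handle owns its own reference; drop only the local one */
    drm_gem_object_unreference_unlocked(obj);
    if (ret)
            return ret;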
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Eric Anholt <eric@anholt.net>
If we fail to flush outstanding GPU writes but return the memory to the
system, we risk corrupting memory should the GPU recover and complete those
writes. On the other hand, if we bail early and free the object, then we
have a definite use-after-free and real memory corruption. Choose the
lesser of two evils: since in order to recover from the hung GPU we need to
completely reset it, those pending writes should never happen.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Eric Anholt <eric@anholt.net>
If during the freeing of an object the unbind is interrupted by a system
call, which is quite possible if we have outstanding GPU writes that
must be flushed, the unbind is silently aborted. This still leaves the
AGP region and backing pages allocated, and perhaps more importantly,
the object remains upon the various lists exposing us to memory
corruption.
I think this is the cause behind the use-after-free, such as
Bug 15664 - Graphics hang and kernel backtrace when starting Azureus
with Compiz enabled
https://bugzilla.kernel.org/show_bug.cgi?id=15664
v2: Daniel Vetter reminded me that kernel space programming is never easy.
We cannot simply spin to clear the pending signal and so must defer the
freeing of the object until later.
v3: Run from the top level retire requests.
v4: Tested with P(return -ERESTARTSYS)=.5 from i915_gem_do_wait_request()
v5: Rebase against Eric's for-linus tree.
v6: Refactor, split and add a comment about avoiding unbounded recursion.
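The deferral, sketched (deferred_free_list is the list this patch
introduces; the surrounding names are illustrative):

    ret = i915_gem_object_unbind(obj);
    if (ret == -ERESTARTSYS) {
            /* interrupted by a signal: free later, from retire_requests */
            list_move(&obj_priv->list, &dev_priv->mm.deferred_free_list);
            return;
    }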
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Daniel Vetter <daniel@ffwll.ch>
Signed-off-by: Eric Anholt <eric@anholt.net>
Combine the iteration over active render rings into a common function.
This is in preparation for reusing the idle function to also retire
deferred free requests.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Eric Anholt <eric@anholt.net>
Even though "we have enough padding that it should be ok", round up the
watermark entries to the next unit to be on the safe side...
v2: Use the DIV_ROUND_UP macro
v3: Spotted a few more missing round-ups.
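Illustratively, with the DIV_ROUND_UP macro from v2 (variable names are a
sketch):

    entries = clock * pixel_size * latency_ns / 1000000;
    entries = DIV_ROUND_UP(entries, cacheline_size); /* was plain division */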
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Signed-off-by: Eric Anholt <eric@anholt.net>
Apparently i830 and i845 cannot handle any stride that is not a multiple
of 256, unlike their brethren which do support 64 byte aligned strides.
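A sketch of the stride check (IS_I830()/IS_845G() are the existing platform
macros):

    if (IS_I830(dev) || IS_845G(dev))
            stride_align = 256;     /* i830/i845 quirk */
    else
            stride_align = 64;
    if (stride & (stride_align - 1))
            return false;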
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: stable@kernel.org
Signed-off-by: Eric Anholt <eric@anholt.net>