Commit graph

20,924 commits

Author SHA1 Message Date
Andy Lutomirski
a66734297f perf/x86: Add /sys/devices/cpu/rdpmc=2 to allow rdpmc for all tasks
While perfmon2 is a sufficiently evil library (it pokes MSRs
directly) that breaking it is fair game, it's still useful, so we
might as well try to support it.  This allows users to write 2 to
/sys/devices/cpu/rdpmc to disable all rdpmc protection so that hack
like perfmon2 can continue to work.

At some point, if perf_event becomes fast enough to replace
perfmon2, then this can go.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Vince Weaver <vince@deater.net>
Cc: "hillf.zj" <hillf.zj@alibaba-inc.com>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/caac3c1c707dcca48ecbc35f4def21495856f479.1414190806.git.luto@amacapital.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-04 12:10:49 +01:00
Andy Lutomirski
7911d3f7af perf/x86: Only allow rdpmc if a perf_event is mapped
We currently allow any process to use rdpmc.  This significantly
weakens the protection offered by PR_TSC_DISABLED, and it could be
helpful to users attempting to exploit timing attacks.

Since we can't enable access to individual counters, use a very
coarse heuristic to limit access to rdpmc: allow access only when
a perf_event is mmapped.  This protects seccomp sandboxes.

There is plenty of room to further tighen these restrictions.  For
example, this allows rdpmc for any x86_pmu event, but it's only
useful for self-monitoring tasks.

As a side effect, cap_user_rdpmc will now be false for AMD uncore
events.  This isn't a real regression, since .event_idx is disabled
for these events anyway for the time being.  Whenever that gets
re-added, the cap_user_rdpmc code can be adjusted or refactored
accordingly.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Vince Weaver <vince@deater.net>
Cc: "hillf.zj" <hillf.zj@alibaba-inc.com>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/a2bdb3cf3a1d70c26980d7c6dddfbaa69f3182bf.1414190806.git.luto@amacapital.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-04 12:10:47 +01:00
Andy Lutomirski
c1317ec2b9 perf: Pass the event to arch_perf_update_userpage()
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Vince Weaver <vince@deater.net>
Cc: "hillf.zj" <hillf.zj@alibaba-inc.com>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/0fea9a7fac3c1eea86cb0a5954184e74f4213666.1414190806.git.luto@amacapital.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-04 12:10:46 +01:00
Andy Lutomirski
22c4bd9fa9 x86: Add a comment clarifying LDT context switching
The code is correct, but only for a rather subtle reason.  This
confused me for quite a while when I read switch_mm, so clarify the
code to avoid confusing other people, too.

TBH, I wouldn't be surprised if this code was only correct by
accident.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Vince Weaver <vince@deater.net>
Cc: "hillf.zj" <hillf.zj@alibaba-inc.com>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/0db86397f968996fb772c443c251415b0b430ddd.1414190806.git.luto@amacapital.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-04 12:10:43 +01:00
Andy Lutomirski
1e02ce4ccc x86: Store a per-cpu shadow copy of CR4
Context switches and TLB flushes can change individual bits of CR4.
CR4 reads take several cycles, so store a shadow copy of CR4 in a
per-cpu variable.

To avoid wasting a cache line, I added the CR4 shadow to
cpu_tlbstate, which is already touched in switch_mm.  The heaviest
users of the cr4 shadow will be switch_mm and __switch_to_xtra, and
__switch_to_xtra is called shortly after switch_mm during context
switch, so the cacheline is likely to be hot.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Vince Weaver <vince@deater.net>
Cc: "hillf.zj" <hillf.zj@alibaba-inc.com>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/3a54dd3353fffbf84804398e00dfdc5b7c1afd7d.1414190806.git.luto@amacapital.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-04 12:10:42 +01:00
Andy Lutomirski
375074cc73 x86: Clean up cr4 manipulation
CR4 manipulation was split, seemingly at random, between direct
(write_cr4) and using a helper (set/clear_in_cr4).  Unfortunately,
the set_in_cr4 and clear_in_cr4 helpers also poke at the boot code,
which only a small subset of users actually wanted.

This patch replaces all cr4 access in functions that don't leave cr4
exactly the way they found it with new helpers cr4_set_bits,
cr4_clear_bits, and cr4_set_bits_and_update_boot.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Vince Weaver <vince@deater.net>
Cc: "hillf.zj" <hillf.zj@alibaba-inc.com>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/495a10bdc9e67016b8fd3945700d46cfd5c12c2f.1414190806.git.luto@amacapital.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-04 12:10:41 +01:00
Josh Poimboeuf
12cf89b550 livepatch: rename config to CONFIG_LIVEPATCH
Rename CONFIG_LIVE_PATCHING to CONFIG_LIVEPATCH to make the naming of
the config and the code more consistent.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reviewed-by: Jingoo Han <jg1.han@samsung.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2015-02-04 11:25:51 +01:00
Ingo Molnar
0967160ad6 Merge branch 'x86/asm' into perf/x86, to avoid conflicts with upcoming patches
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-04 09:01:12 +01:00
Ingo Molnar
8f4bf4bcc4 Linux 3.19-rc7
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJUzvgKAAoJEHm+PkMAQRiG8XQH/1qVbHI4pP0KcnzfZUHq/mXq
 RuS4aJMwLm/Y6cXFraXBDaPde1A3CPtwtpob2C6giKcfu2zXGunY65haOEeJWNpX
 lCbBsLkNC3oDNkygBpVr5Zd6yibaw63WBjjLnpAi7pn2G2Zm2zB8DfILWWWMb7yz
 MH8ZXV+/xIYCTkjNWGWA1iMjmdYqu0PQHPeOgLsYQ+u7rxfM1zb/wHEkjqUZS6iu
 IaaZv7PV2PnFYnqib/iIPYjAEDvSQ4vN/7b82zlFd2Culm9j/568KCCWUPhJTb2l
 X0u4QYs49GnMTWVRa3bgYxS/nTUaE/6DeWs2y2WzqTt0/XDntVUnok0blUeDxGk=
 =o2kS
 -----END PGP SIGNATURE-----

Merge tag 'v3.19-rc7' into perf/core, to merge fixes before applying new changes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-04 07:58:29 +01:00
Jan Beulich
75aaf4c3e6 x86/raid6: correctly check for assembler capabilities
Just like for AVX2 (which simply needs an #if -> #ifdef conversion),
SSSE3 assembler support should be checked for before using it.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Cc: Jim Kukunas <james.t.kukunas@linux.intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-04 08:35:51 +11:00
Rafael J. Wysocki
b2cd5dd71a Merge branch 'acpica' into acpi-resources 2015-02-03 22:27:01 +01:00
Wincy Van
705699a139 KVM: nVMX: Enable nested posted interrupt processing
If vcpu has a interrupt in vmx non-root mode, injecting that interrupt
requires a vmexit.  With posted interrupt processing, the vmexit
is not needed, and interrupts are fully taken care of by hardware.
In nested vmx, this feature avoids much more vmexits than non-nested vmx.

When L1 asks L0 to deliver L1's posted interrupt vector, and the target
VCPU is in non-root mode, we use a physical ipi to deliver POSTED_INTR_NV
to the target vCPU.  Using POSTED_INTR_NV avoids unexpected interrupts
if a concurrent vmexit happens and L1's vector is different with L0's.
The IPI triggers posted interrupt processing in the target physical CPU.

In case the target vCPU was not in guest mode, complete the posted
interrupt delivery on the next entry to L2.

Signed-off-by: Wincy Van <fanwenyi0529@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-02-03 17:15:08 +01:00
Wincy Van
608406e290 KVM: nVMX: Enable nested virtual interrupt delivery
With virtual interrupt delivery, the hardware lets KVM use a more
efficient mechanism for interrupt injection. This is an important feature
for nested VMX, because it reduces vmexits substantially and they are
much more expensive with nested virtualization.  This is especially
important for throughput-bound scenarios.

Signed-off-by: Wincy Van <fanwenyi0529@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-02-03 17:07:38 +01:00
Wincy Van
82f0dd4b27 KVM: nVMX: Enable nested apic register virtualization
We can reduce apic register virtualization cost with this feature,
it is also a requirement for virtual interrupt delivery and posted
interrupt processing.

Signed-off-by: Wincy Van <fanwenyi0529@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-02-03 17:07:03 +01:00
Wincy Van
b9c237bb1d KVM: nVMX: Make nested control MSRs per-cpu
To enable nested apicv support, we need per-cpu vmx
control MSRs:
  1. If in-kernel irqchip is enabled, we can enable nested
     posted interrupt, we should set posted intr bit in
     the nested_vmx_pinbased_ctls_high.
  2. If in-kernel irqchip is disabled, we can not enable
     nested posted interrupt, the posted intr bit
     in the nested_vmx_pinbased_ctls_high will be cleared.

Since there would be different settings about in-kernel
irqchip between VMs, different nested control MSRs
are needed.

Signed-off-by: Wincy Van <fanwenyi0529@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-02-03 17:06:51 +01:00
Wincy Van
f2b93280ed KVM: nVMX: Enable nested virtualize x2apic mode
When L2 is using x2apic, we can use virtualize x2apic mode to
gain higher performance, especially in apicv case.

This patch also introduces nested_vmx_check_apicv_controls
for the nested apicv patches.

Signed-off-by: Wincy Van <fanwenyi0529@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-02-03 17:06:17 +01:00
Wincy Van
3af18d9c5f KVM: nVMX: Prepare for using hardware MSR bitmap
Currently, if L1 enables MSR_BITMAP, we will emulate this feature, all
of L2's msr access is intercepted by L0.  Features like "virtualize
x2apic mode" require that the MSR bitmap is enabled, or the hardware
will exit and for example not virtualize the x2apic MSRs.  In order to
let L1 use these features, we need to build a merged bitmap that only
not cause a VMEXIT if 1) L1 requires that 2) the bit is not required by
the processor for APIC virtualization.

For now the guests are still run with MSR bitmap disabled, but this
patch already introduces nested_vmx_merge_msr_bitmap for future use.

Signed-off-by: Wincy Van <fanwenyi0529@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-02-03 17:02:32 +01:00
Ingo Molnar
b57c0b5175 x86: Entry cleanups and a bugfix for 3.20
This fixes a bug in the RCU code I added in ist_enter.  It also includes
 the sysret stuff discussed here:
 
 http://lkml.kernel.org/g/cover.1421453410.git.luto%40amacapital.net
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJUzhZ0AAoJEK9N98ZeDfrksUEH/j7wkUlMGan5h1AQIZQW6gKk
 OjlE1a4rfcgKocgkc0ix6UMc8Ks/NAUWKpeHR08eqR+Xi6Yk29cqLkboTEmAdYJ3
 jQvKjGu51kiprNjAGqF5wdqxvCT3oBSdm7CWdtY4zHkEr+2W93Ht9PM7xZhj4r+P
 ekUC8mIKQrhyhlC7g7VpXLAi3Bk4mO+f499T7XBVsVoywWpgVpOMYMhtUobV1reW
 V7/zul/dMerzNLB0t3amvdgCLphHBQTQ0fHBAN62RY78UvSDt36EZFyS65isirsR
 LhO4FpWzF5YNMRk8Dep/fB8jYlhsCi40ZIlOtGSE6kNJyLhPt+oLnkpgOwWAMQc=
 =uiRw
 -----END PGP SIGNATURE-----

Merge tag 'pr-20150201-x86-entry' of git://git.kernel.org/pub/scm/linux/kernel/git/luto/linux into x86/asm

Pull "x86: Entry cleanups and a bugfix for 3.20" from Andy Lutomirski:

 " This fixes a bug in the RCU code I added in ist_enter.  It also includes
   the sysret stuff discussed here:

     http://lkml.kernel.org/g/cover.1421453410.git.luto%40amacapital.net "

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-03 12:24:08 +01:00
Ingo Molnar
ad6e46869a x86, vdso: One trivial last-minute VDSO build improvement
Andrey noticed that the VDSO build wasn't cleaning itself up.  This
 one-liner fixes it.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJUzhiRAAoJEK9N98ZeDfrkpoIH/iS3dJ8IkWMhAcgYro87Hy+0
 8YVJjHj+5DgSb4+SSB7B6DXCrmzvVFEXGxz9RraPxshDKLHWNh2qiuy0XqsGwdf8
 Z7qkrbfcF1Dkp5RQG9vdkkHwwsh29Ky6R+1ewJVnPz/41Iy7G3uSI6QXLseytI2J
 myWSI/4yVEbL0CmKU+XXtV/oy1D6816vJSS9HG/Zm6WsO3FzR3NjPct1Wm6Br1/4
 hDEGWfKEUg5l0fztCfbCQJO450mNfp/bXEdsy5wPuJTahj3p4MRz/ZoDpLcvFhyL
 Zldofc2XmHlBKwzbPs/5fO0l3pxko6y0kFWcCWa7wMy7TLhwoXsuRqiSowCy4AQ=
 =cf7M
 -----END PGP SIGNATURE-----

Merge tag 'pr-20150201-x86-vdso' of git://git.kernel.org/pub/scm/linux/kernel/git/luto/linux into x86/asm

Pull VDSO fix fro Andy Lutomirski:

 "x86, vdso: One trivial last-minute VDSO build improvement

  Andrey noticed that the VDSO build wasn't cleaning itself up.  This
  one-liner fixes it."

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-03 12:22:47 +01:00
Ingo Molnar
8dbcb8737c Linux 3.19-rc7
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJUzvgKAAoJEHm+PkMAQRiG8XQH/1qVbHI4pP0KcnzfZUHq/mXq
 RuS4aJMwLm/Y6cXFraXBDaPde1A3CPtwtpob2C6giKcfu2zXGunY65haOEeJWNpX
 lCbBsLkNC3oDNkygBpVr5Zd6yibaw63WBjjLnpAi7pn2G2Zm2zB8DfILWWWMb7yz
 MH8ZXV+/xIYCTkjNWGWA1iMjmdYqu0PQHPeOgLsYQ+u7rxfM1zb/wHEkjqUZS6iu
 IaaZv7PV2PnFYnqib/iIPYjAEDvSQ4vN/7b82zlFd2Culm9j/568KCCWUPhJTb2l
 X0u4QYs49GnMTWVRa3bgYxS/nTUaE/6DeWs2y2WzqTt0/XDntVUnok0blUeDxGk=
 =o2kS
 -----END PGP SIGNATURE-----

Merge tag 'v3.19-rc7' into x86/asm, to refresh the branch before pulling in new changes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-03 12:22:18 +01:00
Stuart R. Anderson
ea9e9d8029 Specify PCI based UART for earlyprintk
Add support for specifying PCI based UARTs for earlyprintk
using a syntax like "earlyprintk=pciserial,00:18.1,115200",
where 00:18.1 is the BDF of a UART device.

[Slightly tidied from Stuart's original patch]
Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-02-02 10:11:27 -08:00
Andy Shevchenko
874e52086f x86, mrst: remove Moorestown specific serial drivers
Intel Moorestown platform support was removed few years ago. This is a follow
up which removes Moorestown specific code for the serial devices. It includes
mrst_max3110 and earlyprintk bits.

This was used on SFI (Medfield, Clovertrail) based platforms as well, though
new ones use normal serial interface for the console service.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: David Cohen <david.a.cohen@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-02-02 10:11:24 -08:00
Marcelo Tosatti
2e6d015799 KVM: x86: revert "add method to test PIR bitmap vector"
Revert 7c6a98dfa1, given
that testing PIR is not necessary anymore.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-02-02 18:36:34 +01:00
Marcelo Tosatti
f933986038 KVM: x86: fix lapic_timer_int_injected with APIC-v
With APICv, LAPIC timer interrupt is always delivered via IRR:
apic_find_highest_irr syncs PIR to IRR.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-02-02 18:36:25 +01:00
Charlotte Richardson
51ac3d2f0c PCI: Add NEC variants to Stratus ftServer PCIe DMI check
NEC OEMs the same platforms as Stratus does, which have multiple devices on
some PCIe buses under downstream ports.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=51331
Fixes: 1278998f8f ("PCI: Work around Stratus ftServer broken PCIe hierarchy (fix DMI check)")
Signed-off-by: Charlotte Richardson <charlotte.richardson@stratus.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
CC: stable@vger.kernel.org	# v3.5+
CC: Myron Stowe <myron.stowe@redhat.com>
2015-02-02 09:36:23 -06:00
Andy Lutomirski
96b6352c12 x86_64, entry: Remove the syscall exit audit and schedule optimizations
We used to optimize rescheduling and audit on syscall exit.  Now
that the full slow path is reasonably fast, remove these
optimizations.  Syscall exit auditing is now handled exclusively by
syscall_trace_leave.

This adds something like 10ns to the previously optimized paths on
my computer, presumably due mostly to SAVE_REST / RESTORE_REST.

I think that we should eventually replace both the syscall and
non-paranoid interrupt exit slow paths with a pair of C functions
along the lines of the syscall entry hooks.

Link: http://lkml.kernel.org/r/22f2aa4a0361707a5cfb1de9d45260b39965dead.1421453410.git.luto@amacapital.net
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
2015-02-01 04:03:02 -08:00
Andy Lutomirski
2a23c6b8a9 x86_64, entry: Use sysret to return to userspace when possible
The x86_64 entry code currently jumps through complex and
inconsistent hoops to try to minimize the impact of syscall exit
work.  For a true fast-path syscall, almost nothing needs to be
done, so returning is just a check for exit work and sysret.  For a
full slow-path return from a syscall, the C exit hook is invoked if
needed and we join the iret path.

Using iret to return to userspace is very slow, so the entry code
has accumulated various special cases to try to do certain forms of
exit work without invoking iret.  This is error-prone, since it
duplicates assembly code paths, and it's dangerous, since sysret
can malfunction in interesting ways if used carelessly.  It's
also inefficient, since a lot of useful cases aren't optimized
and therefore force an iret out of a combination of paranoia and
the fact that no one has bothered to write even more asm code
to avoid it.

I would argue that this approach is backwards.  Rather than trying
to avoid the iret path, we should instead try to make the iret path
fast.  Under a specific set of conditions, iret is unnecessary.  In
particular, if RIP==RCX, RFLAGS==R11, RIP is canonical, RF is not
set, and both SS and CS are as expected, then
movq 32(%rsp),%rsp;sysret does the same thing as iret.  This set of
conditions is nearly always satisfied on return from syscalls, and
it can even occasionally be satisfied on return from an irq.

Even with the careful checks for sysret applicability, this cuts
nearly 80ns off of the overhead from syscalls with unoptimized exit
work.  This includes tracing and context tracking, and any return
that invokes KVM's user return notifier.  For example, the cost of
getpid with CONFIG_CONTEXT_TRACKING_FORCE=y drops from ~360ns to
~280ns on my computer.

This may allow the removal and even eventual conversion to C
of a respectable amount of exit asm.

This may require further tweaking to give the full benefit on Xen.

It may be worthwhile to adjust signal delivery and exec to try hit
the sysret path.

This does not optimize returns to 32-bit userspace.  Making the same
optimization for CS == __USER32_CS is conceptually straightforward,
but it will require some tedious code to handle the differences
between sysretl and sysexitl.

Link: http://lkml.kernel.org/r/71428f63e681e1b4aa1a781e3ef7c27f027d1103.1421453410.git.luto@amacapital.net
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
2015-02-01 04:03:01 -08:00
Andy Lutomirski
b926e6f61a x86, traps: Fix ist_enter from userspace
context_tracking_user_exit() has no effect if in_interrupt() returns true,
so ist_enter() didn't work.  Fix it by calling exception_enter(), and thus
context_tracking_user_exit(), before incrementing the preempt count.

This also adds an assertion that will catch the problem reliably if
CONFIG_PROVE_RCU=y to help prevent the bug from being reintroduced.

Link: http://lkml.kernel.org/r/261ebee6aee55a4724746d0d7024697013c40a08.1422709102.git.luto@amacapital.net
Fixes: 9592747538 x86, traps: Track entry into and exit from IST context
Reported-and-tested-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
2015-02-01 04:02:53 -08:00
Linus Torvalds
6155bc1431 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "Mostly tooling fixes, but also an event groups fix, two PMU driver
  fixes and a CPU model variant addition"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf: Tighten (and fix) the grouping condition
  perf/x86/intel: Add model number for Airmont
  perf/rapl: Fix crash in rapl_scale()
  perf/x86/intel/uncore: Move uncore_box_init() out of driver initialization
  perf probe: Fix probing kretprobes
  perf symbols: Introduce 'for' method to iterate over the symbols with a given name
  perf probe: Do not rely on map__load() filter to find symbols
  perf symbols: Introduce method to iterate symbols ordered by name
  perf symbols: Return the first entry with a given name in find_by_name method
  perf annotate: Fix memory leaks in LOCK handling
  perf annotate: Handle ins parsing failures
  perf scripting perl: Force to use stdbool
  perf evlist: Remove extraneous 'was' on error message
2015-01-30 14:34:55 -08:00
Linus Torvalds
1f59fe7667 The ARM changes are largish, but not too scary. And a simple fix
for x86 (bug introduced in 3.19).
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQEcBAABAgAGBQJUy2ulAAoJEL/70l94x66D18kIAJhuh2k5Mt3TfP/zfhi2Y6ER
 IAZqyFODs8txZ3v432PB8yWWvr2XfJ3gwfjvurLygQJ3jCGZqDrmucbUUXzEaPUk
 mPnLpxV0ZEmNweS2HLGPX9HJ6zfsZ1dHRk55Tko9ynAO731q7yPjj6HC0th8wzvE
 BRv5y/18rY2zyar+5Azpj5wpOSllq0ynMgjWXGSlaTLbQoyvgZtzbqNY6nsAGrKw
 e8hSUPogfGUmZkBHHHVDYKpgHvWS1hARyuGFo8LeKXKPo7qhYxZHCDpch8TXnq2y
 21IvQfYddGpcMsaTroA5qyXFigxCX+1j3po6MS3ZH9GGXS5fC3sI8t0EDxKiO6Q=
 =O4X0
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM fixes from Paolo Bonzini:
 "The ARM changes are largish, but not too scary.  And a simple fix for
  x86 (bug introduced in 3.19)"

(Paolo sayus these are the "Final" fixes. We'll see).

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: x86: check LAPIC presence when building apic_map
  arm/arm64: KVM: Use kernel mapping to perform invalidation on page fault
  arm/arm64: KVM: Invalidate data cache on unmap
  arm/arm64: KVM: Use set/way op trapping to track the state of the caches
2015-01-30 10:45:24 -08:00
Paolo Bonzini
ad15a29647 kvm: vmx: fix oops with explicit flexpriority=0 option
A function pointer was not NULLed, causing kvm_vcpu_reload_apic_access_page to
go down the wrong path and OOPS when doing put_page(NULL).

This did not happen on old processors, only when setting the module option
explicitly.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-01-30 16:18:49 +01:00
Radim Krčmář
df04d1d191 KVM: x86: check LAPIC presence when building apic_map
We forgot to re-check LAPIC after splitting the loop in commit
173beedc16 (KVM: x86: Software disabled APIC should still deliver
NMIs, 2014-11-02).

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Fixes: 173beedc16
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-01-30 12:28:31 +01:00
Radim Krčmář
8a395363e2 KVM: x86: fix x2apic logical address matching
We cannot hit the bug now, but future patches will expose this path.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-01-30 12:26:46 +01:00
Radim Krčmář
3697f302ab KVM: x86: replace 0 with APIC_DEST_PHYSICAL
To make the code self-documenting.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-01-30 12:26:46 +01:00
Radim Krčmář
9368b56762 KVM: x86: cleanup kvm_apic_match_*()
The majority of this patch turns
  result = 0; if (CODE) result = 1; return result;
into
  return CODE;
because we return bool now.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-01-30 12:26:45 +01:00
Radim Krčmář
52c233a440 KVM: x86: return bool from kvm_apic_match*()
And don't export the internal ones while at it.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-01-30 12:26:45 +01:00
Kai Huang
843e433057 KVM: VMX: Add PML support in VMX
This patch adds PML support in VMX. A new module parameter 'enable_pml' is added
to allow user to enable/disable it manually.

Signed-off-by: Kai Huang <kai.huang@linux.intel.com>
Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-01-30 09:39:54 +01:00
Linus Torvalds
33692f2759 vm: add VM_FAULT_SIGSEGV handling support
The core VM already knows about VM_FAULT_SIGBUS, but cannot return a
"you should SIGSEGV" error, because the SIGSEGV case was generally
handled by the caller - usually the architecture fault handler.

That results in lots of duplication - all the architecture fault
handlers end up doing very similar "look up vma, check permissions, do
retries etc" - but it generally works.  However, there are cases where
the VM actually wants to SIGSEGV, and applications _expect_ SIGSEGV.

In particular, when accessing the stack guard page, libsigsegv expects a
SIGSEGV.  And it usually got one, because the stack growth is handled by
that duplicated architecture fault handler.

However, when the generic VM layer started propagating the error return
from the stack expansion in commit fee7e49d45 ("mm: propagate error
from stack expansion even for guard page"), that now exposed the
existing VM_FAULT_SIGBUS result to user space.  And user space really
expected SIGSEGV, not SIGBUS.

To fix that case, we need to add a VM_FAULT_SIGSEGV, and teach all those
duplicate architecture fault handlers about it.  They all already have
the code to handle SIGSEGV, so it's about just tying that new return
value to the existing code, but it's all a bit annoying.

This is the mindless minimal patch to do this.  A more extensive patch
would be to try to gather up the mostly shared fault handling logic into
one generic helper routine, and long-term we really should do that
cleanup.

Just from this patch, you can generally see that most architectures just
copied (directly or indirectly) the old x86 way of doing things, but in
the meantime that original x86 model has been improved to hold the VM
semaphore for shorter times etc and to handle VM_FAULT_RETRY and other
"newer" things, so it would be a good idea to bring all those
improvements to the generic case and teach other architectures about
them too.

Reported-and-tested-by: Takashi Iwai <tiwai@suse.de>
Tested-by: Jan Engelhardt <jengelh@inai.de>
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> # "s390 still compiles and boots"
Cc: linux-arch@vger.kernel.org
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-01-29 10:51:32 -08:00
Kai Huang
88178fd4f7 KVM: x86: Add new dirty logging kvm_x86_ops for PML
This patch adds new kvm_x86_ops dirty logging hooks to enable/disable dirty
logging for particular memory slot, and to flush potentially logged dirty GPAs
before reporting slot->dirty_bitmap to userspace.

kvm x86 common code calls these hooks when they are available so PML logic can
be hidden to VMX specific. SVM won't be impacted as these hooks remain NULL
there.

Signed-off-by: Kai Huang <kai.huang@linux.intel.com>
Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-01-29 15:31:41 +01:00
Kai Huang
1c91cad423 KVM: x86: Change parameter of kvm_mmu_slot_remove_write_access
This patch changes the second parameter of kvm_mmu_slot_remove_write_access from
'slot id' to 'struct kvm_memory_slot *' to align with kvm_x86_ops dirty logging
hooks, which will be introduced in further patch.

Better way is to change second parameter of kvm_arch_commit_memory_region from
'struct kvm_userspace_memory_region *' to 'struct kvm_memory_slot * new', but it
requires changes on other non-x86 ARCH too, so avoid it now.

Signed-off-by: Kai Huang <kai.huang@linux.intel.com>
Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-01-29 15:31:37 +01:00
Kai Huang
9b51a63024 KVM: MMU: Explicitly set D-bit for writable spte.
This patch avoids unnecessary dirty GPA logging to PML buffer in EPT violation
path by setting D-bit manually prior to the occurrence of the write from guest.

We only set D-bit manually in set_spte, and leave fast_page_fault path
unchanged, as fast_page_fault is very unlikely to happen in case of PML.

For the hva <-> pa change case, the spte is updated to either read-only (host
pte is read-only) or be dropped (host pte is writeable), and both cases will be
handled by above changes, therefore no change is necessary.

Signed-off-by: Kai Huang <kai.huang@linux.intel.com>
Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-01-29 15:31:33 +01:00
Kai Huang
f4b4b18086 KVM: MMU: Add mmu help functions to support PML
This patch adds new mmu layer functions to clear/set D-bit for memory slot, and
to write protect superpages for memory slot.

In case of PML, CPU logs the dirty GPA automatically to PML buffer when CPU
updates D-bit from 0 to 1, therefore we don't have to write protect 4K pages,
instead, we only need to clear D-bit in order to log that GPA.

For superpages, we still write protect it and let page fault code to handle
dirty page logging, as we still need to split superpage to 4K pages in PML.

As PML is always enabled during guest's lifetime, to eliminate unnecessary PML
GPA logging, we set D-bit manually for the slot with dirty logging disabled.

Signed-off-by: Kai Huang <kai.huang@linux.intel.com>
Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-01-29 15:31:29 +01:00
Kai Huang
3b0f1d01e5 KVM: Rename kvm_arch_mmu_write_protect_pt_masked to be more generic for log dirty
We don't have to write protect guest memory for dirty logging if architecture
supports hardware dirty logging, such as PML on VMX, so rename it to be more
generic.

Signed-off-by: Kai Huang <kai.huang@linux.intel.com>
Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-01-29 15:30:38 +01:00
Ingo Molnar
6d84d1d130 One final fix for 3.19 to address a wrongful deregistering of the
microcode loader module.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJUyMQuAAoJEBLB8Bhh3lVKoUcP/2NW+guS0v8/X7vBtO0R1TR6
 zhjXKOcTIHlM/IBES/vszxKnq/jiiq1X+ZFwucx644RxmPo5fS0+zGIoNNchQgb2
 g4Y9RChANXUG66CkMW5zhjOdu6XTuLuJnoY9B3cw8Ro7lTIj6XfYq60Is/U1avJX
 l7d99uTEt2mUGjlvYXnpSyFwOWiMOFDruQ9wtXFJ4BfJXKsObe6NHcQYzaY2Tu93
 xYMbakh3i+EmVhA4gmhtYlpm6LAXZ21vdSEOselfulgoyQm/SaU2/BGJ384RNNmC
 AdfLTM9qRfxUwUbA/jXak6YUDca3RznPPcBSyYhssLJkUvx8q4D6/CXIlh4ygPWr
 j2fc3gJt2KXzZzUvMx5MYMSyCtGm7Whx4XMLXkZBrRQK0TKwpTFqHReL/bY7nQHC
 iq22AloRA49rPo7GFYYm6xPOTCZUVo9VlVRIcAVqcIgjtkutwmwFyoVmuSrFlnpg
 tDQcG8pexxtmbbRHdlIYpN+BeKNikA0y+aiyoP8SSn0D3dduAnQ4lKZazE+i+fnT
 /hMz9eJVjk0ccCaCHC/gyLOgBWJlLUyfYz7nfCvQE4dKMTmyDJZZE1hH9Jr1OPQW
 zmTge8KqRtXbFNqnfNEE3UK/oBSuD45kx/oSa7BLlzZCjyVsfa1xjhv3rJFw0gHc
 TeMp8vkcTVgdX4EONupN
 =vHs7
 -----END PGP SIGNATURE-----

Merge tag 'microcode_fix_for_3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp into x86/urgent

Pull microcode fix from Borislav Petkov:

 "One final fix for 3.19 to address a wrongful deregistering of the
  microcode loader module."

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-01-29 07:51:20 +01:00
Andrey Skvortsov
050835e9d3 x86, vdso: teach 'make clean' remove vdso64 binaries
After 'make clean' vdso64.so and vdso64.dbg.so were left in arch/x86/vdso/.

Link: http://lkml.kernel.org/r/1422453867-17326-1-git-send-email-andrej.skvortzov@gmail.com
Signed-off-by: Andrey Skvortsov <andrej.skvortzov@gmail.com>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
2015-01-28 18:44:18 -08:00
Yijing Wang
6a878e5085 PCI: Fail MSI-X mappings if there's no space assigned to MSI-X BAR
Unlike MSI, which is configured via registers in the MSI capability in
Configuration Space, MSI-X is configured via tables in Memory Space.
These MSI-X tables are mapped by a device BAR, and if no Memory Space
has been assigned to the BAR, MSI-X cannot be used.

Fail MSI-X setup if no space has been assigned for the BAR.

Previously, we ioremapped the MSI-X table even if the resource hadn't been
assigned.  In this case, the resource address is undefined (and is often
zero), which may lead to warnings or oopses in this path:

  pci_enable_msix
    msix_capability_init
      msix_map_region
        ioremap_nocache

The PCI core sets resource flags to zero when it can't assign space for the
resource (see reset_resource()).  There are also some cases where it sets
the IORESOURCE_UNSET flag, e.g., pci_reassigndev_resource_alignment(),
pci_assign_resource(), etc.  So we must check for both cases.

[bhelgaas: changelog]
Reported-by: Zhang Jukuo <zhangjukuo@huawei.com>
Tested-by: Zhang Jukuo <zhangjukuo@huawei.com>
Signed-off-by: Yijing Wang <wangyijing@huawei.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2015-01-28 09:25:57 -06:00
Ingo Molnar
b3890e4704 Merge branch 'perf/hw_breakpoints' into perf/core
The new hw_breakpoint bits are now ready for v3.20, merge them
into the main branch, to avoid conflicts.

Conflicts:
	tools/perf/Documentation/perf-record.txt

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-01-28 15:48:59 +01:00
Ingo Molnar
772a9aca12 This is my accumulated x86 entry work, part 1, for 3.20. The meat
of this is an IST rework.  When an IST exception interrupts user
 space, we will handle it on the per-thread kernel stack instead of
 on the IST stack.  This sounds messy, but it actually simplifies the
 IST entry/exit code, because it eliminates some ugly games we used
 to play in order to handle rescheduling, signal delivery, etc on the
 way out of an IST exception.
 
 The IST rework introduces proper context tracking to IST exception
 handlers.  I haven't seen any bug reports, but the old code could
 have incorrectly treated an IST exception handler as an RCU extended
 quiescent state.
 
 The memory failure change (included in this pull request with
 Borislav and Tony's permission) eliminates a bunch of code that
 is no longer needed now that user memory failure handlers are
 called in process context.
 
 Finally, this includes a few on Denys' uncontroversial and Obviously
 Correct (tm) cleanups.
 
 The IST and memory failure changes have been in -next for a while.
 
 LKML references:
 
 IST rework:
 http://lkml.kernel.org/r/cover.1416604491.git.luto@amacapital.net
 
 Memory failure change:
 http://lkml.kernel.org/r/54ab2ffa301102cd6e@agluck-desk.sc.intel.com
 
 Denys' cleanups:
 http://lkml.kernel.org/r/1420927210-19738-1-git-send-email-dvlasenk@redhat.com
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJUtvkFAAoJEK9N98ZeDfrkcfsIAJxZ0UBUCEDvulbqgk/iPGOa
 fIpKLMowS7CpKtw6Wdc/YvAIkeHXWm1vU44Hj0TrjSrXCgVF8yCngs/xlXtOjoa1
 dosXQqgqVJJ+hyui7chAEWyalLW7bEO8raq/6snhiMrhiuEkVKpEr7Fer4FVVCZL
 4VALmNQQsbV+Qq4pXIhuagZC0Nt/XKi/+/cKvhS4p//q1F/TbHTz0FpDUrh0jPMh
 18WFy0jWgxdkMRnSp/wJhekvdXX6PwUy5BdES9fjw8LQJZxxFpqN3Fe1kgfyzV0k
 yuvEHw1hPt2aBGj3q69wQvDVyyn4OqMpRDBhk4S+GJYmVh7mFyFMN4BDMEy/EY8=
 =LXVl
 -----END PGP SIGNATURE-----

Merge tag 'pr-20150114-x86-entry' of git://git.kernel.org/pub/scm/linux/kernel/git/luto/linux into x86/asm

Pull x86/entry enhancements from Andy Lutomirski:

" This is my accumulated x86 entry work, part 1, for 3.20.  The meat
  of this is an IST rework.  When an IST exception interrupts user
  space, we will handle it on the per-thread kernel stack instead of
  on the IST stack.  This sounds messy, but it actually simplifies the
  IST entry/exit code, because it eliminates some ugly games we used
  to play in order to handle rescheduling, signal delivery, etc on the
  way out of an IST exception.

  The IST rework introduces proper context tracking to IST exception
  handlers.  I haven't seen any bug reports, but the old code could
  have incorrectly treated an IST exception handler as an RCU extended
  quiescent state.

  The memory failure change (included in this pull request with
  Borislav and Tony's permission) eliminates a bunch of code that
  is no longer needed now that user memory failure handlers are
  called in process context.

  Finally, this includes a few on Denys' uncontroversial and Obviously
  Correct (tm) cleanups.

  The IST and memory failure changes have been in -next for a while.

  LKML references:

  IST rework:
  http://lkml.kernel.org/r/cover.1416604491.git.luto@amacapital.net

  Memory failure change:
  http://lkml.kernel.org/r/54ab2ffa301102cd6e@agluck-desk.sc.intel.com

  Denys' cleanups:
  http://lkml.kernel.org/r/1420927210-19738-1-git-send-email-dvlasenk@redhat.com
"

This tree semantically depends on and is based on the following RCU commit:

  734d168013 ("rcu: Make rcu_nmi_enter() handle nesting")

... and for that reason won't be pushed upstream before the RCU bits hit Linus's tree.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-01-28 15:33:26 +01:00
Ingo Molnar
41ca5d4e9b Merge commit 3669ef9fa7 ("x86, tls: Interpret an all-zero struct user_desc as 'no segment'") into x86/asm
Pick up the latestest asm fixes before advancing it any further.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-01-28 15:30:32 +01:00
Jennifer Herbert
8da7633f16 xen: mark grant mapped pages as foreign
Use the "foreign" page flag to mark pages that have a grant map.  Use
page->private to store information of the grant (the granting domain
and the grant reference).

Signed-off-by: Jennifer Herbert <jennifer.herbert@citrix.com>
Reviewed-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2015-01-28 14:03:12 +00:00