Commit graph

22,795 commits

Author SHA1 Message Date
Xiao Guangrong
d625b155d2 KVM: MMU: introduce is_shadow_zero_bits_set()
We have the same data struct to check reserved bits on guest page tables
and shadow page tables, split is_rsvd_bits_set() so that the logic can be
shared between these two paths

Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-08-05 12:47:25 +02:00
Xiao Guangrong
c258b62b26 KVM: MMU: introduce the framework to check zero bits on sptes
We have abstracted the data struct and functions which are used to check
reserved bit on guest page tables, now we extend the logic to check
zero bits on shadow page tables

The zero bits on sptes include not only reserved bits on hardware but also
the bits that SPTEs willnever use.  For example, shadow pages will never
use GB pages unless the guest uses them too.

Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-08-05 12:47:24 +02:00
Xiao Guangrong
81b8eebbc3 KVM: MMU: split reset_rsvds_bits_mask_ept
Since shadow ept page tables and Intel nested guest page tables have the
same format, split reset_rsvds_bits_mask_ept so that the logic can be
reused by later patches which check zero bits on sptes

Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-08-05 12:47:24 +02:00
Xiao Guangrong
6dc98b868b KVM: MMU: split reset_rsvds_bits_mask
Since softmmu & AMD nested shadow page tables and guest page tables have
the same format, split reset_rsvds_bits_mask so that the logic can be
reused by later patches which check zero bits on sptes

Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-08-05 12:47:23 +02:00
Xiao Guangrong
a0a64f50aa KVM: MMU: introduce rsvd_bits_validate
These two fields, rsvd_bits_mask and bad_mt_xwr, in "struct kvm_mmu" are
used to check if reserved bits set on guest ptes, move them to a data
struct so that the approach can be applied to check host shadow page
table entries as well

Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-08-05 12:47:23 +02:00
Xiao Guangrong
d2b0f98125 KVM: MMU: move FNAME(is_rsvd_bits_set) to mmu.c
FNAME(is_rsvd_bits_set) does not depend on guest mmu mode, move it
to mmu.c to stop being compiled multiple times

Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-08-05 12:47:22 +02:00
Xiao Guangrong
6f691251c0 KVM: MMU: fix validation of mmio page fault
We got the bug that qemu complained with "KVM: unknown exit, hardware
reason 31" and KVM shown these info:
[84245.284948] EPT: Misconfiguration.
[84245.285056] EPT: GPA: 0xfeda848
[84245.285154] ept_misconfig_inspect_spte: spte 0x5eaef50107 level 4
[84245.285344] ept_misconfig_inspect_spte: spte 0x5f5fadc107 level 3
[84245.285532] ept_misconfig_inspect_spte: spte 0x5141d18107 level 2
[84245.285723] ept_misconfig_inspect_spte: spte 0x52e40dad77 level 1

This is because we got a mmio #PF and the handler see the mmio spte becomes
normal (points to the ram page)

However, this is valid after introducing fast mmio spte invalidation which
increases the generation-number instead of zapping mmio sptes, a example
is as follows:
1. QEMU drops mmio region by adding a new memslot
2. invalidate all mmio sptes
3.

        VCPU 0                        VCPU 1
    access the invalid mmio spte
                            access the region originally was MMIO before
                            set the spte to the normal ram map

    mmio #PF
    check the spte and see it becomes normal ram mapping !!!

This patch fixes the bug just by dropping the check in mmio handler, it's
good for backport. Full check will be introduced in later patches

Reported-by: Pavel Shirshov <ru.pchel@gmail.com>
Tested-by: Pavel Shirshov <ru.pchel@gmail.com>
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-08-05 12:47:21 +02:00
Alex Williamson
9c33ae0c61 KVM: MTRR: Use default type for non-MTRR-covered gfn before WARN_ON
The patch was munged on commit to re-order these tests resulting in
excessive warnings when trying to do device assignment.  Return to
original ordering: https://lkml.org/lkml/2015/7/15/769

Fixes: 3e5d2fdced ("KVM: MTRR: simplify kvm_mtrr_get_guest_memory_type")
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-08-05 12:47:21 +02:00
Alex Williamson
fc1a8126bf KVM: MTRR: Use default type for non-MTRR-covered gfn before WARN_ON
The patch was munged on commit to re-order these tests resulting in
excessive warnings when trying to do device assignment.  Return to
original ordering: https://lkml.org/lkml/2015/7/15/769

Fixes: 3e5d2fdced ("KVM: MTRR: simplify kvm_mtrr_get_guest_memory_type")
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-08-05 11:57:57 +02:00
Andy Lutomirski
88cd622f92 x86/entry: Remove do_notify_resume(), syscall_trace_leave(), and their TIF masks
They are no longer used. Good riddance!

Deleting the TIF_ macros is really nice. It was never clear why
there were so many variants.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eric Paris <eparis@parisplace.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/22c61682f446628573dde0f1d573ab821677e06da.1438378274.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-05 10:54:35 +02:00
Andy Lutomirski
5d73fc7099 x86/entry/32: Migrate to C exit path
This removes the hybrid asm-and-C implementation of exit work.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eric Paris <eparis@parisplace.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/2baa438619ea6c027b40ec9fceacca52f09c74d09.1438378274.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-05 10:54:35 +02:00
Andy Lutomirski
c5f69fde26 x86/entry/32: Remove 32-bit syscall audit optimizations
The asm audit optimizations are ugly and obfuscate the code too
much. Remove them.

This will regress performance if syscall auditing is enabled on
32-bit kernels and SYSENTER is in use. If this becomes a
problem, interested parties are encouraged to implement the
equivalent of the 64-bit opportunistic SYSRET optimization.

Alternatively, a case could be made that, on 32-bit kernels, a
less messy asm audit optimization could be done. 32-bit kernels
don't have the complicated partial register saving tricks that
64-bit kernels have, so the SYSENTER post-syscall path could
just call the audit hooks directly.  Any reimplementation of
this ought to demonstrate that it only calls the audit hook once
per syscall, though, which does not currently appear to be true.

Someone would have to make the case that doing so would be
better than implementing opportunistic SYSEXIT, though.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eric Paris <eparis@parisplace.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/212be39dd8c90b44c4b7bbc678128d6b88bdb9912.1438378274.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-05 10:54:35 +02:00
Denys Vlasenko
d14edb1648 x86/hweight: Force inlining of __arch_hweight{32,64}()
With this config:

  http://busybox.net/~vda/kernel_config_OPTIMIZE_INLINING_and_Os

gcc-4.7.2 generates many copies of these tiny functions:

	__arch_hweight32 (35 copies):
	55                      push   %rbp
	e8 66 9b 4a 00          callq  __sw_hweight32
	48 89 e5                mov    %rsp,%rbp
	5d                      pop    %rbp
	c3                      retq

	__arch_hweight64 (8 copies):
	55                      push   %rbp
	e8 5e c2 8a 00          callq  __sw_hweight64
	48 89 e5                mov    %rsp,%rbp
	5d                      pop    %rbp
	c3                      retq

See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66122

This patch fixes this via s/inline/__always_inline/

To avoid touching 32-bit case where such change was not tested
to be a win, reformat __arch_hweight64() to have completely
disjoint 64-bit and 32-bit implementations. IOW: made #ifdef /
32 bits and 64 bits instead of having #ifdef / #else / #endif
inside a single function body. Only 64-bit __arch_hweight64() is
__always_inline'd.

	    text     data      bss       dec  filename
	86971120 17195912 36659200 140826232  vmlinux.before
	86970954 17195912 36659200 140826066  vmlinux

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas Graf <tgraf@suug.ch>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/1438697716-28121-2-git-send-email-dvlasenk@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-05 09:38:09 +02:00
Denis V. Lunev
cc2dd4027a mshyperv: fix recognition of Hyper-V guest crash MSR's
Hypervisor Top Level Functional Specification v3.1/4.0 notes that cpuid
(0x40000003) EDX's 10th bit should be used to check that Hyper-V guest
crash MSR's functionality available.

This patch should fix this recognition. Currently the code checks EAX
register instead of EDX.

Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-04 22:30:44 -07:00
Vitaly Kuznetsov
b4370df2b1 Drivers: hv: vmbus: add special crash handler
Full kernel hang is observed when kdump kernel starts after a crash. This
hang happens in vmbus_negotiate_version() function on
wait_for_completion() as Hyper-V host (Win2012R2 in my testing) never
responds to CHANNELMSG_INITIATE_CONTACT as it thinks the connection is
already established. We need to perform some mandatory minimalistic
cleanup before we start new kernel.

Reported-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-04 22:28:38 -07:00
Vitaly Kuznetsov
2517281d63 Drivers: hv: vmbus: add special kexec handler
When general-purpose kexec (not kdump) is being performed in Hyper-V guest
the newly booted kernel fails with an MCE error coming from the host. It
is the same error which was fixed in the "Drivers: hv: vmbus: Implement
the protocol for tearing down vmbus state" commit - monitor pages remain
special and when they're being written to (as the new kernel doesn't know
these pages are special) bad things happen. We need to perform some
minimalistic cleanup before booting a new kernel on kexec. To do so we
need to register a special machine_ops.shutdown handler to be executed
before the native_machine_shutdown(). Registering a shutdown notification
handler via the register_reboot_notifier() call is not sufficient as it
happens to early for our purposes. machine_ops is not being exported to
modules (and I don't think we want to export it) so let's do this in
mshyperv.c

The minimalistic cleanup consists of cleaning up clockevents, synic MSRs,
guest os id MSR, and hypercall MSR.

Kdump doesn't require all this stuff as it lives in a separate memory
space.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-04 22:25:29 -07:00
Bjorn Helgaas
5a4f3cf0d1 Merge branches 'pci/irq', 'pci/misc', 'pci/resource' and 'pci/virtualization' into next
* pci/irq:
  PCI/MSI: Free legacy IRQ when enabling MSI/MSI-X
  PCI: Add helpers to manage pci_dev->irq and pci_dev->irq_managed
  PCI, x86: Implement pcibios_alloc_irq() and pcibios_free_irq()
  PCI: Add pcibios_alloc_irq() and pcibios_free_irq()

* pci/misc:
  PCI: Remove unused "pci_probe" flags
  PCI: Add VPD function 0 quirk for Intel Ethernet devices
  PCI: Add dev_flags bit to access VPD through function 0
  PCI / ACPI: Fix pci_acpi_optimize_delay() comment
  PCI: Remove a broken link in quirks.c
  PCI: Remove useless redundant code
  PCI: Simplify pci_find_(ext_)capability() return value checks
  PCI: Move PCI_FIND_CAP_TTL to pci.h and use it in quirks
  PCI: Add pcie_downstream_port() (true for Root and Switch Downstream Ports)
  PCI: Fix pcie_port_device_resume() comment
  PCI: Shift PCI_CLASS_NOT_DEFINED consistently with other classes
  PCI: Revert aeb30016fe ("PCI: add Intel USB specific reset method")
  PCI: Fix TI816X class code quirk
  PCI: Fix generic NCR 53c810 class code quirk
  PCI: Use PCI_CLASS_SERIAL_USB instead of bare number
  PCI: Add quirk for Intersil/Techwell TW686[4589] AV capture cards
  PCI: Remove Intel Cherrytrail D3 delays

* pci/resource:
  PCI: Call pci_read_bridge_bases() from core instead of arch code

* pci/virtualization:
  PCI: Restore ACS configuration as part of pci_restore_state()
2015-08-04 20:54:05 -05:00
Peter Zijlstra
75f80859b1 perf/x86/intel/pebs: Robustify PEBS buffer drain
Vince Weaver and Stephane Eranian reported warnings in the PEBS
code when running the perf fuzzer. Stephane wrote:

  > I can reproduce the problem on my HSW running the fuzzer.
  >
  > I can see why this could be happening if you are mixing PEBS and non PEBS events
  > in the bottom 4 counters. I suspect:
  >         for (bit = 0; bit < x86_pmu.max_pebs_events; bit++) {
  >                 if ((counts[bit] == 0) && (error[bit] == 0))
  >                         continue;
  >
  > This test is not correct when you have non-PEBS events mixed with
  > PEBS events and they overflow at the same time. They will have
  > counts[i] != 0 but error[i] == 0, and thus you fall thru the loop
  > and hit the assert. Or it is something along those lines.

The only way I can make this work is if ->status only has !PEBS events
set, because if it has both set we'll take that slow path which masks
out the !PEBS bits.

After masking there are 3 options:

 - there is one bit set, and its @bit, we increment counts[bit].

 - there are multiple bits set, we increment error[] for each set bit,
   we do not increment counts[].

 - there are no bits set, we do nothing.

The intent was to never increment counts[] for !PEBS events.

Now if we start out with only a single !PEBS event set, we'll pass the
test and increment counts[] for a !PEBS and hit the warn.

Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Reported-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kan.liang@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-04 10:17:01 +02:00
Liang, Kan
2a853e1123 perf/x86/intel/pebs: Fix event disable PEBS buffer drain
When disabling a PEBS event, we need to drain the buffer. Doing so
requires a correct cpuc->pebs_active mask.

The current code clears the pebs_active bit before draining the
buffer. Fix that.

Signed-off-by: "Liang, Kan" <kan.liang@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver<vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/37D7C6CF3E00A74B8858931C1DB2F07701885A65@SHSMSX103.ccr.corp.intel.com
[ Fixed the SOB. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-04 10:17:00 +02:00
Andy Lutomirski
b7b7c7821d perf/x86: Add an MSR PMU driver
This patch adds an MSR PMU to support free running MSR counters. Such
as time and freq related counters includes TSC, IA32_APERF, IA32_MPERF
and IA32_PPERF, but also SMI_COUNT.

The events are exposed in sysfs for use by perf stat and other tools.
The files are under /sys/devices/msr/events/

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Kan Liang <kan.liang@intel.com>
[ s/freq/msr/, added SMI_COUNT, fixed bugs. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@kernel.org
Cc: adrian.hunter@intel.com
Cc: dsahern@gmail.com
Cc: eranian@google.com
Cc: jolsa@kernel.org
Cc: mark.rutland@arm.com
Cc: namhyung@kernel.org
Link: http://lkml.kernel.org/r/1437407346-31186-1-git-send-email-kan.liang@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-04 10:17:00 +02:00
Kan Liang
070e98873c perf/x86/intel/uncore: Add Broadwell-DE uncore support
The uncore subsystem for Broadwell-DE is similar to Haswell-EP.  There
are some differences in pci device IDs, box number and constraints.

Please refer to the public document:

  http://www.intel.com/content/www/us/en/processors/xeon/xeon-d-1500-uncore-performance-monitoring.html

Signed-off-by: Kan Liang <kan.liang@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: eranian@google.com
Link: http://lkml.kernel.org/r/1435839172-15114-1-git-send-email-kan.liang@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-04 10:17:00 +02:00
Andi Kleen
8c4fe7095d perf/x86/intel: Use 0x11 as extra reg test value
The next patch adds a new perf extra register where 0x1ff is not a valid
value. Use 0x11 instead.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1435707205-6676-3-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-04 10:16:59 +02:00
Andi Kleen
47732d8863 perf/x86: Make merge_attr() global to use from perf_event_intel
merge_attr() allows to merge two sysfs attribute tables.
Export it to be usable by other files too.

Next patch is going to use that to extend the sysfs format
attributes for a CPU.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: eranian@google.com
Link: http://lkml.kernel.org/r/1435612935-24425-1-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-04 10:16:59 +02:00
Andi Kleen
90405aa022 perf/x86/intel/lbr: Limit LBR accesses to TOS in callstack mode
In callstack mode the LBR is not a ring buffer, but a stack that grows up
and down. This means in  this case we don't need to access all LBRs, only the
ones up to TOS. Do this optimization for the normal LBR read, and the context
switch save/restore code. For save/restore it can be done unconditionally, as
it only runs when call stack mode is active.

This recovers some of the cost of going to 32 LBRs on Skylake.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@kernel.org
Cc: eranian@google.com
Cc: jolsa@redhat.com
Link: http://lkml.kernel.org/r/1432786398-23861-6-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-04 10:16:59 +02:00
Andi Kleen
e0573364b8 perf/x86/intel/lbr: Use correct index to save/restore LBR_INFO with call stack
Use the correct index to save/restore the LBR_INFO_x MSR in
callstack mode. This is more a cleanup, as even with the wrong
index the register was correctly saved/restored, and also
LBR callgraph mode in perf tools do not really need anything in
LBR_INFO. But still better to use the right index.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@kernel.org
Cc: eranian@google.com
Cc: jolsa@redhat.com
Link: http://lkml.kernel.org/r/1432786398-23861-5-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-04 10:16:59 +02:00
Andi Kleen
9a92e16fd7 perf/x86/intel: Add Intel Skylake PMU support
Add perf core PMU support for future Intel Skylake CPU cores.

The code is based on Haswell/Broadwell.

There is a new cache event list, based on the updated Haswell
event list.

Skylake has removed most counter constraints on basic
events, so the basic constraints table now only has a single
entry (plus the fixed counters).

TSX support and various other setups are all shared with Haswell.

Skylake has 32 LBR entries. Add a new LBR init function
to set this up. The filters are all the same as Haswell.

It also has a new LBR format with a separate LBR_INFO_* MSR,
but that has been already added earlier.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: eranian@google.com
Link: http://lkml.kernel.org/r/1431285767-27027-7-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-04 10:16:58 +02:00
Andi Kleen
425507fa5f perf/x86/intel/lbr: Optimize v4 LBR unfreezing
In Arch perfmon v4 the GLOBAL_STATUS reset automatically unfreezes
LBRs. So no need to do it manually in the LBR code. Add a check
to skip it.

v2: Move test up to beginning of function.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: eranian@google.com
Link: http://lkml.kernel.org/r/1431285767-27027-9-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-04 10:16:58 +02:00
Andi Kleen
0f29e573dd perf/x86/intel: Move PMU ACK to after LBR read
With Arch Perfmon v4 the PMU ack unfreezes the LBRs. So we need to do
the PMU ack after the LBR reading, otherwise the LBRs would be polluted
by the PMI handler.

This is a minimal change. In principle the ACK could be moved much later.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: eranian@google.com
Link: http://lkml.kernel.org/r/1431285767-27027-10-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-04 10:16:58 +02:00
Andi Kleen
d8020bee1d perf/x86/intel: Handle new arch perfmon v4 status bits
ArchPerfmon v4 has some new status bits in GLOBAL_STATUS.

These need to be ignored when deciding whether a NMI
was an NMI, to avoid eating all NMIs when they
stay set, see:

    b292d7a104 ("perf/x86/intel: ignore CondChgd bit to avoid false NMI handling")

This patch ignores the new ASIF bit, which indicates
that SGX interfered with the PMU, and also the new
LBR freezing bits, which are set when the LBRs get
frozen, plus the existing CondChange (set by JTAG
debuggers and some buggy BIOSes)

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: eranian@google.com
Link: http://lkml.kernel.org/r/1431285767-27027-8-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-04 10:16:57 +02:00
Andi Kleen
50eab8f6ec perf/x86/intel/lbr: Add support for LBRv5
Add support for the new LBRv5 format used on Intel Skylake CPUs.

The flags for mispredict, abort, in_tx etc. moved to range of separate
LBR_INFO_* MSRs. Teach the LBR code to read those. The original
LBR registers stay the same, except they have full sign
extension now.

LBR_INFO also reports a cycle count to the last branch.
Report the cycle information using the new "cycles" branch_info
output field.

In addition we have to context switch and clear the new INFO
MSRs to avoid any information leaks.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: eranian@google.com
Link: http://lkml.kernel.org/r/1431285767-27027-6-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-04 10:16:57 +02:00
Andi Kleen
b83ff1c861 x86: Add new MSRs and MSR bits used for Intel Skylake PMU support
Add new MSRs (LBR_INFO) and some new MSR bits used by the Intel Skylake
PMU driver.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: eranian@google.com
Link: http://lkml.kernel.org/r/1431285767-27027-4-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-04 10:16:56 +02:00
Andi Kleen
a7b58d211b perf/x86/intel/lbr: Allow time stamp for free running PEBSv3
With PEBSv3 the PEBS record contains a time stamp. That means we can allow
free-running PEBS without a PMI even if the user program requested a time stamp.
This avoids the need to use -T to get free running PEBS, and also avoids
any problems with mis-identifying MMAPs later.

Move the free_running_flags state into a variable in x86_pmu and use it.
This only works when no explicit clock_id is set.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@kernel.org
Cc: eranian@google.com
Cc: jolsa@redhat.com
Cc: kan.liang@intel.com
Link: http://lkml.kernel.org/r/1432786398-23861-2-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-04 10:16:56 +02:00
Andi Kleen
2f7ebf2ec2 perf/x86/intel: Add support for PEBSv3 profiling
PEBSv3 is the same as the existing PEBSv2 used on Haswell,
but it adds a new TSC field. Add support to the generic
PEBS handler to handle the new format, and overwrite
the perf time stamp using the new native_sched_clock_from_tsc().

Right now the time stamp is just slightly more accurate,
as it is nearer the actual event trigger point. With
the PEBS threshold > 1 patchkit it will be much more accurate,
avoid the problems with MMAP mismatches earlier.
The accurate time stamping is only implemented for
the default trace clock for now.

v2: Use _skl prefix. Check for default clock_id.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: eranian@google.com
Link: http://lkml.kernel.org/r/1431285767-27027-3-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-04 10:16:56 +02:00
Andi Kleen
a94cab2376 perf/x86: Add a native_perf_sched_clock_from_tsc()
PEBSv3 has a raw TSC time stamp in its memory buffer that
later needs to to be converted to perf_clock.

Add a native_sched_clock_from_tsc() that works the same
as native_sched_clock(), but starts with an already given
TSC value.

Paravirt is ignored, it will just get the native clock.
But there isn't a para virtualized PEBS anyway.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: eranian@google.com
Link: http://lkml.kernel.org/r/1431285767-27027-2-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-04 10:16:55 +02:00
Alexander Shishkin
b1bf72d669 perf/x86/intel/pt: Add new timing packet enables
Intel PT chapter in the new Intel Architecture SDM adds several packets
corresponding enable bits and registers that control packet generation.
Also, additional bits in the Intel PT CPUID leaf were added to enumerate
presence and parameters of these new packets and features.

The packets and enables are:

  * CYC: cycle accurate mode, provides the number of cycles elapsed since
    previous CYC packet; its presence and available threshold values are
    enumerated via CPUID;

  * MTC: mini time counter packets, used for tracking TSC time between
    full TSC packets; its presence and available resolution options are
    enumerated via CPUID;

  * PSB packet period is now configurable, available period values are
    enumerated via CPUID.

This patch adds corresponding bit and register definitions, pmu driver
capabilities based on CPUID enumeration, new attribute format bits for
the new featurens and extends event configuration validation function
to take these into account.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@infradead.org
Cc: adrian.hunter@intel.com
Cc: hpa@zytor.com
Link: http://lkml.kernel.org/r/1438262131-12725-1-git-send-email-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-04 10:16:55 +02:00
Alexander Shishkin
9a6694cfa2 perf/x86/intel/pt: Do not force sync packets on every schedule-in
Currently, the PT driver zeroes out the status register every time before
starting the event. However, all the writable bits are already taken care
of in pt_handle_status() function, except the new PacketByteCnt field,
which in new versions of PT contains the number of packet bytes written
since the last sync (PSB) packet. Zeroing it out before enabling PT forces
a sync packet to be written. This means that, with the existing code, a
sync packet (PSB and PSBEND, 18 bytes in total) will be generated every
time a PT event is scheduled in.

To avoid these unnecessary syncs and save a WRMSR in the fast path, this
patch changes the default behavior to not clear PacketByteCnt field, so
that the sync packets will be generated with the period specified as
"psb_period" attribute config field. This has little impact on the trace
data as the other packets that are normally sent within PSB+ (between PSB
and PSBEND) have their own generation scenarios which do not depend on the
sync packets.

One exception where we do need to force PSB like this when tracing starts,
so that the decoder has a clear sync point in the trace. For this purpose
we aready have hw::itrace_started flag, which we are currently using to
output PERF_RECORD_ITRACE_START. This patch moves setting itrace_started
from perf core to the pmu::start, where it should still be 0 on the very
first run.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@infradead.org
Cc: adrian.hunter@intel.com
Cc: hpa@zytor.com
Link: http://lkml.kernel.org/r/1438264104-16189-1-git-send-email-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-04 10:16:55 +02:00
Andy Lutomirski
27747f8bc3 perf/x86/hw_breakpoints: Fix check for kernel-space breakpoints
The check looked wrong, although I think it was actually safe.  TASK_SIZE
is unnecessarily small for compat tasks, and it wasn't possible to make
a range breakpoint so large it started in user space and ended in kernel
space.

Nonetheless, let's fix up the check for the benefit of future
readers.  A breakpoint is in the kernel if either end is in the
kernel.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/136be387950e78f18cea60e9d1bef74465d0ee8f.1438312874.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-04 10:16:55 +02:00
Andy Lutomirski
ab513927ab perf/x86/hw_breakpoints: Improve range breakpoint validation
Range breakpoints will do the wrong thing if the address isn't
aligned.  While we're there, add comments about why it's safe for
instruction breakpoints.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/ae25d14d61f2f43b78e0a247e469f3072df7e201.1438312874.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-04 10:16:54 +02:00
Andy Lutomirski
e5779e8e12 perf/x86/hw_breakpoints: Disallow kernel breakpoints unless kprobe-safe
Code on the kprobe blacklist doesn't want unexpected int3
exceptions. It probably doesn't want unexpected debug exceptions
either. Be safe: disallow breakpoints in nokprobes code.

On non-CONFIG_KPROBES kernels, there is no kprobe blacklist.  In
that case, disallow kernel breakpoints entirely.

It will be particularly important to keep hw breakpoints out of the
entry and NMI code once we move debug exceptions off the IST stack.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/e14b152af99640448d895e3c2a8c2d5ee19a1325.1438312874.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-04 10:16:54 +02:00
Kan Liang
ae3f011fc2 perf/x86/intel: Fix SLM MSR_OFFCORE_RSP1 valid_mask
AVG_LATENCY(bit 38) is only available on MSR_OFFCORE_RSP0.
So the bit should be removed from RSP1 valid_mask.

Since RSP0 and RSP1 may have different valid_mask, intel_alt_er should
validate the config on the alternate offcore reg before replacing it.

Signed-off-by: Kan Liang <kan.liang@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1435170215-5017-1-git-send-email-kan.liang@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-04 10:16:54 +02:00
Alexander Shishkin
c749b3e963 perf/x86/intel/lbr: Kill off intel_pmu_needs_lbr_smpl for good
The x86_lbr_exclusive commit (4807034248 "perf/x86: Mark Intel PT and
LBR/BTS as mutually exclusive") mistakenly moved intel_pmu_needs_lbr_smpl()
to perf_event.h, while another commit (a46a230001 "perf: Simplify the
branch stack check") removed it in favor of needs_branch_stack().

This patch gets rid of intel_pmu_needs_lbr_smpl() for good.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@infradead.org
Cc: adrian.hunter@intel.com
Cc: hpa@zytor.com
Link: http://lkml.kernel.org/r/1435140349-32588-3-git-send-email-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-04 10:16:53 +02:00
Alexander Shishkin
e9b3bd379c perf/x86/intel/bts: Drop redundant declarations
Both intel_pmu_enable_bts() and intel_pmu_disable_bts() are in perf_event.h
header file, no need to have them declared again in the driver.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@infradead.org
Cc: adrian.hunter@intel.com
Cc: hpa@zytor.com
Link: http://lkml.kernel.org/r/1435140349-32588-2-git-send-email-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-04 10:16:53 +02:00
Andi Kleen
3a999587b4 perf/x86/intel/uncore: Use Sandy Bridge client PMU on Haswell/Broadwell
Haswell and Broadwell have the same uncore CBOX/ARB PMU as Sandy Bridge.
Add the respective model numbers to enable the SNB uncore PMU.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: eranian@google.com
Cc: kan.liang@intel.com
Link: http://lkml.kernel.org/r/1434347862-28490-2-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-04 10:16:53 +02:00
Andi Kleen
e3a13192d8 perf/x86/intel/uncore: Add support for ARB uncore PMU on Sandy/IvyBridge
Add a new "ARB" uncore PMU that is used to monitor the uncore queue
arbiter. This is useful to measure uncore queue occupancy and similar
statistics. The registers all have the same format as the
existing CBOX PMU.

Also move the event constraints from the CBOX to ARB. The 0x80+
events are ARB events and cannot be scheduled on a CBOX PMU.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: eranian@google.com
Cc: kan.liang@intel.com
Link: http://lkml.kernel.org/r/1434347862-28490-1-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-04 10:16:52 +02:00
Vaishali Thakkar
070a7cdfa4 perf/x86/intel/uncore: Remove use of macro DEFINE_PCI_DEVICE_TABLE()
The DEFINE_PCI_DEVICE_TABLE() macro is deprecated. Use
'struct pci_device_id' instead of DEFINE_PCI_DEVICE_TABLE(),
with the goal of getting rid of this macro completely.

This Coccinelle semantic patch performs this transformation:

@@
identifier a;
declarer name DEFINE_PCI_DEVICE_TABLE;
initializer i;
@@
- DEFINE_PCI_DEVICE_TABLE(a)
+ const struct pci_device_id a[] = i;

Signed-off-by: Vaishali Thakkar <vthakkar1994@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20150717052759.GA6265@vaishali-Ideapad-Z570
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-04 10:16:52 +02:00
Dasaratharaman Chandramouli
3a2a779732 perf/x86/intel/rapl: Add support for Knights Landing (KNL)
Knights Landing DRAM RAPL supports PKG and DRAM RAPL domains.
DRAM RAPL has a different fixed energy unit (2^-16J) similar to
that of HSW.

Signed-off-by: Dasaratharaman Chandramouli <dasaratharaman.chandramouli@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Stephane Eranian <eranian@google.com>
Acked-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jacob Pan Jun <jacob.jun.pan@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nikhil Rao <nikhil.rao@intel.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/aa63b4a3af3160152fea1a10c807f4200527280c.1432665809.git.dasaratharaman.chandramouli@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-04 10:16:52 +02:00
Konstantin Khlebnikov
fe32d3cd5e sched/preempt: Fix cond_resched_lock() and cond_resched_softirq()
These functions check should_resched() before unlocking spinlock/bh-enable:
preempt_count always non-zero => should_resched() always returns false.
cond_resched_lock() worked iff spin_needbreak is set.

This patch adds argument "preempt_offset" to should_resched().

preempt_count offset constants for that:

  PREEMPT_DISABLE_OFFSET  - offset after preempt_disable()
  PREEMPT_LOCK_OFFSET     - offset after spin_lock()
  SOFTIRQ_DISABLE_OFFSET  - offset after local_bh_distable()
  SOFTIRQ_LOCK_OFFSET     - offset after spin_lock_bh()

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Graf <agraf@suse.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: bdb4380658 ("sched: Extract the basic add/sub preempt_count modifiers")
Link: http://lkml.kernel.org/r/20150715095204.12246.98268.stgit@buzz
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-03 12:21:24 +02:00
Peter Zijlstra
3bbfafb77a x86, tsc, locking/static_keys: Employ static_branch_likely()
Because of the static_key restrictions we had to take an unconditional
jump for the most likely case, causing $I bloat.

Rewrite to use the new primitives.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-03 11:34:16 +02:00
Peter Zijlstra
11276d5306 locking/static_keys: Add a new static_key interface
There are various problems and short-comings with the current
static_key interface:

 - static_key_{true,false}() read like a branch depending on the key
   value, instead of the actual likely/unlikely branch depending on
   init value.

 - static_key_{true,false}() are, as stated above, tied to the
   static_key init values STATIC_KEY_INIT_{TRUE,FALSE}.

 - we're limited to the 2 (out of 4) possible options that compile to
   a default NOP because that's what our arch_static_branch() assembly
   emits.

So provide a new static_key interface:

  DEFINE_STATIC_KEY_TRUE(name);
  DEFINE_STATIC_KEY_FALSE(name);

Which define a key of different types with an initial true/false
value.

Then allow:

   static_branch_likely()
   static_branch_unlikely()

to take a key of either type and emit the right instruction for the
case.

This means adding a second arch_static_branch_jump() assembly helper
which emits a JMP per default.

In order to determine the right instruction for the right state,
encode the branch type in the LSB of jump_entry::key.

This is the final step in removing the naming confusion that has led to
a stream of avoidable bugs such as:

  a833581e37 ("x86, perf: Fix static_key bug in load_mm_cr4()")

... but it also allows new static key combinations that will give us
performance enhancements in the subsequent patches.

Tested-by: Rabin Vincent <rabin@rab.in> # arm
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> # ppc
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> # s390
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-03 11:34:15 +02:00
Peter Zijlstra
76b235c6bc jump_label: Rename JUMP_LABEL_{EN,DIS}ABLE to JUMP_LABEL_{JMP,NOP}
Since we've already stepped away from ENABLE is a JMP and DISABLE is a
NOP with the branch_default bits, and are going to make it even worse,
rename it to make it all clearer.

This way we don't mix multiple levels of logic attributes, but have a
plain 'physical' name for what the current instruction patching status
of a jump label is.

This is a first step in removing the naming confusion that has led to
a stream of avoidable bugs such as:

  a833581e37 ("x86, perf: Fix static_key bug in load_mm_cr4()")

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
[ Beefed up the changelog. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-03 11:34:12 +02:00