PMU checking can fail due to various reasons. On native machine, this
is mostly caused by faulty hardware and it is reasonable to use
KERN_ERR in reporting. However, when kernel is running on virtualized
environment, this checking can fail if virtual PMU is not supported
(e.g. KVM on AMD host). It is annoying to see an error message on
splash screen, even though we know such failure is benign on
virtualized environment.
This patch checks if the kernel is running in a virtualized environment.
If so, it will use KERN_INFO in reporting, which reduces the syslog
priority of them. This patch was tested successfully on KVM.
Signed-off-by: Wei Huang <wei@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Link: http://lkml.kernel.org/r/1411617314-24659-1-git-send-email-wei@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
I was looking for the trinity oops cause in the uncore driver.
(so far didn't found it)
However I found this tiny race: when a box is set up two threads on the
same CPU, they may be setting up the box in parallel (e.g. with kernel
preemption). This could lead to the reference count being increasing
too much. Always recheck there is no existing cpu reference inside the lock.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: eranian@google.com
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Link: http://lkml.kernel.org/r/1411424826-15629-1-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Quark x1000 advertises PGE via the standard CPUID method
PGE bits exist in Quark X1000's PTEs. In order to flush
an individual PTE it is necessary to reload CR3 irrespective
of the PTE.PGE bit.
See Quark Core_DevMan_001.pdf section 6.4.11
This bug was fixed in Galileo kernels, unfixed vanilla kernels are expected to
crash and burn on this platform.
Signed-off-by: Bryan O'Donoghue <pure.logic@nexus-software.ie>
Cc: Borislav Petkov <bp@alien8.de>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/1411514784-14885-1-git-send-email-pure.logic@nexus-software.ie
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch restructures the memory controller (IMC) uncore PMU support
for client SNB/IVB/HSW processors. The main change is that it can now
cope with more than one PCI device ID per processor model. There are
many flavors of memory controllers for each processor. They have
different PCI device ID, yet they behave the same w.r.t. the memory
controller PMU that we are interested in.
The patch now supports two distinct memory controllers for IVB
processors: one for mobile, one for desktop.
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140917090616.GA11281@quad
Cc: ak@linux.intel.com
Cc: kan.liang@intel.com
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The PCU frequency band filters use 8 bit each in a register.
When setting up the value the shift value was not correctly
scaled, which resulted in all filters except for band 0 to
be zero. Fix the scaling.
This allows to correctly monitor multiple uncore frequency bands.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: eranian@google.com
Link: http://lkml.kernel.org/r/1409872109-31645-5-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The IvyBridge-EP uncore driver was missing three filter flags:
NC, ISOC, C6 which are useful in some cases. Support them in the same way
as the Haswell EP driver, by allowing to set them and exposing
them in the sysfs formats.
Also fix a typo in a define.
Relies on the Haswell EP driver to be applied earlier.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1409872109-31645-4-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Current code registers PMUs for all possible uncore pci devices.
This is not good because, on some machines, one or more uncore pci
devices can be missing. The missing pci device make corresponding
PMU unusable. Register the PMU only if the uncore device exists.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: eranian@google.com
Link: http://lkml.kernel.org/r/1409872109-31645-3-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The uncore subsystem in Haswell-EP is similar to Sandy/Ivy
Bridge-EP. There are some differences in config register
encoding and pci device IDs. The Haswell-EP uncore also
supports a few new events. Add the Haswell-EP driver to
the snbep split driver.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
[ Add missing break. Add imc events. Add cbox nc/isoc/c6. ]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: eranian@google.com
Link: http://lkml.kernel.org/r/1409872109-31645-2-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Use the newly added Broadwell cache event list for Haswell too.
All Haswell and Broadwell events and offcore masks used in these lists
are identical.
However Haswell is very different from the Sandy Bridge
list that was used previously. That fixes a wide range of mis-counting
cache events.
The node events are now only for retired memory events, so prefetching
and speculative memory accesses are not included. They are PEBS
capable now, which makes it much easier to sample for them, plus it's
possible to create address maps with -d.
The prefetch events are gone now. They way the hardware counts
them is very misleading (some prefetches included, others not), so
it seemed best to leave them out.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: eranian@google.com
Link: http://lkml.kernel.org/r/1409683455-29168-5-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
On Broadwell INST_RETIRED.ALL cannot be used with any period
that doesn't have the lowest 6 bits cleared. And the period
should not be smaller than 128.
Add a new callback to enforce this, and set it for Broadwell.
This is erratum BDM57 and BDM11.
How does this handle the case when an app requests a specific
period with some of the bottom bits set
The apps thinks it is sampling at X occurences per sample, when it is
in fact at X - 63 (worst case).
Short answer:
Any useful instruction sampling period needs to be 4-6 orders
of magnitude larger than 128, as an PMI every 128 instructions
would instantly overwhelm the system and be throttled.
So the +-64 error from this is really small compared to the
period, much smaller than normal system jitter.
Long answer:
<write up by Peter:>
IFF we guarantee perf_event_attr::sample_period >= 128.
Suppose we start out with sample_period=192; then we'll set period_left
to 192, we'll end up with left = 128 (we truncate the lower bits). We
get an interrupt, find that period_left = 64 (>0 so we return 0 and
don't get an overflow handler), up that to 128. Then we trigger again,
at n=256. Then we find period_left = -64 (<=0 so we return 1 and do get
an overflow). We increment with sample_period so we get left = 128. We
fire again, at n=384, period_left = 0 (<=0 so we return 1 and get an
overflow). And on and on.
So while the individual interrupts are 'wrong' we get then with
interval=256,128 in exactly the right ratio to average out at 192. And
this works for everything >=128.
So the num_samples*fixed_period thing is still entirely correct +- 127,
which is good enough I'd say, as you already have that error anyhow.
So no need to 'fix' the tools, al we need to do is refuse to create
INST_RETIRED:ALL events with sample_period < 128.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Maria Dimakopoulou <maria.n.dimakopoulou@gmail.com>
Cc: Mark Davies <junk@eslaf.co.uk>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lkml.kernel.org/r/1409683455-29168-4-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Add Broadwell support for Broadwell Client to perf. This is very
similar to Haswell. It uses a new cache event table, because there
were various changes there.
The constraint list has one new event that needs to be handled over
Haswell.
The PEBS event list is the same, so we reuse Haswell's.
[fengguang.wu: make intel_bdw_event_constraints[] static]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: eranian@google.com
Link: http://lkml.kernel.org/r/1409683455-29168-3-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
71 is a Broadwell, not a Haswell. The model number was added
by mistake earlier.
Remove it for now, until it can be re-added later with
real Broadwell support.
In practice it does not cause a lot of issues because the Broadwell
PMU is very similar to Haswell, but some details were wrong,
and it's better to handle it correctly.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: eranian@google.com
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Link: http://lkml.kernel.org/r/1409683455-29168-1-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
On x86_64, kernel text mappings are mapped read-only with CONFIG_DEBUG_RODATA.
In that case, KVM will fail to patch VMCALL instructions to VMMCALL
as required on AMD processors.
The failure mode is currently a divide-by-zero exception, which obviously
is a KVM bug that has to be fixed. However, picking the right instruction
between VMCALL and VMMCALL will be faster and will help if you cannot upgrade
the hypervisor.
Reported-by: Chris Webb <chris@arachsys.com>
Tested-by: Chris Webb <chris@arachsys.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
intel_init_thermal() is called from a) at the time of system initializing
and b) at the time of system resume to initialize thermal
monitoring.
In case when thermal monitoring is handled by SMI, we get to know it via
printk(). Currently it gives the message at both cases, but its okay if
we get it only once and no need to get the same message at every time
system resumes.
So, limit showing this message only at system boot time by avoid showing
at system resume and reduce abusing kernel log buffer.
Signed-off-by: Rakib Mullick <rakib.mullick@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Tony Luck <tony.luck@intel.com>
Link: http://lkml.kernel.org/r/1411068135.5121.10.camel@localhost.localdomain
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Hang is observed on virtual machines during CPU hotplug,
especially in big guests with many CPUs. (It reproducible
more often if host is over-committed).
It happens because master CPU gives up waiting on
secondary CPU and allows it to run wild. As result
AP causes locking or crashing system. For example
as described here:
https://lkml.org/lkml/2014/3/6/257
If master CPU have sent STARTUP IPI successfully,
and AP signalled to master CPU that it's ready
to start initialization, make master CPU wait
indefinitely till AP is onlined.
To ensure that AP won't ever run wild, make it
wait at early startup till master CPU confirms its
intention to wait for AP. If AP doesn't respond in 10
seconds, the master CPU will timeout and cancel
AP onlining.
Signed-off-by: Igor Mammedov <imammedo@redhat.com>
Acked-by: Toshi Kani <toshi.kani@hp.com>
Tested-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: xen-devel@lists.xenproject.org
Link: http://lkml.kernel.org/r/1403266991-12233-1-git-send-email-imammedo@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The original motivation for these patches was for an Intel CPU
feature called MPX. The patch to add a disabled feature for it
will go in with the other parts of the support.
But, in the meantime, there are a few other features than MPX
that we can make assumptions about at compile-time based on
compile options. Add them to disabled-features.h and check them
with cpu_feature_enabled().
Note that this gets rid of the last things that needed an #ifdef
CONFIG_X86_64 in cpufeature.h. Yay!
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: http://lkml.kernel.org/r/20140911211524.C0EC332A@viggo.jf.intel.com
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
The new split Intel uncore driver code that recently went
into tip added a section mismatch, which the build process
complains about.
uncore_pmu_register() can be called from uncore_pci_probe,()
which is not __init and can be called from pci driver ->probe.
I'm not fully sure if it's actually possible to call the probe
function later, but it seems safer to mark uncore_pmu_register
not __init.
This also fixes the warning.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1409332858-29039-1-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
A few of the initialization functions are missing the __init annotation.
Fix this and thereby allow ~680 additional bytes of code to be released
after initialization.
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: x86@kernel.org
Link: http://lkml.kernel.org/r/1409071785-26015-1-git-send-email-minipli@googlemail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
__get_cpu_var() is used for multiple purposes in the kernel source. One of
them is address calculation via the form &__get_cpu_var(x). This calculates
the address for the instance of the percpu variable of the current processor
based on an offset.
Other use cases are for storing and retrieving data from the current
processors percpu area. __get_cpu_var() can be used as an lvalue when
writing data or on the right side of an assignment.
__get_cpu_var() is defined as :
#define __get_cpu_var(var) (*this_cpu_ptr(&(var)))
__get_cpu_var() always only does an address determination. However, store
and retrieve operations could use a segment prefix (or global register on
other platforms) to avoid the address calculation.
this_cpu_write() and this_cpu_read() can directly take an offset into a
percpu area and use optimized assembly code to read and write per cpu
variables.
This patch converts __get_cpu_var into either an explicit address
calculation using this_cpu_ptr() or into a use of this_cpu operations that
use the offset. Thereby address calculations are avoided and less registers
are used when code is generated.
Transformations done to __get_cpu_var()
1. Determine the address of the percpu instance of the current processor.
DEFINE_PER_CPU(int, y);
int *x = &__get_cpu_var(y);
Converts to
int *x = this_cpu_ptr(&y);
2. Same as #1 but this time an array structure is involved.
DEFINE_PER_CPU(int, y[20]);
int *x = __get_cpu_var(y);
Converts to
int *x = this_cpu_ptr(y);
3. Retrieve the content of the current processors instance of a per cpu
variable.
DEFINE_PER_CPU(int, y);
int x = __get_cpu_var(y)
Converts to
int x = __this_cpu_read(y);
4. Retrieve the content of a percpu struct
DEFINE_PER_CPU(struct mystruct, y);
struct mystruct x = __get_cpu_var(y);
Converts to
memcpy(&x, this_cpu_ptr(&y), sizeof(x));
5. Assignment to a per cpu variable
DEFINE_PER_CPU(int, y)
__get_cpu_var(y) = x;
Converts to
__this_cpu_write(y, x);
6. Increment/Decrement etc of a per cpu variable
DEFINE_PER_CPU(int, y);
__get_cpu_var(y)++
Converts to
__this_cpu_inc(y)
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: x86@kernel.org
Acked-by: H. Peter Anvin <hpa@linux.intel.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The table mapping CPUID bits to human-readable strings takes up a
non-trivial amount of space, and only exists to support /proc/cpuinfo
and a couple of kernel messages. Since programs depend on the format of
/proc/cpuinfo, force inclusion of the table when building with /proc
support; otherwise, support omitting that table to save space, in which
case the kernel messages will print features numerically instead.
In addition to saving 1408 bytes out of vmlinux, this also saves 1373
bytes out of the uncompressed setup code, which contributes directly to
the size of bzImage.
Signed-off-by: Josh Triplett <josh@joshtriplett.org>
arch/x86/kernel/cpu/proc.c only exists to support files in /proc; omit that
file when compiling without CONFIG_PROC_FS.
Saves 645 additional bytes on 32-bit x86 when !CONFIG_PROC_FS:
add/remove: 0/5 grow/shrink: 0/0 up/down: 0/-645 (-645)
function old new delta
c_stop 1 - -1
c_next 11 - -11
cpuinfo_op 16 - -16
c_start 24 - -24
show_cpuinfo 593 - -593
Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Miscellaneous
- Remove DEFINE_PCI_DEVICE_TABLE macro use (Benoit Taine)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJT7PyAAAoJEFmIoMA60/r8kjQQALr/8oEfZoVcjgCb7waWOr25
hUTnrI6GBIAh/50hoBiPq0ouPCAKVv66+CUhuhFkLP7oJz+rMU0B9hfUvdLfmCpH
7ppaallkllT9nPFIr7h5RUWLXsoQyuHmCYmSrUCcnlT2LPgU0dN72YWElLisEM6Z
Pldg3933xyIQaCWviHjGEjWb7NvC+JY4pTkV5iyqGgU8Ale/eFYtLLSfdBEjIbGv
VDirYZmKELYeuncZPrTAsp4IENRMZn702wwDakMSODVMEWtJB5h4yrBawqQDlFP5
9ztIX6n9p9zkdVKbYZlx/Xwv6SYEnYXLxauVQMSO3Nck7Z10R5Ud+5uuCg/6mWH8
AQI4UV5bbJcg7zHgocTG9XLFLFPoPtD2JT6k6UT1LeUAiAOqcSzhRO+/qJBmJOWZ
Zv+EHXPlxBrl0zNifut6ZQrY17teuItVtmha70a/9W3PjnIx3KecqLcTwdTvDsOY
IAyH8WMZrBKpPpsczSmfE93i2Z1QRS91HEAOeSMxl/98dcDTdllYZS7spjoDll2f
xmpGDbpriLSCu2XsGHfTC9RbqA7CyuFlHggJSQDkT/5Esli0sCs7eweTuK3RVvPu
t6bUHK3yElb6x9qMZhb5q6l72wSMlGMishTdaxEHmqrEA8PtaIFodmVX2T/Zel5n
GHN6bysPqDItNR2v/3JX
=jJGu
-----END PGP SIGNATURE-----
Merge tag 'pci-v3.17-changes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
Pull DEFINE_PCI_DEVICE_TABLE removal from Bjorn Helgaas:
"Part two of the PCI changes for v3.17:
- Remove DEFINE_PCI_DEVICE_TABLE macro use (Benoit Taine)
It's a mechanical change that removes uses of the
DEFINE_PCI_DEVICE_TABLE macro. I waited until later in the merge
window to reduce conflicts, but it's possible you'll still see a few"
* tag 'pci-v3.17-changes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
PCI: Remove DEFINE_PCI_DEVICE_TABLE macro use
Pull x86/xsave changes from Peter Anvin:
"This is a patchset to support the XSAVES instruction required to
support context switch of supervisor-only features in upcoming
silicon.
This patchset missed the 3.16 merge window, which is why it is based
on 3.15-rc7"
* 'x86-xsave-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, xsave: Add forgotten inline annotation
x86/xsaves: Clean up code in xstate offsets computation in xsave area
x86/xsave: Make it clear that the XSAVE macros use (%edi)/(%rdi)
Define kernel API to get address of each state in xsave area
x86/xsaves: Enable xsaves/xrstors
x86/xsaves: Call booting time xsaves and xrstors in setup_init_fpu_buf
x86/xsaves: Save xstate to task's xsave area in __save_fpu during booting time
x86/xsaves: Add xsaves and xrstors support for booting time
x86/xsaves: Clear reserved bits in xsave header
x86/xsaves: Use xsave/xrstor for saving and restoring user space context
x86/xsaves: Use xsaves/xrstors for context switch
x86/xsaves: Use xsaves/xrstors to save and restore xsave area
x86/xsaves: Define a macro for handling xsave/xrstor instruction fault
x86/xsaves: Define macros for xsave instructions
x86/xsaves: Change compacted format xsave area header
x86/alternative: Add alternative_input_2 to support alternative with two features and input
x86/xsaves: Add a kernel parameter noxsaves to disable xsaves/xrstors
Keeping track of all the various CPU names is hard enough; adding extra
silly names for no reason is just not helping. If we know the base arch
name (IvyBridge) then we can do the client/server parts with the well
known {,EP,EX} postfixes, no need to remember endless amounts of
unrelated and pointless names for this.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-8559jke61dsyr7d0i74iutli@git.kernel.org
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch exposes two basic events for Ivytown IMC uncore PMU:
- cas_count_read: number of full-cache line reads to memory controller
- cas_count_write: number of full-cache line writes to memory controller
Those events use the same encoding as for SNB-EP, so reuse the same
event table. See specification in:
http://www.intel.com/content/dam/www/public/us/en/documents/manuals/xeon-e5-2600-v2-uncore-manual.pdf
By aggregating all the read and write events from all the memory controllers
of each processor socket, one can determine the total memory bandwidth utilization.
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140812060031.GA25239@quad
Cc: zheng.z.yan@intel.com
Cc: ak@linux.intel.com
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch makes the code more readable. It also renames
precise_store_data_hsw() to precise_datala_hsw() because
the function is called for both loads and stores on HSW.
The patch also gets rid of the hardcoded store events
codes in that same function.
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1407785233-32193-5-git-send-email-eranian@google.com
Cc: ak@linux.intel.com
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch fixes issues introuduce by Andi's previous patch 'Revamp PEBS'
series.
This patch fixes the following:
- precise_store_data_hsw() encode the mem op type whenever we can
- precise_store_data_hsw set the default data source correctly
- 0 is not a valid init value for data source. Define PERF_MEM_NA as the
default value
This bug was actually introduced by
commit 722e76e60f
Author: Stephane Eranian <eranian@google.com>
Date: Thu May 15 17:56:44 2014 +0200
fix Haswell precise store data source encoding
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1407785233-32193-4-git-send-email-eranian@google.com
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: ak@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Haswell supports reporting the data address for a range
of PEBS events, including:
UOPS_RETIRED.ALL
MEM_UOPS_RETIRED.STLB_MISS_LOADS
MEM_UOPS_RETIRED.STLB_MISS_STORES
MEM_UOPS_RETIRED.LOCK_LOADS
MEM_UOPS_RETIRED.SPLIT_LOADS
MEM_UOPS_RETIRED.SPLIT_STORES
MEM_UOPS_RETIRED.ALL_LOADS
MEM_UOPS_RETIRED.ALL_STORES
MEM_LOAD_UOPS_RETIRED.L1_HIT
MEM_LOAD_UOPS_RETIRED.L2_HIT
MEM_LOAD_UOPS_RETIRED.L3_HIT
MEM_LOAD_UOPS_RETIRED.L1_MISS
MEM_LOAD_UOPS_RETIRED.L2_MISS
MEM_LOAD_UOPS_RETIRED.L3_MISS
MEM_LOAD_UOPS_RETIRED.HIT_LFB
MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS
MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT
MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM
MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_NONE
MEM_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM
This facility was already enabled earlier with the original Haswell
perf changes.
However these addresses were always reports as stores by perf, which is wrong,
as they could be loads too. The hardware does not distinguish loads and stores
for these instructions, so there's no (cheap) way for the profiler
to find out.
Change the type to PERF_MEM_OP_NA instead.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Link: http://lkml.kernel.org/r/1407785233-32193-3-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The basic idea is that it does not make sense to list all PEBS
events individually. The list is very long, sometimes outdated
and the hardware doesn't need it. If an event does not support
PEBS it will just not count, there is no security issue.
We need to only list events that something special, like
supporting load or store addresses.
This vastly simplifies the PEBS event selection. It also
speeds up the scheduling because the scheduler doesn't
have to walk as many constraints.
Bugs fixed:
- We do not allow setting forbidden flags with PEBS anymore
(SDM 18.9.4), except for the special cycle event.
This is done using a new constraint macro that also
matches on the event flags.
- Correct DataLA and load/store/na flags reporting on Haswell
[Requires a followon patch]
- We did not allow all PEBS events on Haswell:
We were missing some valid subevents in d1-d2 (MEM_LOAD_UOPS_RETIRED.*,
MEM_LOAD_UOPS_RETIRED_L3_HIT_RETIRED.*)
This includes the changes proposed by Stephane earlier and obsoletes
his patchkit (except for some changes on pre Sandy Bridge/Silvermont
CPUs)
I only did Sandy Bridge and Silvermont and later so far, mostly because these
are the parts I could directly confirm the hardware behavior with hardware
architects. Also I do not believe the older CPUs have any
missing events in their PEBS list, so there's no pressing
need to change them.
I did not implement the flag proposed by Peter to allow
setting forbidden flags. If really needed this could
be implemented on to of this patch.
v2: Fix broken store events on SNB/IVB (Stephane Eranian)
v3: More fixes. Rename some arguments (Stephane Eranian)
v4: List most Haswell events individually again to report
memory operation type correctly.
Add new flags to describe load/store/na for datala.
Update description.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1407785233-32193-2-git-send-email-eranian@google.com
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Maria Dimakopoulou <maria.n.dimakopoulou@gmail.com>
Cc: Mark Davies <junk@eslaf.co.uk>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This fixes a side effect of Kan's earlier patch to probe the LBRs at boot
time. Normally when the LBRs are disabled cycles:pp is disabled too.
So for example cycles:pp doesn't work.
However this is not needed with PEBSv2 and later (Haswell) because
it does not need LBRs to correct the IP-off-by-one.
So add an extra check for PEBSv2 that also allows :pp
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: kan.liang@intel.com
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Link: http://lkml.kernel.org/r/1407456534-15747-1-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
HSW-EP has a larger offcore mask than the client Haswell CPUs.
It is the same mask as on Sandy/IvyBridge-EP. All of
Haswell was using the client mask, so some bits were missing.
On the client parts some bits were also missing compared
to Sandy/IvyBridge, in particular the bits to match on a L4
cache hit.
The Haswell core in both client and server incarnations
accepts the same bits (but some are nops), so we can use
the same mask.
So use the snbep extended mask, which is a superset of the
client and the server, for all of Haswell.
This allows specifying a number of extra offcore events, like
for example for HSW-EP.
% perf stat -e cpu/event=0xb7,umask=0x1,offcore_rsp=0x3fffc00100,name=offcore_response_pf_l3_rfo_l3_miss_any_response/ true
which were <not supported> before.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: eranian@google.com
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Link: http://lkml.kernel.org/r/1406840722-25416-1-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The model number descriptions got a bit messy, clean them up.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/n/tip-oo3xclxdoy8s7ubssn929vaj@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We should prefer `struct pci_device_id` over `DEFINE_PCI_DEVICE_TABLE` to
meet kernel coding style guidelines. This issue was reported by checkpatch.
A simplified version of the semantic patch that makes this change is as
follows (http://coccinelle.lip6.fr/):
// <smpl>
@@
identifier i;
declarer name DEFINE_PCI_DEVICE_TABLE;
initializer z;
@@
- DEFINE_PCI_DEVICE_TABLE(i)
+ const struct pci_device_id i[]
= z;
// </smpl>
[bhelgaas: add semantic patch]
Signed-off-by: Benoit Taine <benoit.taine@lip6.fr>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Commit ea431643d6 ("x86/mce: Fix CMCI preemption bugs") breaks RT by
the completely unrelated conversion of the cmci_discover_lock to a
regular (non raw) spinlock. This lock was annotated in commit
59d958d2c7 ("locking, x86: mce: Annotate cmci_discover_lock as raw")
with a proper explanation why.
The argument for converting the lock back to a regular spinlock was:
- it does percpu ops without disabling preemption. Preemption is not
disabled due to the mistaken use of a raw spinlock.
Which is complete nonsense. The raw_spinlock is disabling preemption in
the same way as a regular spinlock. In mainline spinlock maps to
raw_spinlock, in RT spinlock becomes a "sleeping" lock.
raw_spinlock has on RT exactly the same semantics as in mainline. And
because this lock is taken in non preemptible context it must be raw on
RT.
Undo the locking brainfart.
Reported-by: Clark Williams <williams@redhat.com>
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull RAS updates from Ingo Molnar:
"The main changes in this cycle are:
- RAS tracing/events infrastructure, by Gong Chen.
- Various generalizations of the APEI code to make it available to
non-x86 architectures, by Tomasz Nowicki"
* 'x86-ras-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/ras: Fix build warnings in <linux/aer.h>
acpi, apei, ghes: Factor out ioremap virtual memory for IRQ and NMI context.
acpi, apei, ghes: Make NMI error notification to be GHES architecture extension.
apei, mce: Factor out APEI architecture specific MCE calls.
RAS, extlog: Adjust init flow
trace, eMCA: Add a knob to adjust where to save event log
trace, RAS: Add eMCA trace event interface
RAS, debugfs: Add debugfs interface for RAS subsystem
CPER: Adjust code flow of some functions
x86, MCE: Robustify mcheck_init_device
trace, AER: Move trace into unified interface
trace, RAS: Add basic RAS trace event
x86, MCE: Kill CPU_POST_DEAD
Pull x86 mm changes from Ingo Molnar:
"The main change in this cycle is the rework of the TLB range flushing
code, to simplify, fix and consolidate the code. By Dave Hansen"
* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm: Set TLB flush tunable to sane value (33)
x86/mm: New tunable for single vs full TLB flush
x86/mm: Add tracepoints for TLB flushes
x86/mm: Unify remote INVLPG code
x86/mm: Fix missed global TLB flush stat
x86/mm: Rip out complicated, out-of-date, buggy TLB flushing
x86/mm: Clean up the TLB flushing code
x86/smep: Be more informative when signalling an SMEP fault
Pull x86 cpufeature updates from Ingo Molnar:
"The main changes in this cycle were:
- Continued cleanups of CPU bugs mis-marked as 'missing features', by
Borislav Petkov.
- Detect the xsaves/xrstors feature and releated cleanup, by Fenghua
Yu"
* 'x86-cpufeature-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, cpu: Kill cpu_has_mp
x86, amd: Cleanup init_amd
x86/cpufeature: Add bug flags to /proc/cpuinfo
x86, cpufeature: Convert more "features" to bugs
x86/xsaves: Detect xsaves/xrstors feature
x86/cpufeature.h: Reformat x86 feature macros
I think the flush_tlb_mm_range() code that tries to tune the
flush sizes based on the CPU needs to get ripped out for
several reasons:
1. It is obviously buggy. It uses mm->total_vm to judge the
task's footprint in the TLB. It should certainly be using
some measure of RSS, *NOT* ->total_vm since only resident
memory can populate the TLB.
2. Haswell, and several other CPUs are missing from the
intel_tlb_flushall_shift_set() function. Thus, it has been
demonstrated to bitrot quickly in practice.
3. It is plain wrong in my vm:
[ 0.037444] Last level iTLB entries: 4KB 0, 2MB 0, 4MB 0
[ 0.037444] Last level dTLB entries: 4KB 0, 2MB 0, 4MB 0
[ 0.037444] tlb_flushall_shift: 6
Which leads to it to never use invlpg.
4. The assumptions about TLB refill costs are wrong:
http://lkml.kernel.org/r/1337782555-8088-3-git-send-email-alex.shi@intel.com
(more on this in later patches)
5. I can not reproduce the original data: https://lkml.org/lkml/2012/5/17/59
I believe the sample times were too short. Running the
benchmark in a loop yields times that vary quite a bit.
Note that this leaves us with a static ceiling of 1 page. This
is a conservative, dumb setting, and will be revised in a later
patch.
This also removes the code which attempts to predict whether we
are flushing data or instructions. We expect instruction flushes
to be relatively rare and not worth tuning for explicitly.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: http://lkml.kernel.org/r/20140731154055.ABC88E89@viggo.jf.intel.com
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
The primary dependency is that GHES uses the x86 NMI for hardware
error notification and MCE for memory error handling. These patches
remove that dependency.
Other APEI features such as error reporting via external IRQ, error
serialization, or error injection, do not require changes to use them
on non-x86 architectures.
The following patch set eliminates the APEI Kconfig x86 dependency
by making these changes:
- treat NMI notification as GHES architecture - HAVE_ACPI_APEI_NMI
- group and wrap around #ifdef CONFIG_HAVE_ACPI_APEI_NMI code which
is used only for NMI path
- identify architectural boxes and abstract it accordingly (tlb flush and MCE)
- rework ioremap for both IRQ and NMI context
NMI code is kept in ghes.c file since NMI and IRQ context are tightly coupled.
Note, these patches introduce no functional changes for x86. The NMI notification
feature is hard selected for x86. Architectures that want to use this
feature should also provide NMI code infrastructure.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJT2BaPAAoJEKurIx+X31iBLGMP/0yyWOna4229p9CmuElSP3os
Kb+9Thru+Wg4ihj43CYW0nznQnamCaqBa5NpDXZn0Ebtxc08SSGVzbf+z+vBMeD+
HW4093m4g8sGL7i4JdAol0MEPpKTQRdpj525N/h/xWVSDXQ0Bq3vQ7DS1/j1Bp4k
Lq3G8dEk+4LjNPcQ5YBPl71zWJOC4iUctfh1OpFdfgA04804Vis3j8T6ljE7/72M
51xXK3af9ktIg6MU2HOwraUsSspVeJs/4lPu4fab4XI07BRDb4T7yx19a9VaBy67
m6TaTd3eC/Z0Uh+51grNuXSnWQK4fvahRZJEwiRdC0wL3w3mhdZkmqm0nBdBFyof
5b251+FOazOtZdMsWS/mMjQUjybQ+4k9zpnndIPw/5rqxJ8lgaP7o81e+hw1Xh1Q
E0ZWUMXnAIkRmkyYLUv5aTICRYIZtAC/C1QrR5ZB/9Q+yvtxp13dbqGzWhcF7AIw
UK/yb5T5ZAzvuJlmPG0ZiV75HH9bjX4OFV3AhXJIEG/iTOdVVpat8yICFrT33Xpc
uAwRXQvz6mn2c2xpZcJqSJQlXKg2nbrfUmscU8P8Zu6mQpvBB/+2cDbW/5wfuKbE
NpD0aB5PxhHY+nNvIfOsTUk72aZcZdUEQJt/792vhnMYb/IK1X/qa4zrVmOqlZKt
mtXwUQWdj3kSG36mgssO
=nYdd
-----END PGP SIGNATURE-----
Merge tag 'please-pull-apei' into x86/ras
APEI is currently implemented so that it depends on x86 hardware.
The primary dependency is that GHES uses the x86 NMI for hardware
error notification and MCE for memory error handling. These patches
remove that dependency.
Other APEI features such as error reporting via external IRQ, error
serialization, or error injection, do not require changes to use them
on non-x86 architectures.
The following patch set eliminates the APEI Kconfig x86 dependency
by making these changes:
- treat NMI notification as GHES architecture - HAVE_ACPI_APEI_NMI
- group and wrap around #ifdef CONFIG_HAVE_ACPI_APEI_NMI code which
is used only for NMI path
- identify architectural boxes and abstract it accordingly (tlb flush and MCE)
- rework ioremap for both IRQ and NMI context
NMI code is kept in ghes.c file since NMI and IRQ context are tightly coupled.
Note, these patches introduce no functional changes for x86. The NMI notification
feature is hard selected for x86. Architectures that want to use this
feature should also provide NMI code infrastructure.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJT1VYNAAoJEHm+PkMAQRiGQJwIAKSYp1Uqz5O/e5r0V1TlZKT4
1B4Njopl57PwSrJQWcGEuH2yHyM896vfPO4L6BJIOfyWzh8kwpQqclDt6uhXoF/v
OsO1zb/7/j+n/pDZsePqP9AyIgErsHEBgUbhecDqzjN++ITPcZjQ6TIMPglZaumN
jFAdAZuAaEwqAk8jqN2wlm689Fh9MuUEarHXbXLCqu5RgLrWhFGhp/cTWY62aqnZ
XfEeQ9KtpRZmlR/IYjerbb1eRH7ZdJsZ88WngLX9dj/JdNxHWBkWQBXGAusXk5Fk
y6LsIV3TjyBdrRKJ1Ifyg/2EIXHNBs8HxTFGXpjtp2HPuMLDxZOWOWikb9URtNg=
=Fjf4
-----END PGP SIGNATURE-----
Merge tag 'v3.16-rc7' into perf/core, to merge in the latest fixes before applying new changes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull perf fixes from Thomas Gleixner:
"A bunch of fixes for perf and kprobes:
- revert a commit that caused a perf group regression
- silence dmesg spam
- fix kprobe probing errors on ia64 and ppc64
- filter kprobe faults from userspace
- lockdep fix for perf exit path
- prevent perf #GP in KVM guest
- correct perf event and filters"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
kprobes: Fix "Failed to find blacklist" probing errors on ia64 and ppc64
kprobes/x86: Don't try to resolve kprobe faults from userspace
perf/x86/intel: Avoid spamming kernel log for BTS buffer failure
perf/x86/intel: Protect LBR and extra_regs against KVM lying
perf: Fix lockdep warning on process exit
perf/x86/intel/uncore: Fix SNB-EP/IVT Cbox filter mappings
perf/x86/intel: Use proper dTLB-load-misses event on IvyBridge
perf: Revert ("perf: Always destroy groups on exit")