alloc_pages_exact_node() was introduced in commit 6484eb3e2a ("page
allocator: do not check NUMA node ID when the caller knows the node is
valid") as an optimized variant of alloc_pages_node(), that doesn't
fallback to current node for nid == NUMA_NO_NODE. Unfortunately the
name of the function can easily suggest that the allocation is
restricted to the given node and fails otherwise. In truth, the node is
only preferred, unless __GFP_THISNODE is passed among the gfp flags.
The misleading name has lead to mistakes in the past, see for example
commits 5265047ac3 ("mm, thp: really limit transparent hugepage
allocation to local node") and b360edb43f ("mm, mempolicy:
migrate_to_node should only migrate to node").
Another issue with the name is that there's a family of
alloc_pages_exact*() functions where 'exact' means exact size (instead
of page order), which leads to more confusion.
To prevent further mistakes, this patch effectively renames
alloc_pages_exact_node() to __alloc_pages_node() to better convey that
it's an optimized variant of alloc_pages_node() not intended for general
usage. Both functions get described in comments.
It has been also considered to really provide a convenience function for
allocations restricted to a node, but the major opinion seems to be that
__GFP_THISNODE already provides that functionality and we shouldn't
duplicate the API needlessly. The number of users would be small
anyway.
Existing callers of alloc_pages_exact_node() are simply converted to
call __alloc_pages_node(), with the exception of sba_alloc_coherent()
which open-codes the check for NUMA_NO_NODE, so it is converted to use
alloc_pages_node() instead. This means it no longer performs some
VM_BUG_ON checks, and since the current check for nid in
alloc_pages_node() uses a 'nid < 0' comparison (which includes
NUMA_NO_NODE), it may hide wrong values which would be previously
exposed.
Both differences will be rectified by the next patch.
To sum up, this patch makes no functional changes, except temporarily
hiding potentially buggy callers. Restricting the checks in
alloc_pages_node() is left for the next patch which can in turn expose
more existing buggy callers.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Robin Holt <robinmholt@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Gleb Natapov <gleb@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Cliff Whickman <cpw@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
1/ Introduce ZONE_DEVICE and devm_memremap_pages() as a generic
mechanism for adding device-driver-discovered memory regions to the
kernel's direct map. This facility is used by the pmem driver to
enable pfn_to_page() operations on the page frames returned by DAX
('direct_access' in 'struct block_device_operations'). For now, the
'memmap' allocation for these "device" pages comes from "System
RAM". Support for allocating the memmap from device memory will
arrive in a later kernel.
2/ Introduce memremap() to replace usages of ioremap_cache() and
ioremap_wt(). memremap() drops the __iomem annotation for these
mappings to memory that do not have i/o side effects. The
replacement of ioremap_cache() with memremap() is limited to the
pmem driver to ease merging the api change in v4.3. Completion of
the conversion is targeted for v4.4.
3/ Similar to the usage of memcpy_to_pmem() + wmb_pmem() in the pmem
driver, update the VFS DAX implementation and PMEM api to provide
persistence guarantees for kernel operations on a DAX mapping.
4/ Convert the ACPI NFIT 'BLK' driver to map the block apertures as
cacheable to improve performance.
5/ Miscellaneous updates and fixes to libnvdimm including support
for issuing "address range scrub" commands, clarifying the optimal
'sector size' of pmem devices, a clarification of the usage of the
ACPI '_STA' (status) property for DIMM devices, and other minor
fixes.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJV6Nx7AAoJEB7SkWpmfYgCWyYQAI5ju6Gvw27RNFtPovHcZUf5
JGnxXejI6/AqeTQ+IulgprxtEUCrXOHjCDA5dkjr1qvsoqK1qxug+vJHOZLgeW0R
OwDtmdW4Qrgeqm+CPoxETkorJ8wDOc8mol81kTiMgeV3UqbYeeHIiTAmwe7VzZ0C
nNdCRDm5g8dHCjTKcvK3rvozgyoNoWeBiHkPe76EbnxDICxCB5dak7XsVKNMIVFQ
NuYlnw6IYN7+rMHgpgpRux38NtIW8VlYPWTmHExejc2mlioWMNBG/bmtwLyJ6M3e
zliz4/cnonTMUaizZaVozyinTa65m7wcnpjK+vlyGV2deDZPJpDRvSOtB0lH30bR
1gy+qrKzuGKpaN6thOISxFLLjmEeYwzYd7SvC9n118r32qShz+opN9XX0WmWSFlA
sajE1ehm4M7s5pkMoa/dRnAyR8RUPu4RNINdQ/Z9jFfAOx+Q26rLdQXwf9+uqbEb
bIeSQwOteK5vYYCstvpAcHSMlJAglzIX5UfZBvtEIJN7rlb0VhmGWfxAnTu+ktG1
o9cqAt+J4146xHaFwj5duTsyKhWb8BL9+xqbKPNpXEp+PbLsrnE/+WkDLFD67jxz
dgIoK60mGnVXp+16I2uMqYYDgAyO5zUdmM4OygOMnZNa1mxesjbDJC6Wat1Wsndn
slsw6DkrWT60CRE42nbK
=o57/
-----END PGP SIGNATURE-----
Merge tag 'libnvdimm-for-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm updates from Dan Williams:
"This update has successfully completed a 0day-kbuild run and has
appeared in a linux-next release. The changes outside of the typical
drivers/nvdimm/ and drivers/acpi/nfit.[ch] paths are related to the
removal of IORESOURCE_CACHEABLE, the introduction of memremap(), and
the introduction of ZONE_DEVICE + devm_memremap_pages().
Summary:
- Introduce ZONE_DEVICE and devm_memremap_pages() as a generic
mechanism for adding device-driver-discovered memory regions to the
kernel's direct map.
This facility is used by the pmem driver to enable pfn_to_page()
operations on the page frames returned by DAX ('direct_access' in
'struct block_device_operations').
For now, the 'memmap' allocation for these "device" pages comes
from "System RAM". Support for allocating the memmap from device
memory will arrive in a later kernel.
- Introduce memremap() to replace usages of ioremap_cache() and
ioremap_wt(). memremap() drops the __iomem annotation for these
mappings to memory that do not have i/o side effects. The
replacement of ioremap_cache() with memremap() is limited to the
pmem driver to ease merging the api change in v4.3.
Completion of the conversion is targeted for v4.4.
- Similar to the usage of memcpy_to_pmem() + wmb_pmem() in the pmem
driver, update the VFS DAX implementation and PMEM api to provide
persistence guarantees for kernel operations on a DAX mapping.
- Convert the ACPI NFIT 'BLK' driver to map the block apertures as
cacheable to improve performance.
- Miscellaneous updates and fixes to libnvdimm including support for
issuing "address range scrub" commands, clarifying the optimal
'sector size' of pmem devices, a clarification of the usage of the
ACPI '_STA' (status) property for DIMM devices, and other minor
fixes"
* tag 'libnvdimm-for-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (34 commits)
libnvdimm, pmem: direct map legacy pmem by default
libnvdimm, pmem: 'struct page' for pmem
libnvdimm, pfn: 'struct page' provider infrastructure
x86, pmem: clarify that ARCH_HAS_PMEM_API implies PMEM mapped WB
add devm_memremap_pages
mm: ZONE_DEVICE for "device memory"
mm: move __phys_to_pfn and __pfn_to_phys to asm/generic/memory_model.h
dax: drop size parameter to ->direct_access()
nd_blk: change aperture mapping from WC to WB
nvdimm: change to use generic kvfree()
pmem, dax: have direct_access use __pmem annotation
dax: update I/O path to do proper PMEM flushing
pmem: add copy_from_iter_pmem() and clear_pmem()
pmem, x86: clean up conditional pmem includes
pmem: remove layer when calling arch_has_wmb_pmem()
pmem, x86: move x86 PMEM API to new pmem.h header
libnvdimm, e820: make CONFIG_X86_PMEM_LEGACY a tristate option
pmem: switch to devm_ allocations
devres: add devm_memremap
libnvdimm, btt: write and validate parent_uuid
...
The changes with more meat are:
o Allowing the trace event filters to filter on CPU number and process ids
o Two new markers for trace output latency were added
(10 and 100 msec latencies)
o Have tracing_thresh filter function profiling time
I also worked on modifying the ring buffer code for some future
work, and moved the adding of the timestamp around. One of my changes
caused a regression, and since other changes were built on top of it
and already tested, I had to operate a revert of that change. Instead
of rebasing, this change set has the code that caused a regression
as well as the code to revert that change without touching the other
changes that were made on top of it.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJV6aZEAAoJEEjnJuOKh9ldrR4H/A1RcQf1prLLoUibPP4w3lat
dmQcdpS1NY+cqyiKuKPAOkFDGQL7qWzRqZ8whcPSJIsHq57ufqNSLf+0bbQYPzg9
g3CgGL7OApmGi5ulj0sNxhadvc9TFm/SAN0nVJlNuUWdm8e1UWHLsrJZaMfopu2r
RDEtkOhg619mhDL4rktNdS6rk0B92Fhu2o2PwLZPVlUl1NNEt4WJU+ejitXUVO1A
Nb70/rTGGJKtyHbW+74on4LnEN5Uu0Viu6rMwGfYyIgRmC2otdBDvE4xfKMiTUKr
SzBjzrhIoMIRn4Vl0vElfulkpYaw7pcC2BdpZ4d9VpIOiLSlZs0x/TgCtpFEv5M=
=baZ3
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing update from Steven Rostedt:
"Mostly this is just clean ups and micro optimizations.
The changes with more meat are:
- Allowing the trace event filters to filter on CPU number and
process ids
- Two new markers for trace output latency were added (10 and 100
msec latencies)
- Have tracing_thresh filter function profiling time
I also worked on modifying the ring buffer code for some future work,
and moved the adding of the timestamp around. One of my changes
caused a regression, and since other changes were built on top of it
and already tested, I had to operate a revert of that change. Instead
of rebasing, this change set has the code that caused a regression as
well as the code to revert that change without touching the other
changes that were made on top of it"
* tag 'trace-v4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ring-buffer: Revert "ring-buffer: Get timestamp after event is allocated"
tracing: Don't make assumptions about length of string on task rename
tracing: Allow triggers to filter for CPU ids and process names
ftrace: Format MCOUNT_ADDR address as type unsigned long
tracing: Introduce two additional marks for delay
ftrace: Fix function_graph duration spacing with 7-digits
ftrace: add tracing_thresh to function profile
tracing: Clean up stack tracing and fix fentry updates
ring-buffer: Reorganize function locations
ring-buffer: Make sure event has enough room for extend and padding
ring-buffer: Get timestamp after event is allocated
ring-buffer: Move the adding of the extended timestamp out of line
ring-buffer: Add event descriptor to simplify passing data
ftrace: correct the counter increment for trace_buffer data
tracing: Fix for non-continuous cpu ids
tracing: Prefer kcalloc over kzalloc with multiply
Commit f32393c943 ("powerpc/pseries: Correct cpu affinity for
dlpar added cpus") moved dlpar_acquire_drc() call to before
dlpar_configure_connector() call in dlpar_cpu_probe(), but missed
to release the DRC if dlpar_configure_connector() failed.
During CPU hotplug, if configure-connector fails for any reason,
then this will result in subsequent CPU hotplug attempts to fail.
Release the acquired DRC if dlpar_configure_connector() call fails
so that the DRC is left in right isolation and allocation state
for the subsequent hotplug operation to succeed.
Fixes: f32393c943 ("powerpc/pseries: Correct cpu affinity for dlpar added cpus")
Cc: stable@vger.kernel.org # 4.1+
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Reviewed-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The 32-bit TCE table initialization relies on the DMA window having a
size equal to a power of 2 (and checks for it explicitly). But
crashkernel= has no constraint that requires a power-of-2 be specified.
This causes the kdump kernel to fail to boot as none of the PCI devices
(including the disk controller) are successfully initialized.
After this change, the PCI devices successfully set up the 32-bit TCE
table and kdump succeeds.
Fixes: aca6913f55 ("powerpc/powernv/ioda2: Introduce helpers to allocate TCE pages")
Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Cc: stable@vger.kernel.org # 4.2
Tested-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When attempting to kdump with the 4.2 kernel, we see for each PCI
device:
pci 0003:01 : [PE# 000] Assign DMA32 space
pci 0003:01 : [PE# 000] Setting up 32-bit TCE table at 0..80000000
pci 0003:01 : [PE# 000] Failed to create 32-bit TCE table, err -22
PCI: Domain 0004 has 8 available 32-bit DMA segments
PCI: 4 PE# for a total weight of 70
pci 0004:01 : [PE# 002] Assign DMA32 space
pci 0004:01 : [PE# 002] Setting up 32-bit TCE table at 0..80000000
pci 0004:01 : [PE# 002] Failed to create 32-bit TCE table, err -22
pci 0004:0d : [PE# 005] Assign DMA32 space
pci 0004:0d : [PE# 005] Setting up 32-bit TCE table at 0..80000000
pci 0004:0d : [PE# 005] Failed to create 32-bit TCE table, err -22
pci 0004:0e : [PE# 006] Assign DMA32 space
pci 0004:0e : [PE# 006] Setting up 32-bit TCE table at 0..80000000
pci 0004:0e : [PE# 006] Failed to create 32-bit TCE table, err -22
pci 0004:10 : [PE# 008] Assign DMA32 space
pci 0004:10 : [PE# 008] Setting up 32-bit TCE table at 0..80000000
pci 0004:10 : [PE# 008] Failed to create 32-bit TCE table, err -22
and eventually the kdump kernel fails to boot as none of the PCI devices
(including the disk controller) are successfully initialized.
The EINVAL response is because the DMA window (the 2GB base window) is
larger than the kdump kernel's reserved memory (crashkernel=, in this
case specified to be 1024M). The check in question,
if ((window_size > memory_hotplug_max()) || !is_power_of_2(window_size))
is a valid sanity check for pnv_pci_ioda2_table_alloc_pages(), so adjust
the caller to pass in a smaller window size if our maximum memory value
is smaller than the DMA window.
After this change, the PCI devices successfully set up the 32-bit TCE
table and kdump succeeds.
The problem was seen on a Firestone machine originally.
Fixes: aca6913f55 ("powerpc/powernv/ioda2: Introduce helpers to allocate TCE pages")
Cc: stable@vger.kernel.org # 4.2
Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[mpe: Coding style pedantry, use u64, change the indentation]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
We cannot detect clang before including the arch Makefile, because that
can set the default cross compiler. We also cannot detect clang after
including the arch Makefile, because powerpc wants to know about clang.
Solve this by using an deferred variable. This costs us a few shell
invocations, but this is only a constant number.
Reported-by: Behan Webster <behanw@converseincode.com>
Reported-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michal Marek <mmarek@suse.com>
- Support "hybrid" iommu/direct DMA ops for coherent_mask < dma_mask from Benjamin Herrenschmidt
- EEH fixes for SRIOV from Gavin
- Introduce rtas_get_sensor_fast() for IRQ handlers from Thomas Huth
- Use hardware RNG for arch_get_random_seed_* not arch_get_random_* from Paul Mackerras
- Seccomp filter support from Michael Ellerman
- opal_cec_reboot2() handling for HMIs & machine checks from Mahesh Salgaonkar
- Add powerpc timebase as a trace clock source from Naveen N. Rao
- Misc cleanups in the xmon, signal & SLB code from Anshuman Khandual
- Add an inline function to update POWER8 HID0 from Gautham R. Shenoy
- Fix pte_pagesize_index() crash on 4K w/64K hash from Michael Ellerman
- Drop support for 64K local store on 4K kernels from Michael Ellerman
- move dma_get_required_mask() from pnv_phb to pci_controller_ops from Andrew Donnellan
- Initialize distance lookup table from drconf path from Nikunj A Dadhania
- Enable RTC class support from Vaibhav Jain
- Disable automatically blocked PCI config from Gavin Shan
- Add LEDs driver for PowerNV platform from Vasant Hegde
- Fix endianness issues in the HVSI driver from Laurent Dufour
- Kexec endian fixes from Samuel Mendoza-Jonas
- Fix corrupted pdn list from Gavin Shan
- Fix fenced PHB caused by eeh_slot_error_detail() from Gavin Shan
- Freescale updates from Scott: Highlights include 32-bit memcpy/memset
optimizations, checksum optimizations, 85xx config fragments and updates,
device tree updates, e6500 fixes for non-SMP, and misc cleanup and minor
fixes.
- A ton of cxl updates & fixes:
- Add explicit precision specifiers from Rasmus Villemoes
- use more common format specifier from Rasmus Villemoes
- Destroy cxl_adapter_idr on module_exit from Johannes Thumshirn
- Destroy afu->contexts_idr on release of an afu from Johannes Thumshirn
- Compile with -Werror from Daniel Axtens
- EEH support from Daniel Axtens
- Plug irq_bitmap getting leaked in cxl_context from Vaibhav Jain
- Add alternate MMIO error handling from Ian Munsie
- Allow release of contexts which have been OPENED but not STARTED from Andrew Donnellan
- Remove use of macro DEFINE_PCI_DEVICE_TABLE from Vaishali Thakkar
- Release irqs if memory allocation fails from Vaibhav Jain
- Remove racy attempt to force EEH invocation in reset from Daniel Axtens
- Fix + cleanup error paths in cxl_dev_context_init from Ian Munsie
- Fix force unmapping mmaps of contexts allocated through the kernel api from Ian Munsie
- Set up and enable PSL Timebase from Philippe Bergheaud
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJV5+GzAAoJEFHr6jzI4aWA0iAP/jcd0kNaNBzLgcDKKygKdgz4
xn4EWu81vfMfZYWesb0ATrjlH0hLsRxSXoFUqUMhtJTa5kNAoCIaz/M8WBALS50h
aT+i7br4WEU2j2FcaMyP3iAZx/2hl+2utODJSHPRWPkec1fUDBfEyBf++e520RWM
HUQGIGZXh8yq7KMA96Pwhsvls9vOB8hS2UdU/NS8ff3J5jFvXC1/WmF2qfzJBS1V
8iHyz26Jl8+dJ+et7iC2oD5XQAjIH1oJgOyPVPBzAQttfi8RjuVzRA30TfPBAUwI
lC9nlmPy6bCe4kiQYWVB1z7GegHyW/9vkeuMj/u8mZbqpaayMEMZmd2C3iNDXNHx
i2NSvdln539t4qWYsV2v6lVCfa/ayDHD73Wackj5Dk394tzXnpCPhxNzc2yKEd5v
h7vwYc9jBhsbfSCSogaM+gSHJ1APgCidggHJMYYNA2nN2u6V62RpsMB7zp/1+Q2v
yqYdD8oYF4Dm21x/ujaNFrlizROD46WS0UqdJ3yP6HAqRYIpRXtibmpECJgt1n5h
HjADEci4hQ2UQxdMdp/Q5KZnPTJebBtrZrmkW5r6cZBUaTB5TVkFaEWN44CT/Loh
tMNeA3qOBN06CaQS2WL3UUUWpbZq9fSbWuUZ5lWZDb5AOyRxe5eWVYNLkiyIXozY
L24l1bYdBhXahnjoS/kc
=n9+X
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
- support "hybrid" iommu/direct DMA ops for coherent_mask < dma_mask
from Benjamin Herrenschmidt
- EEH fixes for SRIOV from Gavin
- introduce rtas_get_sensor_fast() for IRQ handlers from Thomas Huth
- use hardware RNG for arch_get_random_seed_* not arch_get_random_*
from Paul Mackerras
- seccomp filter support from Michael Ellerman
- opal_cec_reboot2() handling for HMIs & machine checks from Mahesh
Salgaonkar
- add powerpc timebase as a trace clock source from Naveen N. Rao
- misc cleanups in the xmon, signal & SLB code from Anshuman Khandual
- add an inline function to update POWER8 HID0 from Gautham R. Shenoy
- fix pte_pagesize_index() crash on 4K w/64K hash from Michael Ellerman
- drop support for 64K local store on 4K kernels from Michael Ellerman
- move dma_get_required_mask() from pnv_phb to pci_controller_ops from
Andrew Donnellan
- initialize distance lookup table from drconf path from Nikunj A
Dadhania
- enable RTC class support from Vaibhav Jain
- disable automatically blocked PCI config from Gavin Shan
- add LEDs driver for PowerNV platform from Vasant Hegde
- fix endianness issues in the HVSI driver from Laurent Dufour
- kexec endian fixes from Samuel Mendoza-Jonas
- fix corrupted pdn list from Gavin Shan
- fix fenced PHB caused by eeh_slot_error_detail() from Gavin Shan
- Freescale updates from Scott: Highlights include 32-bit memcpy/memset
optimizations, checksum optimizations, 85xx config fragments and
updates, device tree updates, e6500 fixes for non-SMP, and misc
cleanup and minor fixes.
- a ton of cxl updates & fixes:
- add explicit precision specifiers from Rasmus Villemoes
- use more common format specifier from Rasmus Villemoes
- destroy cxl_adapter_idr on module_exit from Johannes Thumshirn
- destroy afu->contexts_idr on release of an afu from Johannes
Thumshirn
- compile with -Werror from Daniel Axtens
- EEH support from Daniel Axtens
- plug irq_bitmap getting leaked in cxl_context from Vaibhav Jain
- add alternate MMIO error handling from Ian Munsie
- allow release of contexts which have been OPENED but not STARTED
from Andrew Donnellan
- remove use of macro DEFINE_PCI_DEVICE_TABLE from Vaishali Thakkar
- release irqs if memory allocation fails from Vaibhav Jain
- remove racy attempt to force EEH invocation in reset from Daniel
Axtens
- fix + cleanup error paths in cxl_dev_context_init from Ian Munsie
- fix force unmapping mmaps of contexts allocated through the kernel
api from Ian Munsie
- set up and enable PSL Timebase from Philippe Bergheaud
* tag 'powerpc-4.3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (140 commits)
cxl: Set up and enable PSL Timebase
cxl: Fix force unmapping mmaps of contexts allocated through the kernel api
cxl: Fix + cleanup error paths in cxl_dev_context_init
powerpc/eeh: Fix fenced PHB caused by eeh_slot_error_detail()
powerpc/pseries: Cleanup on pci_dn_reconfig_notifier()
powerpc/pseries: Fix corrupted pdn list
powerpc/powernv: Enable LEDS support
powerpc/iommu: Set default DMA offset in dma_dev_setup
cxl: Remove racy attempt to force EEH invocation in reset
cxl: Release irqs if memory allocation fails
cxl: Remove use of macro DEFINE_PCI_DEVICE_TABLE
powerpc/powernv: Fix mis-merge of OPAL support for LEDS driver
powerpc/powernv: Reset HILE before kexec_sequence()
powerpc/kexec: Reset secondary cpu endianness before kexec
powerpc/hvsi: Fix endianness issues in the HVSI driver
leds/powernv: Add driver for PowerNV platform
powerpc/powernv: Create LED platform device
powerpc/powernv: Add OPAL interfaces for accessing and modifying system LED states
powerpc/powernv: Fix the log message when disabling VF
cxl: Allow release of contexts which have been OPENED but not STARTED
...
Pull locking and atomic updates from Ingo Molnar:
"Main changes in this cycle are:
- Extend atomic primitives with coherent logic op primitives
(atomic_{or,and,xor}()) and deprecate the old partial APIs
(atomic_{set,clear}_mask())
The old ops were incoherent with incompatible signatures across
architectures and with incomplete support. Now every architecture
supports the primitives consistently (by Peter Zijlstra)
- Generic support for 'relaxed atomics':
- _acquire/release/relaxed() flavours of xchg(), cmpxchg() and {add,sub}_return()
- atomic_read_acquire()
- atomic_set_release()
This came out of porting qwrlock code to arm64 (by Will Deacon)
- Clean up the fragile static_key APIs that were causing repeat bugs,
by introducing a new one:
DEFINE_STATIC_KEY_TRUE(name);
DEFINE_STATIC_KEY_FALSE(name);
which define a key of different types with an initial true/false
value.
Then allow:
static_branch_likely()
static_branch_unlikely()
to take a key of either type and emit the right instruction for the
case. To be able to know the 'type' of the static key we encode it
in the jump entry (by Peter Zijlstra)
- Static key self-tests (by Jason Baron)
- qrwlock optimizations (by Waiman Long)
- small futex enhancements (by Davidlohr Bueso)
- ... and misc other changes"
* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (63 commits)
jump_label/x86: Work around asm build bug on older/backported GCCs
locking, ARM, atomics: Define our SMP atomics in terms of _relaxed() operations
locking, include/llist: Use linux/atomic.h instead of asm/cmpxchg.h
locking/qrwlock: Make use of _{acquire|release|relaxed}() atomics
locking/qrwlock: Implement queue_write_unlock() using smp_store_release()
locking/lockref: Remove homebrew cmpxchg64_relaxed() macro definition
locking, asm-generic: Add _{relaxed|acquire|release}() variants for 'atomic_long_t'
locking, asm-generic: Rework atomic-long.h to avoid bulk code duplication
locking/atomics: Add _{acquire|release|relaxed}() variants of some atomic operations
locking, compiler.h: Cast away attributes in the WRITE_ONCE() magic
locking/static_keys: Make verify_keys() static
jump label, locking/static_keys: Update docs
locking/static_keys: Provide a selftest
jump_label: Provide a self-test
s390/uaccess, locking/static_keys: employ static_branch_likely()
x86, tsc, locking/static_keys: Employ static_branch_likely()
locking/static_keys: Add selftest
locking/static_keys: Add a new static_key interface
locking/static_keys: Rework update logic
locking/static_keys: Add static_key_{en,dis}able() helpers
...
The size of the Problem State Priority Boost Register is only
32 bits, but the kvm_vcpu_arch->pspb variable is declared as
"ulong", ie. 64-bit. However, the assembler code accesses this
variable with 32-bit accesses, and the KVM_REG_PPC_PSPB macro
is defined with SIZE_U32, too, so that the current code is
broken on big endian hosts: kvmppc_get_one_reg_hv() will only
return zero for this register since it is using the wrong half
of the pspb variable. Let's fix this problem by adjusting the
size of the pspb field in the kvm_vcpu_arch structure.
Signed-off-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
The code that handles the case when we receive a H_DOORBELL interrupt
has a comment which says "Hypervisor doorbell - exit only if host IPI
flag set". However, the current code does not actually check if the
host IPI flag is set. This is due to a comparison instruction that
got missed.
As a result, the current code performs the exit to host only
if some sibling thread or a sibling sub-core is exiting to the
host. This implies that, an IPI sent to a sibling core in
(subcores-per-core != 1) mode will be missed by the host unless the
sibling core is on the exit path to the host.
This patch adds the missing comparison operation which will ensure
that when HOST_IPI flag is set, we unconditionally exit to the host.
Fixes: 66feed61cd
Cc: stable@vger.kernel.org # v4.1+
Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
The current dynamic micro-threading code has a race due to which a
secondary thread naps when it is supposed to be running a vcpu. As a
side effect of this, on a guest exit, the primary thread in
kvmppc_wait_for_nap() finds that this secondary thread hasn't cleared
its vcore pointer. This results in "CPU X seems to be stuck!"
warnings.
The race is possible since the primary thread on exiting the guests
only waits for all the secondaries to clear its vcore pointer. It
subsequently expects the secondary threads to enter nap while it
unsplits the core. A secondary thread which hasn't yet entered the nap
will loop in kvm_no_guest until its vcore pointer and the do_nap flag
are unset. Once the core has been unsplit, a new vcpu thread can grab
the core and set the do_nap flag *before* setting the vcore pointers
of the secondary. As a result, the secondary thread will now enter nap
via kvm_unsplit_nap instead of running the guest vcpu.
Fix this by setting the do_nap flag after setting the vcore pointer in
the PACA of the secondary in kvmppc_run_core. Also, ensure that a
secondary thread doesn't nap in kvm_unsplit_nap when the vcore pointer
in its PACA struct is set.
Fixes: b4deba5c41
Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Pull core block updates from Jens Axboe:
"This first core part of the block IO changes contains:
- Cleanup of the bio IO error signaling from Christoph. We used to
rely on the uptodate bit and passing around of an error, now we
store the error in the bio itself.
- Improvement of the above from myself, by shrinking the bio size
down again to fit in two cachelines on x86-64.
- Revert of the max_hw_sectors cap removal from a revision again,
from Jeff Moyer. This caused performance regressions in various
tests. Reinstate the limit, bump it to a more reasonable size
instead.
- Make /sys/block/<dev>/queue/discard_max_bytes writeable, by me.
Most devices have huge trim limits, which can cause nasty latencies
when deleting files. Enable the admin to configure the size down.
We will look into having a more sane default instead of UINT_MAX
sectors.
- Improvement of the SGP gaps logic from Keith Busch.
- Enable the block core to handle arbitrarily sized bios, which
enables a nice simplification of bio_add_page() (which is an IO hot
path). From Kent.
- Improvements to the partition io stats accounting, making it
faster. From Ming Lei.
- Also from Ming Lei, a basic fixup for overflow of the sysfs pending
file in blk-mq, as well as a fix for a blk-mq timeout race
condition.
- Ming Lin has been carrying Kents above mentioned patches forward
for a while, and testing them. Ming also did a few fixes around
that.
- Sasha Levin found and fixed a use-after-free problem introduced by
the bio->bi_error changes from Christoph.
- Small blk cgroup cleanup from Viresh Kumar"
* 'for-4.3/core' of git://git.kernel.dk/linux-block: (26 commits)
blk: Fix bio_io_vec index when checking bvec gaps
block: Replace SG_GAPS with new queue limits mask
block: bump BLK_DEF_MAX_SECTORS to 2560
Revert "block: remove artifical max_hw_sectors cap"
blk-mq: fix race between timeout and freeing request
blk-mq: fix buffer overflow when reading sysfs file of 'pending'
Documentation: update notes in biovecs about arbitrarily sized bios
block: remove bio_get_nr_vecs()
fs: use helper bio_add_page() instead of open coding on bi_io_vec
block: kill merge_bvec_fn() completely
md/raid5: get rid of bio_fits_rdev()
md/raid5: split bio for chunk_aligned_read
block: remove split code in blkdev_issue_{discard,write_same}
btrfs: remove bio splitting and merge_bvec_fn() calls
bcache: remove driver private bio splitting code
block: simplify bio_add_page()
block: make generic_make_request handle arbitrarily sized bios
blk-cgroup: Drop unlikely before IS_ERR(_OR_NULL)
block: don't access bio->bi_error after bio_put()
block: shrink struct bio down to 2 cache lines again
...
- ACPICA update to upstream revision 20150818 including method
tracing extensions to allow more in-depth AML debugging in the
kernel and a number of assorted fixes and cleanups (Bob Moore,
Lv Zheng, Markus Elfring).
- ACPI sysfs code updates and a documentation update related to
AML method tracing (Lv Zheng).
- ACPI EC driver fix related to serialized evaluations of _Qxx
methods and ACPI tools updates allowing the EC userspace tool
to be built from the kernel source (Lv Zheng).
- ACPI processor driver updates preparing it for future
introduction of CPPC support and ACPI PCC mailbox driver
updates (Ashwin Chaugule).
- ACPI interrupts enumeration fix for a regression related
to the handling of IRQ attribute conflicts between MADT
and the ACPI namespace (Jiang Liu).
- Fixes related to ACPI device PM (Mika Westerberg, Srinidhi Kasagar).
- ACPI device registration code reorganization to separate the
sysfs-related code and bus type operations from the rest (Rafael
J Wysocki).
- Assorted cleanups in the ACPI core (Jarkko Nikula, Mathias Krause,
Andy Shevchenko, Rafael J Wysocki, Nicolas Iooss).
- ACPI cpufreq driver and ia64 cpufreq driver fixes and cleanups
(Pan Xinhui, Rafael J Wysocki).
- cpufreq core cleanups on top of the previous changes allowing it
to preseve its sysfs directories over system suspend/resume (Viresh
Kumar, Rafael J Wysocki, Sebastian Andrzej Siewior).
- cpufreq fixes and cleanups related to governors (Viresh Kumar).
- cpufreq updates (core and the cpufreq-dt driver) related to the
turbo/boost mode support (Viresh Kumar, Bartlomiej Zolnierkiewicz).
- New DT bindings for Operating Performance Points (OPP), support
for them in the OPP framework and in the cpufreq-dt driver plus
related OPP framework fixes and cleanups (Viresh Kumar).
- cpufreq powernv driver updates (Shilpasri G Bhat).
- New cpufreq driver for Mediatek MT8173 (Pi-Cheng Chen).
- Assorted cpufreq driver (speedstep-lib, sfi, integrator) cleanups
and fixes (Abhilash Jindal, Andrzej Hajda, Cristian Ardelean).
- intel_pstate driver updates including Skylake-S support, support
for enabling HW P-states per CPU and an additional vendor bypass
list entry (Kristen Carlson Accardi, Chen Yu, Ethan Zhao).
- cpuidle core fixes related to the handling of coupled idle states
(Xunlei Pang).
- intel_idle driver updates including Skylake Client support and
support for freeze-mode-specific idle states (Len Brown).
- Driver core updates related to power management (Andy Shevchenko,
Rafael J Wysocki).
- Generic power domains framework fixes and cleanups (Jon Hunter,
Geert Uytterhoeven, Rajendra Nayak, Ulf Hansson).
- Device PM QoS framework update to allow the latency tolerance
setting to be exposed to user space via sysfs (Mika Westerberg).
- devfreq support for PPMUv2 in Exynos5433 and a fix for an incorrect
exynos-ppmu DT binding (Chanwoo Choi, Javier Martinez Canillas).
- System sleep support updates (Alan Stern, Len Brown, SungEun Kim).
- rockchip-io AVS support updates (Heiko Stuebner).
- PM core clocks support fixup (Colin Ian King).
- Power capping RAPL driver update including support for Skylake H/S
and Broadwell-H (Radivoje Jovanovic, Seiichi Ikarashi).
- Generic device properties framework fixes related to the handling
of static (driver-provided) property sets (Andy Shevchenko).
- turbostat and cpupower updates (Len Brown, Shilpasri G Bhat,
Shreyas B Prabhu).
/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABCAAGBQJV5hhGAAoJEILEb/54YlRxs+EQAK51iFk48+IbpHYaZZ50Yo4m
ZZc2zBcbwRcBlU9vKERrhG+jieSl8J/JJNxT8vBjKqyvNw038mCjewQh02ol0HuC
R7nlDiVJkmZ50sLO4xwE/1UBZr/XqbddwCUnYzvFMkMTA0ePzFtf8BrJ1FXpT8S/
fkwSXQty6hvJDwxkfrbMSaA730wMju9lahx8D6MlmUAedWYZOJDMQKB4WKa/St5X
9uckBPHUBB2KiKlXxdbFPwKLNxHvLROq5SpDLc6cM/7XZB+QfNFy85CUjCUtYo1O
1W8k0qnztvZ6UEv27qz5dejGyAGOarMWGGNsmL9evoeGeHRpQL+dom7HcTnbAfUZ
walyhYSm/zKkdy7Vl3xWUUQkMG48+PviMI6K0YhHXb3Rm5wlR/yBNZTwNIty9SX/
fKCHEa8QynWwLxgm53c3xRkiitJxMsHNK03moLD9zQMjshTyTNvpNbZoahyKQzk6
H+9M1DBRHhkkREDWSwGutukxfEMtWe2vcZcyERrFiY7l5k1j58DwDBMPqjPhRv6q
P/1NlCzr0XYf83Y86J18LbDuPGDhTjjIEn6CqbtI2mmWqTg3+rF7zvS2ux+FzMnA
gisv8l6GT9JiWhxKFqqL/rrVpwtyHebWLYE/RpNUW6fEzLziRNj1qyYO9dqI/GGi
I3rfxlXoc/5xJWCgNB8f
=fTgI
-----END PGP SIGNATURE-----
Merge tag 'pm+acpi-4.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management and ACPI updates from Rafael Wysocki:
"From the number of commits perspective, the biggest items are ACPICA
and cpufreq changes with the latter taking the lead (over 50 commits).
On the cpufreq front, there are many cleanups and minor fixes in the
core and governors, driver updates etc. We also have a new cpufreq
driver for Mediatek MT8173 chips.
ACPICA mostly updates its debug infrastructure and adds a number of
fixes and cleanups for a good measure.
The Operating Performance Points (OPP) framework is updated with new
DT bindings and support for them among other things.
We have a few updates of the generic power domains framework and a
reorganization of the ACPI device enumeration code and bus type
operations.
And a lot of fixes and cleanups all over.
Included is one branch from the MFD tree as it contains some
PM-related driver core and ACPI PM changes a few other commits are
based on.
Specifics:
- ACPICA update to upstream revision 20150818 including method
tracing extensions to allow more in-depth AML debugging in the
kernel and a number of assorted fixes and cleanups (Bob Moore, Lv
Zheng, Markus Elfring).
- ACPI sysfs code updates and a documentation update related to AML
method tracing (Lv Zheng).
- ACPI EC driver fix related to serialized evaluations of _Qxx
methods and ACPI tools updates allowing the EC userspace tool to be
built from the kernel source (Lv Zheng).
- ACPI processor driver updates preparing it for future introduction
of CPPC support and ACPI PCC mailbox driver updates (Ashwin
Chaugule).
- ACPI interrupts enumeration fix for a regression related to the
handling of IRQ attribute conflicts between MADT and the ACPI
namespace (Jiang Liu).
- Fixes related to ACPI device PM (Mika Westerberg, Srinidhi
Kasagar).
- ACPI device registration code reorganization to separate the
sysfs-related code and bus type operations from the rest (Rafael J
Wysocki).
- Assorted cleanups in the ACPI core (Jarkko Nikula, Mathias Krause,
Andy Shevchenko, Rafael J Wysocki, Nicolas Iooss).
- ACPI cpufreq driver and ia64 cpufreq driver fixes and cleanups (Pan
Xinhui, Rafael J Wysocki).
- cpufreq core cleanups on top of the previous changes allowing it to
preseve its sysfs directories over system suspend/resume (Viresh
Kumar, Rafael J Wysocki, Sebastian Andrzej Siewior).
- cpufreq fixes and cleanups related to governors (Viresh Kumar).
- cpufreq updates (core and the cpufreq-dt driver) related to the
turbo/boost mode support (Viresh Kumar, Bartlomiej Zolnierkiewicz).
- New DT bindings for Operating Performance Points (OPP), support for
them in the OPP framework and in the cpufreq-dt driver plus related
OPP framework fixes and cleanups (Viresh Kumar).
- cpufreq powernv driver updates (Shilpasri G Bhat).
- New cpufreq driver for Mediatek MT8173 (Pi-Cheng Chen).
- Assorted cpufreq driver (speedstep-lib, sfi, integrator) cleanups
and fixes (Abhilash Jindal, Andrzej Hajda, Cristian Ardelean).
- intel_pstate driver updates including Skylake-S support, support
for enabling HW P-states per CPU and an additional vendor bypass
list entry (Kristen Carlson Accardi, Chen Yu, Ethan Zhao).
- cpuidle core fixes related to the handling of coupled idle states
(Xunlei Pang).
- intel_idle driver updates including Skylake Client support and
support for freeze-mode-specific idle states (Len Brown).
- Driver core updates related to power management (Andy Shevchenko,
Rafael J Wysocki).
- Generic power domains framework fixes and cleanups (Jon Hunter,
Geert Uytterhoeven, Rajendra Nayak, Ulf Hansson).
- Device PM QoS framework update to allow the latency tolerance
setting to be exposed to user space via sysfs (Mika Westerberg).
- devfreq support for PPMUv2 in Exynos5433 and a fix for an incorrect
exynos-ppmu DT binding (Chanwoo Choi, Javier Martinez Canillas).
- System sleep support updates (Alan Stern, Len Brown, SungEun Kim).
- rockchip-io AVS support updates (Heiko Stuebner).
- PM core clocks support fixup (Colin Ian King).
- Power capping RAPL driver update including support for Skylake H/S
and Broadwell-H (Radivoje Jovanovic, Seiichi Ikarashi).
- Generic device properties framework fixes related to the handling
of static (driver-provided) property sets (Andy Shevchenko).
- turbostat and cpupower updates (Len Brown, Shilpasri G Bhat,
Shreyas B Prabhu)"
* tag 'pm+acpi-4.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (180 commits)
cpufreq: speedstep-lib: Use monotonic clock
cpufreq: powernv: Increase the verbosity of OCC console messages
cpufreq: sfi: use kmemdup rather than duplicating its implementation
cpufreq: drop !cpufreq_driver check from cpufreq_parse_governor()
cpufreq: rename cpufreq_real_policy as cpufreq_user_policy
cpufreq: remove redundant 'policy' field from user_policy
cpufreq: remove redundant 'governor' field from user_policy
cpufreq: update user_policy.* on success
cpufreq: use memcpy() to copy policy
cpufreq: remove redundant CPUFREQ_INCOMPATIBLE notifier event
cpufreq: mediatek: Add MT8173 cpufreq driver
dt-bindings: mediatek: Add MT8173 CPU DVFS clock bindings
PM / Domains: Fix typo in description of genpd_dev_pm_detach()
PM / Domains: Remove unusable governor dummies
PM / Domains: Make pm_genpd_init() available to modules
PM / domains: Align column headers and data in pm_genpd_summary output
powercap / RAPL: disable the 2nd power limit properly
tools: cpupower: Fix error when running cpupower monitor
PM / OPP: Drop unlikely before IS_ERR(_OR_NULL)
PM / OPP: Fix static checker warning (broken 64bit big endian systems)
...
Pull irq updates from Thomas Gleixner:
"This updated pull request does not contain the last few GIC related
patches which were reported to cause a regression. There is a fix
available, but I let it breed for a couple of days first.
The irq departement provides:
- new infrastructure to support non PCI based MSI interrupts
- a couple of new irq chip drivers
- the usual pile of fixlets and updates to irq chip drivers
- preparatory changes for removal of the irq argument from interrupt
flow handlers
- preparatory changes to remove IRQF_VALID"
* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (129 commits)
irqchip/imx-gpcv2: IMX GPCv2 driver for wakeup sources
irqchip: Add bcm2836 interrupt controller for Raspberry Pi 2
irqchip: Add documentation for the bcm2836 interrupt controller
irqchip/bcm2835: Add support for being used as a second level controller
irqchip/bcm2835: Refactor handle_IRQ() calls out of MAKE_HWIRQ
PCI: xilinx: Fix typo in function name
irqchip/gic: Ensure gic_cpu_if_up/down() programs correct GIC instance
irqchip/gic: Only allow the primary GIC to set the CPU map
PCI/MSI: pci-xgene-msi: Consolidate chained IRQ handler install/remove
unicore32/irq: Prepare puv3_gpio_handler for irq argument removal
tile/pci_gx: Prepare trio_handle_level_irq for irq argument removal
m68k/irq: Prepare irq handlers for irq argument removal
C6X/megamode-pic: Prepare megamod_irq_cascade for irq argument removal
blackfin: Prepare irq handlers for irq argument removal
arc/irq: Prepare idu_cascade_isr for irq argument removal
sparc/irq: Use access helper irq_data_get_affinity_mask()
sparc/irq: Use helper irq_data_get_irq_handler_data()
parisc/irq: Use access helper irq_data_get_affinity_mask()
mn10300/irq: Use access helper irq_data_get_affinity_mask()
irqchip/i8259: Prepare i8259_irq_dispatch for irq argument removal
...
Pull timer updates from Thomas Gleixner:
"Rather large, but nothing exiting:
- new range check for settimeofday() to prevent that boot time
becomes negative.
- fix for file time rounding
- a few simplifications of the hrtimer code
- fix for the proc/timerlist code so the output of clock realtime
timers is accurate
- more y2038 work
- tree wide conversion of clockevent drivers to the new callbacks"
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (88 commits)
hrtimer: Handle failure of tick_init_highres() gracefully
hrtimer: Unconfuse switch_hrtimer_base() a bit
hrtimer: Simplify get_target_base() by returning current base
hrtimer: Drop return code of hrtimer_switch_to_hres()
time: Introduce timespec64_to_jiffies()/jiffies_to_timespec64()
time: Introduce current_kernel_time64()
time: Introduce struct itimerspec64
time: Add the common weak version of update_persistent_clock()
time: Always make sure wall_to_monotonic isn't positive
time: Fix nanosecond file time rounding in timespec_trunc()
timer_list: Add the base offset so remaining nsecs are accurate for non monotonic timers
cris/time: Migrate to new 'set-state' interface
kernel: broadcast-hrtimer: Migrate to new 'set-state' interface
xtensa/time: Migrate to new 'set-state' interface
unicore/time: Migrate to new 'set-state' interface
um/time: Migrate to new 'set-state' interface
sparc/time: Migrate to new 'set-state' interface
sh/localtimer: Migrate to new 'set-state' interface
score/time: Migrate to new 'set-state' interface
s390/time: Migrate to new 'set-state' interface
...
Pull x86 mm updates from Ingo Molnar:
"The dominant change in this cycle was the continued work to isolate
kernel drivers from MTRR legacies: this tree gets rid of all kernel
internal driver interfaces to MTRRs (mostly by rewriting it to proper
PAT interfaces), the only access left is the /proc/mtrr ABI.
This work was done by Luis R Rodriguez.
There's also some related PCI interface additions for which I've
Cc:-ed Bjorn"
* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits)
x86/mm/mtrr: Remove kernel internal MTRR interfaces: unexport mtrr_add() and mtrr_del()
s390/io: Add pci_iomap_wc() and pci_iomap_wc_range()
drivers/dma/iop-adma: Use dma_alloc_writecombine() kernel-style
drivers/video/fbdev/vt8623fb: Use arch_phys_wc_add() and pci_iomap_wc()
drivers/video/fbdev/s3fb: Use arch_phys_wc_add() and pci_iomap_wc()
drivers/video/fbdev/arkfb.c: Use arch_phys_wc_add() and pci_iomap_wc()
PCI: Add pci_iomap_wc() variants
drivers/video/fbdev/gxt4500: Use pci_ioremap_wc_bar() to map framebuffer
drivers/video/fbdev/kyrofb: Use arch_phys_wc_add() and pci_ioremap_wc_bar()
drivers/video/fbdev/i740fb: Use arch_phys_wc_add() and pci_ioremap_wc_bar()
PCI: Add pci_ioremap_wc_bar()
x86/mm: Make kernel/check.c explicitly non-modular
x86/mm/pat: Make mm/pageattr[-test].c explicitly non-modular
x86/mm/pat: Add comments to cachemode translation tables
arch/*/io.h: Add ioremap_uc() to all architectures
drivers/video/fbdev/atyfb: Use arch_phys_wc_add() and ioremap_wc()
drivers/video/fbdev/atyfb: Replace MTRR UC hole with strong UC
drivers/video/fbdev/atyfb: Clarify ioremap() base and length used
drivers/video/fbdev/atyfb: Carve out framebuffer length fudging into a helper
x86/mm, asm-generic: Add IOMMU ioremap_uc() variant default
...
* pm-cpufreq: (53 commits)
cpufreq: speedstep-lib: Use monotonic clock
cpufreq: powernv: Increase the verbosity of OCC console messages
cpufreq: sfi: use kmemdup rather than duplicating its implementation
cpufreq: drop !cpufreq_driver check from cpufreq_parse_governor()
cpufreq: rename cpufreq_real_policy as cpufreq_user_policy
cpufreq: remove redundant 'policy' field from user_policy
cpufreq: remove redundant 'governor' field from user_policy
cpufreq: update user_policy.* on success
cpufreq: use memcpy() to copy policy
cpufreq: remove redundant CPUFREQ_INCOMPATIBLE notifier event
cpufreq: mediatek: Add MT8173 cpufreq driver
dt-bindings: mediatek: Add MT8173 CPU DVFS clock bindings
intel_pstate: append more Oracle OEM table id to vendor bypass list
intel_pstate: Add SKY-S support
intel_pstate: Fix possible overflow complained by Coverity
cpufreq: Correct a freq check in cpufreq_set_policy()
cpufreq: Lock CPU online/offline in cpufreq_register_driver()
cpufreq: Replace recover_policy with new_policy in cpufreq_online()
cpufreq: Separate CPU device registration from CPU online
cpufreq: powernv: Restore cpu frequency to policy->cur on unthrottling
...
Pull scheduler updates from Ingo Molnar:
"The biggest change in this cycle is the rewrite of the main SMP load
balancing metric: the CPU load/utilization. The main goal was to make
the metric more precise and more representative - see the changelog of
this commit for the gory details:
9d89c257df ("sched/fair: Rewrite runnable load and utilization average tracking")
It is done in a way that significantly reduces complexity of the code:
5 files changed, 249 insertions(+), 494 deletions(-)
and the performance testing results are encouraging. Nevertheless we
need to keep an eye on potential regressions, since this potentially
affects every SMP workload in existence.
This work comes from Yuyang Du.
Other changes:
- SCHED_DL updates. (Andrea Parri)
- Simplify architecture callbacks by removing finish_arch_switch().
(Peter Zijlstra et al)
- cputime accounting: guarantee stime + utime == rtime. (Peter
Zijlstra)
- optimize idle CPU wakeups some more - inspired by Facebook server
loads. (Mike Galbraith)
- stop_machine fixes and updates. (Oleg Nesterov)
- Introduce the 'trace_sched_waking' tracepoint. (Peter Zijlstra)
- sched/numa tweaks. (Srikar Dronamraju)
- misc fixes and small cleanups"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (44 commits)
sched/deadline: Fix comment in enqueue_task_dl()
sched/deadline: Fix comment in push_dl_tasks()
sched: Change the sched_class::set_cpus_allowed() calling context
sched: Make sched_class::set_cpus_allowed() unconditional
sched: Fix a race between __kthread_bind() and sched_setaffinity()
sched: Ensure a task has a non-normalized vruntime when returning back to CFS
sched/numa: Fix NUMA_DIRECT topology identification
tile: Reorganize _switch_to()
sched, sparc32: Update scheduler comments in copy_thread()
sched: Remove finish_arch_switch()
sched, tile: Remove finish_arch_switch
sched, sh: Fold finish_arch_switch() into switch_to()
sched, score: Remove finish_arch_switch()
sched, avr32: Remove finish_arch_switch()
sched, MIPS: Get rid of finish_arch_switch()
sched, arm: Remove finish_arch_switch()
sched/fair: Clean up load average references
sched/fair: Provide runnable_load_avg back to cfs_rq
sched/fair: Remove task and group entity load when they are dead
sched/fair: Init cfs_rq's sched_entity load average
...
Pull RCU updates from Ingo Molnar:
"The main RCU changes in this cycle are:
- the combination of tree geometry-initialization simplifications and
OS-jitter-reduction changes to expedited grace periods. These two
are stacked due to the large number of conflicts that would
otherwise result.
- privatize smp_mb__after_unlock_lock().
This commit moves the definition of smp_mb__after_unlock_lock() to
kernel/rcu/tree.h, in recognition of the fact that RCU is the only
thing using this, that nothing else is likely to use it, and that
it is likely to go away completely.
- documentation updates.
- torture-test updates.
- misc fixes"
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (60 commits)
rcu,locking: Privatize smp_mb__after_unlock_lock()
rcu: Silence lockdep false positive for expedited grace periods
rcu: Don't disable CPU hotplug during OOM notifiers
scripts: Make checkpatch.pl warn on expedited RCU grace periods
rcu: Update MAINTAINERS entry
rcu: Clarify CONFIG_RCU_EQS_DEBUG help text
rcu: Fix backwards RCU_LOCKDEP_WARN() in synchronize_rcu_tasks()
rcu: Rename rcu_lockdep_assert() to RCU_LOCKDEP_WARN()
rcu: Make rcu_is_watching() really notrace
cpu: Wait for RCU grace periods concurrently
rcu: Create a synchronize_rcu_mult()
rcu: Fix obsolete priority-boosting comment
rcu: Use WRITE_ONCE in RCU_INIT_POINTER
rcu: Hide RCU_NOCB_CPU behind RCU_EXPERT
rcu: Add RCU-sched flavors of get-state and cond-sync
rcu: Add fastpath bypassing funnel locking
rcu: Rename RCU_GP_DONE_FQS to RCU_GP_DOING_FQS
rcu: Pull out wait_event*() condition into helper function
documentation: Describe new expedited stall warnings
rcu: Add stall warnings to synchronize_sched_expedited()
...
and the addition of new clock drivers. Stephen Boyd has also done a lot
of subsystem-wide driver clean-ups (thanks!). There are also fixes to
the framework core and changes to better split clock provider drivers
from clock consumer drivers.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJV5KelAAoJEKI6nJvDJaTUwaQP/RVb70v6XSgMIePuOq3iaECT
bclCAyito3YFwykrPPmQ1DucHvEjlWopeFwKqEE9VjNl07TVIH/OMGeonb9yErIY
aN+FMoA9RUGVexMhy004q5sSbOEihAqTgKWaOiYoY8zAfJfeTpYXUoy34FcrW7MB
j/cDDJgigtWe9zzcdrW04oT454lXQaSQuGX39tDCR0s0S3soYU2JyjkyBGiO5Yid
1yIMq/nzI8SrCwxwD/nFwQNtg7lqiAN291Nbi4At1vvG5r4RhNveuLGv8uJ50XRB
xwy0sdHLIVJrIJ8OUcs1sY8wxu7ghDS8u+vjTNO2RzBf3KZWbuXWX+yVM7JQi4Ty
0iL5hGbvERy5E9QSzzH+Ox2jVt5e/r/dyvRf3oBDPVrFXhKusYhn6JmdUVJkTZ83
GTw2sQdEpcmry4z/50/MaqpZuXVZ09VTOCTqp8ToseJjsz9jXxVhQ4HdAwLc8cmV
txWGRXuBxCB+2o8M0oky3IKS69VFFH5u6QQ0KG8+JYOrDDG7GcnJsFeV7mQjlu8g
3evYUILNAUfJGBpkOeLs654KUBHwUyXc87cUIKwjGaPruWb2048+kdCVrL3IFwPb
sS/7Qn3DQ90pHFUTssDnWLz3X0IWT3H0iV4zZyAqqdARugEo+mpykmXmMWcWc3VR
MrD1l3GVxLegEf242Zpo
=QAiQ
-----END PGP SIGNATURE-----
Merge tag 'clk-for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux
Pull clk updates from Michael Turquette:
"The clk framework changes for 4.3 are mostly updates to existing
drivers and the addition of new clock drivers. Stephen Boyd has also
done a lot of subsystem-wide driver clean-ups (thanks!). There are
also fixes to the framework core and changes to better split clock
provider drivers from clock consumer drivers"
* tag 'clk-for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux: (227 commits)
clk: s5pv210: add missing call to samsung_clk_of_add_provider()
clk: pistachio: correct critical clock list
clk: pistachio: Fix PLL rate calculation in integer mode
clk: pistachio: Fix override of clk-pll settings from boot loader
clk: pistachio: Fix 32bit integer overflows
clk: tegra: Fix some static checker problems
clk: qcom: Fix MSM8916 prng clock enable bit
clk: Add missing header for 'bool' definition to clk-conf.h
drivers/clk: appropriate __init annotation for const data
clk: rockchip: register pll mux before pll itself
clk: add bindings for the Ux500 clocks
clk/ARM: move Ux500 PRCC bases to the device tree
clk: remove duplicated code with __clk_set_parent_after
clk: Convert __clk_get_name(hw->clk) to clk_hw_get_name(hw)
clk: Constify clk_hw argument to provider APIs
clk: Hi6220: add stub clock driver
dt-bindings: clk: Hi6220: Document stub clock driver
dt-bindings: arm: Hi6220: add doc for SRAM controller
clk: atlas7: fix pll missed divide NR in fraction mode
clk: atlas7: fix bit field and its root clk for coresight_tpiu
...
A few core tweaks this time together with the usual collection of driver
specific updates and fixes plus a larger than average selection of new
device support:
- Fix DMA mapping of unaligned vmalloc() buffers.
- Statistics tracking transfer volumes exposed via sysfs.
- New drivers for Freescale MPC5125, Intel Sunrise Point, Mediatek SoCs,
and Netlogic XLP SoCs.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJV5F3EAAoJECTWi3JdVIfQ45kH/isvhUPuvaZS8zQggQ95xIjy
7uZMQzf2JHwmeFiYsqFuWdf6IVgGmg68cQ4ZIzyXk1quEJkOLNCrlOvn4Qv7z6lJ
fqr5qbU+BWrtv0IjN5szIx7BR+ZWQ9LHdZI651A1XbCPvU8H1PZL+PfWDK6pMHCM
tqCLUVgnwRYNT0ELmgl7Up8SWKMbs98Wit2dQzNvwrfEZ1GHTMk3ChYAABfMkKIU
wbuug6+cv5+VdhV73nX29BZXJj7+EyRATQknADQT3nugcXkMysi0IvLZIrSFEJTp
iBApQsO3EaOVUR5/hN6zHpS8lqZdEkQvlKRuodnW31dXLrIw7TyVDT1wWknRZeU=
=Jykk
-----END PGP SIGNATURE-----
Merge tag 'spi-v4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi
Pull spi updates from Mark Brown:
"A few core tweaks this time together with the usual collection of
driver specific updates and fixes plus a larger than average selection
of new device support:
- fix DMA mapping of unaligned vmalloc() buffers
- statistics tracking transfer volumes exposed via sysfs
- new drivers for Freescale MPC5125, Intel Sunrise Point, Mediatek
SoCs, and Netlogic XLP SoCs"
* tag 'spi-v4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi: (66 commits)
spi: sh-msiof: Fix FIFO size to 64 word from 256 word
spi: fsl-(e)spi: Fix checking return value of devm_ioremap_resource
spi: Add DT bindings documentation for Netlogic XLP SPI controller
spi/xlp: SPI controller driver for Netlogic XLP SoCs
spi: fsl-espi: add runtime PM
spi: fsl-(e)spi: simplify cleanup code
spi: fsl-(e)spi: migrate to using devm_ functions to simplify cleanup
spi: mediatek: fix SPI_CMD_PAUSE_IE macro error
spi: check bits_per_word in spi_setup
spi: mediatek: replace *_time name
spi: mediatek: add PM clk_prepare_enable fail flow
spi: mediatek: replace int with u32, delete TAB and define MTK_SPI_PAUSE_INT_STATUS marco
spi: mediatek: add linux/io.h include file
spi/bcm63xx-hsspi: add support for dual spi read/write
spi: dw: Allow interface drivers to limit data I/O to word sizes
dt: snps,dw-apb-ssi: Document new I/O data register width property
spi: Fall back to master maximum speed if no slave speed specified
spi: mediatek: use BIT() to instead of SPI_CMD_*_OFFSET
spi: medaitek: revise quirks compatibility style
spi: mediatek: fix spi incorrect endian usage
...
The config space of some PCI devices can't be accessed when their
PEs are in frozen state. Otherwise, fenced PHB might be seen.
Those PEs are identified with flag EEH_PE_CFG_RESTRICTED, meaing
EEH_PE_CFG_BLOCKED is set automatically when the PE is put to
frozen state (EEH_PE_ISOLATED). eeh_slot_error_detail() restores
PCI device BARs with eeh_pe_restore_bars(), which then calls
eeh_ops->restore_config() to reinitialize the PCI device in
(OPAL) firmware. eeh_ops->restore_config() produces PCI config
access that causes fenced PHB. The problem was reported on below
adapter:
0001:01:00.0 0200: 14e4:168e (rev 10)
0001:01:00.0 Ethernet controller: Broadcom Corporation \
NetXtreme II BCM57810 10 Gigabit Ethernet (rev 10)
This fixes the issue by skipping eeh_pe_restore_bars() in
eeh_slot_error_detail() when EEH_PE_CFG_BLOCKED is set for the PE.
Fixes: b6541db1 ("powerpc/eeh: Block PCI config access upon frozen PE")
Cc: stable@vger.kernel.org # v4.0+
Reported-by: Manvanthara B. Puttashankar <mputtash@in.ibm.com>
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This applies cleanup on pci_dn_reconfig_notifier(), no functional
changes:
* Rename variable "pci" to "pdn" to indicate its purpose clearly.
* The parent node can be released at any time. So it should be
hold with of_get_parent() before accessing it.
* The device node doesn't have to have parent node in theory.
More check on this.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Commit cca87d30 ("powerpc/pci: Refactor pci_dn") introduced pdn
list for SRIOV VFs. It means the pdn is be put into the child list
of its parent pdn when the pdn is created. When doing PCI hot
unplugging on pSeries, the PCI device node as well as its pdn are
released through procfs entry "powerpc/ofdt". Some one else grabs
the memory chunk of the pdn and update it accordingly. At the same
time, the pdn is still tracked in the child list of parent pdn. It
leads to corrupted child list in the parent pdn.
This fixes above issue by removing the pdn from the child list of
its parent pdn when the device node is detached from the system.
Note the pdn is free'd when the device node is released if the
device node is dynamic one. Otherwise, the device node as well
as the pdn won't be released.
Fixes: cca87d30 ("powerpc/pci: Refactor pci_dn")
Cc: stable@vger.kernel.org # 4.1+
Reported-by: Santwana Samantray <santwana.samantray@in.ibm.com>
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
While pmem is usable as a block device or via DAX mappings to userspace
there are several usage scenarios that can not target pmem due to its
lack of struct page coverage. In preparation for "hot plugging" pmem
into the vmemmap add ZONE_DEVICE as a new zone to tag these pages
separately from the ones that are subject to standard page allocations.
Importantly "device memory" can be removed at will by userspace
unbinding the driver of the device.
Having a separate zone prevents allocation and otherwise marks these
pages that are distinct from typical uniform memory. Device memory has
different lifetime and performance characteristics than RAM. However,
since we have run out of ZONES_SHIFT bits this functionality currently
depends on sacrificing ZONE_DMA.
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Jerome Glisse <j.glisse@gmail.com>
[hch: various simplifications in the arch interface]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
None of the implementations currently use it. The common
bdev_direct_access() entry point handles all the size checks before
calling ->direct_access().
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Freescale updates from Scott:
"Highlights include 32-bit memcpy/memset optimizations, checksum
optimizations, 85xx config fragments and updates, device tree updates,
e6500 fixes for non-SMP, and misc cleanup and minor fixes."
Commit 84ad6e5c added LEDS support for PowerNV platform. Lets
update ppc64_defconfig to pick LEDS driver.
PowerNV LEDS driver looks for "/ibm,opal/leds" node in device
tree and loads if this node exists. Hence added it as 'm'.
Also note that powernv LEDS driver needs NEW_LEDS and LEDS_CLASS
as well. Hence added them to config file.
mpe: Also add them to pseries_defconfig, which is currently also used
for powernv systems.
Suggested-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Cc: Stewart Smith <stewart@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Commit e91c25111a "powerpc/iommu: Cleanup setting of DMA base/offset"
expects that the default DMA offset is set from pnv_ioda_setup_bus_dma()
which is correct unless it is SRIOV where the code flow is different -
at the moment when pnv_ioda_setup_bus_dma() is called, PCI devices for
VFs are not created yet.
This adds missing set_dma_offset() to pnv_pci_ioda_dma_dev_setup() to
cover the case of SRIOV.
Note that we still need set_dma_offset() in pnv_ioda_setup_bus_dma() as
at the boot time pnv_pci_ioda_dma_dev_setup() is called when no PE was
created yet, this happens at the PHB fixup stage.
Fixes: e91c25111a ("powerpc/iommu: Cleanup setting of DMA base/offset")
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Since commit 1851617cd2 ("PCI/MSI: Disable MSI at enumeration even if
kernel doesn't support MSI"), the setup of dev->msi_cap/msix_cap and the
disable of MSI/MSI-X interrupts isn't being done at PCI probe time, as
the logic responsible for this was moved in the aforementioned commit
from pci_device_add() to pci_setup_device(). The latter function is not
reachable on PowerPC pseries platform during Open Firmware PCI probing
time.
This exhibits as drivers not being able to enable MSI, eg:
bnx2x 0000:01:00.0: no msix capability found
This patch calls pci_msi_setup_pci_dev() explicitly to disable MSI/MSI-X
during PCI probe time on pSeries platform.
Fixes: 1851617cd2 ("PCI/MSI: Disable MSI at enumeration even if kernel doesn't support MSI")
[mpe: Flesh out change log and clarify comment]
Signed-off-by: Guilherme G. Piccoli <gpiccoli@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Pull RCU cleanup from Paul E. McKenney:
"Privatize smp_mb__after_unlock_lock(). This commit moves the
definition of smp_mb__after_unlock_lock() to kernel/rcu/tree.h,
in recognition of the fact that RCU is the only thing using
this, that nothing else is likely to use it, and that it is
likely to go away completely."
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When I merged the OPAL support for the powernv LEDS driver I missed a
hunk.
This is slightly modified from the original patch, as the original added
code to opal-api.h which is not in the skiboot version, which is
discouraged.
Instead those values are moved into the driver, which is the only place
they are used.
Fixes: 8a8d91817a ("powerpc/powernv: Add OPAL interfaces for accessing and modifying system LED states")
Reviewed-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
In 64 bit kernels, the Fixed Point Exception Register (XER) is a 64
bit field (e.g. in kvm_regs and kvm_vcpu_arch) and in most places it is
accessed as such.
This patch corrects places where it is accessed as a 32 bit field by a
64 bit kernel. In some cases this is via a 32 bit load or store
instruction which, depending on endianness, will cause either the
lower or upper 32 bits to be missed. In another case it is cast as a
u32, causing the upper 32 bits to be cleared.
This patch corrects those places by extending the access methods to
64 bits.
Signed-off-by: Sam Bobroff <sam.bobroff@au1.ibm.com>
Reviewed-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Tested-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Whenever a vcore state is VCORE_PREEMPT we need to be counting stolen
time for it. This currently isn't the case when we have a vcore that
no longer has any runnable threads in it but still has a runner task,
so we do an explicit call to kvmppc_core_start_stolen() in that case.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
When a vcore gets preempted, we put it on the preempted vcore list for
the current CPU. The runner task then calls schedule() and comes back
some time later and takes itself off the list. We need to be careful
to lock the list that it was put onto, which may not be the list for the
current CPU since the runner task may have moved to another CPU.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
This adds implementations for the H_CLEAR_REF (test and clear reference
bit) and H_CLEAR_MOD (test and clear changed bit) hypercalls.
When clearing the reference or change bit in the guest view of the HPTE,
we also have to clear it in the real HPTE so that we can detect future
references or changes. When we do so, we transfer the R or C bit value
to the rmap entry for the underlying host page so that kvm_age_hva_hv(),
kvm_test_age_hva_hv() and kvmppc_hv_get_dirty_log() know that the page
has been referenced and/or changed.
These hypercalls are not used by Linux guests. These implementations
have been tested using a FreeBSD guest.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
This fixes a bug in the tracking of pages that get modified by the
guest. If the guest creates a large-page HPTE, writes to memory
somewhere within the large page, and then removes the HPTE, we only
record the modified state for the first normal page within the large
page, when in fact the guest might have modified some other normal
page within the large page.
To fix this we use some unused bits in the rmap entry to record the
order (log base 2) of the size of the page that was modified, when
removing an HPTE. Then in kvm_test_clear_dirty_npages() we use that
order to return the correct number of modified pages.
The same thing could in principle happen when removing a HPTE at the
host's request, i.e. when paging out a page, except that we never
page out large pages, and the guest can only create large-page HPTEs
if the guest RAM is backed by large pages. However, we also fix
this case for the sake of future-proofing.
The reference bit is also subject to the same loss of information. We
don't make the same fix here for the reference bit because there isn't
an interface for userspace to find out which pages the guest has
referenced, whereas there is one for userspace to find out which pages
the guest has modified. Because of this loss of information, the
kvm_age_hva_hv() and kvm_test_age_hva_hv() functions might incorrectly
say that a page has not been referenced when it has, but that doesn't
matter greatly because we never page or swap out large pages.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
The reference (R) and change (C) bits in a HPT entry can be set by
hardware at any time up until the HPTE is invalidated and the TLB
invalidation sequence has completed. This means that when removing
a HPTE, we need to read the HPTE after the invalidation sequence has
completed in order to obtain reliable values of R and C. The code
in kvmppc_do_h_remove() used to do this. However, commit 6f22bd3265
("KVM: PPC: Book3S HV: Make HTAB code LE host aware") removed the
read after invalidation as a side effect of other changes. This
restores the read of the HPTE after invalidation.
The user-visible effect of this bug would be that when migrating a
guest, there is a small probability that a page modified by the guest
and then unmapped by the guest might not get re-transmitted and thus
the destination might end up with a stale copy of the page.
Fixes: 6f22bd3265
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
This builds on the ability to run more than one vcore on a physical
core by using the micro-threading (split-core) modes of the POWER8
chip. Previously, only vcores from the same VM could be run together,
and (on POWER8) only if they had just one thread per core. With the
ability to split the core on guest entry and unsplit it on guest exit,
we can run up to 8 vcpu threads from up to 4 different VMs, and we can
run multiple vcores with 2 or 4 vcpus per vcore.
Dynamic micro-threading is only available if the static configuration
of the cores is whole-core mode (unsplit), and only on POWER8.
To manage this, we introduce a new kvm_split_mode struct which is
shared across all of the subcores in the core, with a pointer in the
paca on each thread. In addition we extend the core_info struct to
have information on each subcore. When deciding whether to add a
vcore to the set already on the core, we now have two possibilities:
(a) piggyback the vcore onto an existing subcore, or (b) start a new
subcore.
Currently, when any vcpu needs to exit the guest and switch to host
virtual mode, we interrupt all the threads in all subcores and switch
the core back to whole-core mode. It may be possible in future to
allow some of the subcores to keep executing in the guest while
subcore 0 switches to the host, but that is not implemented in this
patch.
This adds a module parameter called dynamic_mt_modes which controls
which micro-threading (split-core) modes the code will consider, as a
bitmap. In other words, if it is 0, no micro-threading mode is
considered; if it is 2, only 2-way micro-threading is considered; if
it is 4, only 4-way, and if it is 6, both 2-way and 4-way
micro-threading mode will be considered. The default is 6.
With this, we now have secondary threads which are the primary thread
for their subcore and therefore need to do the MMU switch. These
threads will need to be started even if they have no vcpu to run, so
we use the vcore pointer in the PACA rather than the vcpu pointer to
trigger them.
It is now possible for thread 0 to find that an exit has been
requested before it gets to switch the subcore state to the guest. In
that case we haven't added the guest's timebase offset to the
timebase, so we need to be careful not to subtract the offset in the
guest exit path. In fact we just skip the whole path that switches
back to host context, since we haven't switched to the guest context.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
When running a virtual core of a guest that is configured with fewer
threads per core than the physical cores have, the extra physical
threads are currently unused. This makes it possible to use them to
run one or more other virtual cores from the same guest when certain
conditions are met. This applies on POWER7, and on POWER8 to guests
with one thread per virtual core. (It doesn't apply to POWER8 guests
with multiple threads per vcore because they require a 1-1 virtual to
physical thread mapping in order to be able to use msgsndp and the
TIR.)
The idea is that we maintain a list of preempted vcores for each
physical cpu (i.e. each core, since the host runs single-threaded).
Then, when a vcore is about to run, it checks to see if there are
any vcores on the list for its physical cpu that could be
piggybacked onto this vcore's execution. If so, those additional
vcores are put into state VCORE_PIGGYBACK and their runnable VCPU
threads are started as well as the original vcore, which is called
the master vcore.
After the vcores have exited the guest, the extra ones are put back
onto the preempted list if any of their VCPUs are still runnable and
not idle.
This means that vcpu->arch.ptid is no longer necessarily the same as
the physical thread that the vcpu runs on. In order to make it easier
for code that wants to send an IPI to know which CPU to target, we
now store that in a new field in struct vcpu_arch, called thread_cpu.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Tested-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
On this switch branch the regs initialization
doesn't happen so add it.
This was found with the help of a static
code analysis tool.
Signed-off-by: Laurentiu Tudor <Laurentiu.Tudor@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
When compiling the KVM code for POWER with "make C=1", sparse
complains about functions missing proper prototypes and a 64-bit
constant missing the ULL prefix. Let's fix this by making the
functions static or by including the proper header with the
prototypes, and by appending a ULL prefix to the constant
PPC_MPPE_ADDRESS_MASK.
Signed-off-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Alexander Graf <agraf@suse.de>