On platforms that have firmware support for reading/writing per-dimm
label space, a portion of the dimm may be accessible via an interleave
set PMEM mapping in addition to the dimm's BLK (block-data-window
aperture(s)) interface. A label, stored in a "configuration data
region" on the dimm, disambiguates which dimm addresses are accessed
through which exclusive interface.
Add infrastructure that allows the kernel to block modifications to a
label in the set while any member dimm is active. Note that this is
meant only for enforcing "no modifications of active labels" via the
coarse ioctl command. Adding/deleting namespaces from an active
interleave set is always possible via sysfs.
Another aspect of tracking interleave sets is tracking their integrity
when DIMMs in a set are physically re-ordered. For this purpose we
generate an "interleave-set cookie" that can be recorded in a label and
validated against the current configuration. It is the bus provider
implementation's responsibility to calculate the interleave set cookie
and attach it to a given region.
Cc: Neil Brown <neilb@suse.de>
Cc: <linux-acpi@vger.kernel.org>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Robert Moore <robert.moore@intel.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
nd_pmem attaches to persistent memory regions and namespaces emitted by
the libnvdimm subsystem, and, same as the original pmem driver, presents
the system-physical-address range as a block device.
The existing e820-type-12 to pmem setup is converted to an nvdimm_bus
that emits an nd_namespace_io device.
Note that the X in 'pmemX' is now derived from the parent region. This
provides some stability to the pmem devices names from boot-to-boot.
The minor numbers are also more predictable by passing 0 to
alloc_disk().
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Tested-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Prepare the pmem driver to consume PMEM namespaces emitted by regions of
an nvdimm_bus instance. No functional change.
Acked-by: Christoph Hellwig <hch@lst.de>
Tested-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The libnvdimm region driver is an intermediary driver that translates
non-volatile "region"s into "namespace" sub-devices that are surfaced by
persistent memory block-device drivers (PMEM and BLK).
ACPI 6 introduces the concept that a given nvdimm may simultaneously
offer multiple access modes to its media through direct PMEM load/store
access, or windowed BLK mode. Existing nvdimms mostly implement a PMEM
interface, some offer a BLK-like mode, but never both as ACPI 6 defines.
If an nvdimm is single interfaced, then there is no need for dimm
metadata labels. For these devices we can take the region boundaries
directly to create a child namespace device (nd_namespace_io).
Acked-by: Christoph Hellwig <hch@lst.de>
Tested-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
A "region" device represents the maximum capacity of a BLK range (mmio
block-data-window(s)), or a PMEM range (DAX-capable persistent memory or
volatile memory), without regard for aliasing. Aliasing, in the
dimm-local address space (DPA), is resolved by metadata on a dimm to
designate which exclusive interface will access the aliased DPA ranges.
Support for the per-dimm metadata/label arrvies is in a subsequent
patch.
The name format of "region" devices is "regionN" where, like dimms, N is
a global ida index assigned at discovery time. This id is not reliable
across reboots nor in the presence of hotplug. Look to attributes of
the region or static id-data of the sub-namespace to generate a
persistent name. However, if the platform configuration does not change
it is reasonable to expect the same region id to be assigned at the next
boot.
"region"s have 2 generic attributes "size", and "mapping"s where:
- size: the BLK accessible capacity or the span of the
system physical address range in the case of PMEM.
- mappingN: a tuple describing a dimm's contribution to the region's
capacity in the format (<nmemX>,<dpa>,<size>). For a PMEM-region
there will be at least one mapping per dimm in the interleave set. For
a BLK-region there is only "mapping0" listing the starting DPA of the
BLK-region and the available DPA capacity of that space (matches "size"
above).
The max number of mappings per "region" is hard coded per the
constraints of sysfs attribute groups. That said the number of mappings
per region should never exceed the maximum number of possible dimms in
the system. If the current number turns out to not be enough then the
"mappings" attribute clarifies how many there are supposed to be. "32
should be enough for anybody...".
Cc: Neil Brown <neilb@suse.de>
Cc: <linux-acpi@vger.kernel.org>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Robert Moore <robert.moore@intel.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
* Implement the device-model infrastructure for loading modules and
attaching drivers to nvdimm devices. This is a simple association of a
nd-device-type number with a driver that has a bitmask of supported
device types. To facilitate userspace bind/unbind operations 'modalias'
and 'devtype', that also appear in the uevent, are added as generic
sysfs attributes for all nvdimm devices. The reason for the device-type
number is to support sub-types within a given parent devtype, be it a
vendor-specific sub-type or otherwise.
* The first consumer of this infrastructure is the driver
for dimm devices. It simply uses control messages to retrieve and
store the configuration-data image (label set) from each dimm.
Note: nd_device_register() arranges for asynchronous registration of
nvdimm bus devices by default.
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Neil Brown <neilb@suse.de>
Acked-by: Christoph Hellwig <hch@lst.de>
Tested-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Most discovery/configuration of the nvdimm-subsystem is done via sysfs
attributes. However, some nvdimm_bus instances, particularly the
ACPI.NFIT bus, define a small set of messages that can be passed to the
platform. For convenience we derive the initial libnvdimm-ioctl command
formats directly from the NFIT DSM Interface Example formats.
ND_CMD_SMART: media health and diagnostics
ND_CMD_GET_CONFIG_SIZE: size of the label space
ND_CMD_GET_CONFIG_DATA: read label space
ND_CMD_SET_CONFIG_DATA: write label space
ND_CMD_VENDOR: vendor-specific command passthrough
ND_CMD_ARS_CAP: report address-range-scrubbing capabilities
ND_CMD_ARS_START: initiate scrubbing
ND_CMD_ARS_STATUS: report on scrubbing state
ND_CMD_SMART_THRESHOLD: configure alarm thresholds for smart events
If a platform later defines different commands than this set it is
straightforward to extend support to those formats.
Most of the commands target a specific dimm. However, the
address-range-scrubbing commands target the bus. The 'commands'
attribute in sysfs of an nvdimm_bus, or nvdimm, enumerate the supported
commands for that object.
Cc: <linux-acpi@vger.kernel.org>
Cc: Robert Moore <robert.moore@intel.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reported-by: Nicholas Moulin <nicholas.w.moulin@linux.intel.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Enable nvdimm devices to be registered on a nvdimm_bus. The kernel
assigned device id for nvdimm devicesis dynamic. If userspace needs a
more static identifier it should consult a provider-specific attribute.
In the case where NFIT is the provider, the 'nmemX/nfit/handle' or
'nmemX/nfit/serial' attributes may be used for this purpose.
Cc: Neil Brown <neilb@suse.de>
Cc: <linux-acpi@vger.kernel.org>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Robert Moore <robert.moore@intel.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The control device for a nvdimm_bus is registered as an "nd" class
device. The expectation is that there will usually only be one "nd" bus
registered under /sys/class/nd. However, we allow for the possibility
of multiple buses and they will listed in discovery order as
ndctl0...ndctlN. This character device hosts the ioctl for passing
control messages. The initial command set has a 1:1 correlation with
the commands listed in the by the "NFIT DSM Example" document [1], but
this scheme is extensible to future command sets.
Note, nd_ioctl() and the backing ->ndctl() implementation are defined in
a subsequent patch. This is simply the initial registrations and sysfs
attributes.
[1]: http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
Cc: Neil Brown <neilb@suse.de>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: <linux-acpi@vger.kernel.org>
Cc: Robert Moore <robert.moore@intel.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
A struct nvdimm_bus is the anchor device for registering nvdimm
resources and interfaces, for example, a character control device,
nvdimm devices, and I/O region devices. The ACPI NFIT (NVDIMM Firmware
Interface Table) is one possible platform description for such
non-volatile memory resources in a system. The nfit.ko driver attaches
to the "ACPI0012" device that indicates the presence of the NFIT and
parses the table to register a struct nvdimm_bus instance.
Cc: <linux-acpi@vger.kernel.org>
Cc: Lv Zheng <lv.zheng@intel.com>
Cc: Robert Moore <robert.moore@intel.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Beginning at commit d52d3997f8 ("ipv6: Create percpu rt6_info"), the
following INFO splat is logged:
===============================
[ INFO: suspicious RCU usage. ]
4.1.0-rc7-next-20150612 #1 Not tainted
-------------------------------
kernel/sched/core.c:7318 Illegal context switch in RCU-bh read-side critical section!
other info that might help us debug this:
rcu_scheduler_active = 1, debug_locks = 0
3 locks held by systemd/1:
#0: (rtnl_mutex){+.+.+.}, at: [<ffffffff815f0c8f>] rtnetlink_rcv+0x1f/0x40
#1: (rcu_read_lock_bh){......}, at: [<ffffffff816a34e2>] ipv6_add_addr+0x62/0x540
#2: (addrconf_hash_lock){+...+.}, at: [<ffffffff816a3604>] ipv6_add_addr+0x184/0x540
stack backtrace:
CPU: 0 PID: 1 Comm: systemd Not tainted 4.1.0-rc7-next-20150612 #1
Hardware name: TOSHIBA TECRA A50-A/TECRA A50-A, BIOS Version 4.20 04/17/2014
Call Trace:
dump_stack+0x4c/0x6e
lockdep_rcu_suspicious+0xe7/0x120
___might_sleep+0x1d5/0x1f0
__might_sleep+0x4d/0x90
kmem_cache_alloc+0x47/0x250
create_object+0x39/0x2e0
kmemleak_alloc_percpu+0x61/0xe0
pcpu_alloc+0x370/0x630
Additional backtrace lines are truncated. In addition, the above splat
is followed by several "BUG: sleeping function called from invalid
context at mm/slub.c:1268" outputs. As suggested by Martin KaFai Lau,
these are the clue to the fix. Routine kmemleak_alloc_percpu() always
uses GFP_KERNEL for its allocations, whereas it should follow the gfp
from its callers.
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Larry Finger <Larry.Finger@lwfinger.net>
Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: <stable@vger.kernel.org> [3.18+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since commit 077fcf116c ("mm/thp: allocate transparent hugepages on
local node"), we handle THP allocations on page fault in a special way -
for non-interleave memory policies, the allocation is only attempted on
the node local to the current CPU, if the policy's nodemask allows the
node.
This is motivated by the assumption that THP benefits cannot offset the
cost of remote accesses, so it's better to fallback to base pages on the
local node (which might still be available, while huge pages are not due
to fragmentation) than to allocate huge pages on a remote node.
The nodemask check prevents us from violating e.g. MPOL_BIND policies
where the local node is not among the allowed nodes. However, the
current implementation can still give surprising results for the
MPOL_PREFERRED policy when the preferred node is different than the
current CPU's local node.
In such case we should honor the preferred node and not use the local
node, which is what this patch does. If hugepage allocation on the
preferred node fails, we fall back to base pages and don't try other
nodes, with the same motivation as is done for the local node hugepage
allocations. The patch also moves the MPOL_INTERLEAVE check around to
simplify the hugepage specific test.
The difference can be demonstrated using in-tree transhuge-stress test
on the following 2-node machine where half memory on one node was
occupied to show the difference.
> numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 24 25 26 27 28 29 30 31 32 33 34 35
node 0 size: 7878 MB
node 0 free: 3623 MB
node 1 cpus: 12 13 14 15 16 17 18 19 20 21 22 23 36 37 38 39 40 41 42 43 44 45 46 47
node 1 size: 8045 MB
node 1 free: 7818 MB
node distances:
node 0 1
0: 10 21
1: 21 10
Before the patch:
> numactl -p0 -C0 ./transhuge-stress
transhuge-stress: 2.197 s/loop, 0.276 ms/page, 7249.168 MiB/s 7962 succeed, 0 failed, 1786 different pages
> numactl -p0 -C12 ./transhuge-stress
transhuge-stress: 2.962 s/loop, 0.372 ms/page, 5376.172 MiB/s 7962 succeed, 0 failed, 3873 different pages
Number of successful THP allocations corresponds to free memory on node 0 in
the first case and node 1 in the second case, i.e. -p parameter is ignored and
cpu binding "wins".
After the patch:
> numactl -p0 -C0 ./transhuge-stress
transhuge-stress: 2.183 s/loop, 0.274 ms/page, 7295.516 MiB/s 7962 succeed, 0 failed, 1760 different pages
> numactl -p0 -C12 ./transhuge-stress
transhuge-stress: 2.878 s/loop, 0.361 ms/page, 5533.638 MiB/s 7962 succeed, 0 failed, 1750 different pages
> numactl -p1 -C0 ./transhuge-stress
transhuge-stress: 4.628 s/loop, 0.581 ms/page, 3440.893 MiB/s 7962 succeed, 0 failed, 3918 different pages
The -p parameter is respected regardless of cpu binding.
> numactl -C0 ./transhuge-stress
transhuge-stress: 2.202 s/loop, 0.277 ms/page, 7230.003 MiB/s 7962 succeed, 0 failed, 1750 different pages
> numactl -C12 ./transhuge-stress
transhuge-stress: 3.020 s/loop, 0.379 ms/page, 5273.324 MiB/s 7962 succeed, 0 failed, 3916 different pages
Without -p parameter, hugepage restriction to CPU-local node works as before.
Fixes: 077fcf116c ("mm/thp: allocate transparent hugepages on local node")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: <stable@vger.kernel.org> [4.0+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
One of the rocksdb people noticed that when you do something like this
fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, 10M)
pwrite(fd, buf, 5M, 0)
ftruncate(5M)
on tmpfs, the file would still take up 10M: which led to super fun
issues because we were getting ENOSPC before we thought we should be
getting ENOSPC. This patch fixes the problem, and mirrors what all the
other fs'es do (and was agreed to be the correct behaviour at LSF).
I tested it locally to make sure it worked properly with the following
xfs_io -f -c "falloc -k 0 10M" -c "pwrite 0 5M" -c "truncate 5M" file
Without the patch we have "Blocks: 20480", with the patch we have the
correct value of "Blocks: 10240".
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When hot add two nodes continuously, we found the vmemmap region info is
a bit messed. The last region of node 2 is printed when node 3 hot
added, like the following:
Initmem setup node 2 [mem 0x0000000000000000-0xffffffffffffffff]
On node 2 totalpages: 0
Built 2 zonelists in Node order, mobility grouping on. Total pages: 16090539
Policy zone: Normal
init_memory_mapping: [mem 0x40000000000-0x407ffffffff]
[mem 0x40000000000-0x407ffffffff] page 1G
[ffffea1000000000-ffffea10001fffff] PMD -> [ffff8a077d800000-ffff8a077d9fffff] on node 2
[ffffea1000200000-ffffea10003fffff] PMD -> [ffff8a077de00000-ffff8a077dffffff] on node 2
...
[ffffea101f600000-ffffea101f9fffff] PMD -> [ffff8a074ac00000-ffff8a074affffff] on node 2
[ffffea101fa00000-ffffea101fdfffff] PMD -> [ffff8a074a800000-ffff8a074abfffff] on node 2
Initmem setup node 3 [mem 0x0000000000000000-0xffffffffffffffff]
On node 3 totalpages: 0
Built 3 zonelists in Node order, mobility grouping on. Total pages: 16090539
Policy zone: Normal
init_memory_mapping: [mem 0x60000000000-0x607ffffffff]
[mem 0x60000000000-0x607ffffffff] page 1G
[ffffea101fe00000-ffffea101fffffff] PMD -> [ffff8a074a400000-ffff8a074a5fffff] on node 2 <=== node 2 ???
[ffffea1800000000-ffffea18001fffff] PMD -> [ffff8a074a600000-ffff8a074a7fffff] on node 3
[ffffea1800200000-ffffea18005fffff] PMD -> [ffff8a074a000000-ffff8a074a3fffff] on node 3
[ffffea1800600000-ffffea18009fffff] PMD -> [ffff8a0749c00000-ffff8a0749ffffff] on node 3
...
The cause is the last region was missed at the and of hot add memory,
and p_start, p_end, node_start were not reset, so when hot add memory to
a new node, it will consider they are not contiguous blocks and print
the previous one. So we print the last vmemmap region at the end of hot
add memory to avoid the confusion.
Signed-off-by: Zhu Guihua <zhugh.fnst@cn.fujitsu.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The simple check for zero length memory mapping may be performed
earlier. So that in case of zero length memory mapping some unnecessary
code is not executed at all. It does not make the code less readable
and saves some CPU cycles.
Signed-off-by: Piotr Kwapulinski <kwapulinski.piotr@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The kmemleak memory scanning uses finer grained object->lock spinlocks
primarily to avoid races with the memory block freeing. However, the
pointer lookup in the rb tree requires the kmemleak_lock to be held.
This is currently done in the find_and_get_object() function for each
pointer-like location read during scanning. While this allows a low
latency on kmemleak_*() callbacks on other CPUs, the memory scanning is
slower.
This patch moves the kmemleak_lock outside the scan_block() loop,
acquiring/releasing it only once per scanned memory block. The
allow_resched logic is moved outside scan_block() and a new
scan_large_block() function is implemented which splits large blocks in
MAX_SCAN_SIZE chunks with cond_resched() calls in-between. A redundant
(object->flags & OBJECT_NO_SCAN) check is also removed from
scan_object().
With this patch, the kmemleak scanning performance is significantly
improved: at least 50% with lock debugging disabled and over an order of
magnitude with lock proving enabled (on an arm64 system).
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
While very unlikely (usually kmemleak or sl*b bug), the create_object()
function in mm/kmemleak.c may fail to insert a newly allocated object into
the rb tree. When this happens, kmemleak disables itself and prints
additional information about the object already found in the rb tree.
Such printing is done with the parent->lock acquired, however the
kmemleak_lock is already held. This is a potential race with the scanning
thread which acquires object->lock and kmemleak_lock in a
This patch removes the locking around the 'parent' object information
printing. Such object cannot be freed or removed from object_tree_root
and object_list since kmemleak_lock is already held. There is a very
small risk that some of the object data is being modified on another CPU
but the only downside is inconsistent information printing.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The kmemleak_do_cleanup() work thread already waits for the kmemleak_scan
thread to finish via kthread_stop(). Waiting in kthread_stop() while
scan_mutex is held may lead to deadlock if kmemleak_scan_thread() also
waits to acquire for scan_mutex.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Calling delete_object_*() on the same pointer is not a standard use case
(unless there is a bug in the code calling kmemleak_free()). However,
during kmemleak disabling (error or user triggered via /sys), there is a
potential race between kmemleak_free() calls on a CPU and
__kmemleak_do_cleanup() on a different CPU.
The current delete_object_*() implementation first performs a look-up
holding kmemleak_lock, increments the object->use_count and then
re-acquires kmemleak_lock to remove the object from object_tree_root and
object_list.
This patch simplifies the delete_object_*() mechanism to both look up
and remove an object from the object_tree_root and object_list
atomically (guarded by kmemleak_lock). This allows safe concurrent
calls to delete_object_*() on the same pointer without additional
locking for synchronising the kmemleak_free_enabled flag.
A side effect is a slight improvement in the delete_object_*() performance
by avoiding acquiring kmemleak_lock twice and incrementing/decrementing
object->use_count.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The kmemleak scanning thread can run for minutes. Callbacks like
kmemleak_free() are allowed during this time, the race being taken care
of by the object->lock spinlock. Such lock also prevents a memory block
from being freed or unmapped while it is being scanned by blocking the
kmemleak_free() -> ... -> __delete_object() function until the lock is
released in scan_object().
When a kmemleak error occurs (e.g. it fails to allocate its metadata),
kmemleak_enabled is set and __delete_object() is no longer called on
freed objects. If kmemleak_scan is running at the same time,
kmemleak_free() no longer waits for the object scanning to complete,
allowing the corresponding memory block to be freed or unmapped (in the
case of vfree()). This leads to kmemleak_scan potentially triggering a
page fault.
This patch separates the kmemleak_free() enabling/disabling from the
overall kmemleak_enabled nob so that we can defer the disabling of the
object freeing tracking until the scanning thread completed. The
kmemleak_free_part() is deliberately ignored by this patch since this is
only called during boot before the scanning thread started.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Reported-by: Vignesh Radhakrishnan <vigneshr@codeaurora.org>
Tested-by: Vignesh Radhakrishnan <vigneshr@codeaurora.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
memcg->under_oom tracks whether the memcg is under OOM conditions and is
an atomic_t counter managed with mem_cgroup_[un]mark_under_oom(). While
atomic_t appears to be simple synchronization-wise, when used as a
synchronization construct like here, it's trickier and more error-prone
due to weak memory ordering rules, especially around atomic_read(), and
false sense of security.
For example, both non-trivial read sites of memcg->under_oom are a bit
problematic although not being actually broken.
* mem_cgroup_oom_register_event()
It isn't explicit what guarantees the memory ordering between event
addition and memcg->under_oom check. This isn't broken only because
memcg_oom_lock is used for both event list and memcg->oom_lock.
* memcg_oom_recover()
The lockless test doesn't have any explanation why this would be
safe.
mem_cgroup_[un]mark_under_oom() are very cold paths and there's no point
in avoiding locking memcg_oom_lock there. This patch converts
memcg->under_oom from atomic_t to int, puts their modifications under
memcg_oom_lock and documents why the lockless test in
memcg_oom_recover() is safe.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since commit 4942642080 ("mm: memcg: handle non-error OOM situations
more gracefully"), nobody uses mem_cgroup->oom_wakeups. Remove it.
While at it, also fold memcg_wakeup_oom() into memcg_oom_recover() which
is its only user. This cleanup was suggested by Michal.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change frontswap single pointer to a singly linked list of frontswap
implementations. Update Xen tmem implementation as register no longer
returns anything.
Frontswap only keeps track of a single implementation; any
implementation that registers second (or later) will replace the
previously registered implementation, and gets a pointer to the previous
implementation that the new implementation is expected to pass all
frontswap functions to if it can't handle the function itself. However
that method doesn't really make much sense, as passing that work on to
every implementation adds unnecessary work to implementations; instead,
frontswap should simply keep a list of all registered implementations
and try each implementation for any function. Most importantly, neither
of the two currently existing frontswap implementations in the kernel
actually do anything with any previous frontswap implementation that
they replace when registering.
This allows frontswap to successfully manage multiple implementations by
keeping a list of them all.
Signed-off-by: Dan Streetman <ddstreet@ieee.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
UEFI GetMemoryMap() uses a new attribute bit to mark mirrored memory
address ranges. See UEFI 2.5 spec pages 157-158:
http://www.uefi.org/sites/default/files/resources/UEFI%202_5.pdf
On EFI enabled systems scan the memory map and tell memblock about any
mirrored ranges.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Hanjun Guo <guohanjun@huawei.com>
Cc: Xiexiuqi <xiexiuqi@huawei.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Try to allocate all boot time kernel data structures from mirrored
memory.
If we run out of mirrored memory print warnings, but fall back to using
non-mirrored memory to make sure that we still boot.
By number of bytes, most of what we allocate at boot time is the page
structures. 64 bytes per 4K page on x86_64 ... or about 1.5% of total
system memory. For workloads where the bulk of memory is allocated to
applications this may represent a useful improvement to system
availability since 1.5% of total memory might be a third of the memory
allocated to the kernel.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Hanjun Guo <guohanjun@huawei.com>
Cc: Xiexiuqi <xiexiuqi@huawei.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some high end Intel Xeon systems report uncorrectable memory errors as a
recoverable machine check. Linux has included code for some time to
process these and just signal the affected processes (or even recover
completely if the error was in a read only page that can be replaced by
reading from disk).
But we have no recovery path for errors encountered during kernel code
execution. Except for some very specific cases were are unlikely to ever
be able to recover.
Enter memory mirroring. Actually 3rd generation of memory mirroing.
Gen1: All memory is mirrored
Pro: No s/w enabling - h/w just gets good data from other side of the
mirror
Con: Halves effective memory capacity available to OS/applications
Gen2: Partial memory mirror - just mirror memory begind some memory controllers
Pro: Keep more of the capacity
Con: Nightmare to enable. Have to choose between allocating from
mirrored memory for safety vs. NUMA local memory for performance
Gen3: Address range partial memory mirror - some mirror on each memory
controller
Pro: Can tune the amount of mirror and keep NUMA performance
Con: I have to write memory management code to implement
The current plan is just to use mirrored memory for kernel allocations.
This has been broken into two phases:
1) This patch series - find the mirrored memory, use it for boot time
allocations
2) Wade into mm/page_alloc.c and define a ZONE_MIRROR to pick up the
unused mirrored memory from mm/memblock.c and only give it out to
select kernel allocations (this is still being scoped because
page_alloc.c is scary).
This patch (of 3):
Add extra "flags" to memblock to allow selection of memory based on
attribute. No functional changes
Signed-off-by: Tony Luck <tony.luck@intel.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Hanjun Guo <guohanjun@huawei.com>
Cc: Xiexiuqi <xiexiuqi@huawei.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
page_cache_read, do_generic_file_read, __generic_file_splice_read and
__ntfs_grab_cache_pages currently ignore mapping_gfp_mask when calling
add_to_page_cache_lru which might cause recursion into fs down in the
direct reclaim path if the mapping really relies on GFP_NOFS semantic.
This doesn't seem to be the case now because page_cache_read (page fault
path) doesn't seem to suffer from the reclaim recursion issues and
do_generic_file_read and __generic_file_splice_read also shouldn't be
called under fs locks which would deadlock in the reclaim path. Anyway it
is better to obey mapping gfp mask and prevent from later breakage.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Anton Altaparmakov <anton@tuxera.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In oom_kill_process(), the variable 'points' is unsigned int. Print it as
such.
Signed-off-by: Wang Long <long.wanglong@huawei.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
alloc_huge_page and hugetlb_reserve_pages use region_chg to calculate the
number of pages which will be added to the reserve map. Subpool and
global reserve counts are adjusted based on the output of region_chg.
Before the pages are actually added to the reserve map, these routines
could race and add fewer pages than expected. If this happens, the
subpool and global reserve counts are not correct.
Compare the number of pages actually added (region_add) to those expected
to added (region_chg). If fewer pages are actually added, this indicates
a race and adjust counters accordingly.
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Modify region_add() to keep track of regions(pages) added to the reserve
map and return this value. The return value can be compared to the return
value of region_chg() to determine if the map was modified between calls.
Make vma_commit_reservation() also pass along the return value of
region_add(). In the normal case, we want vma_commit_reservation to
return the same value as the preceding call to vma_needs_reservation.
Create a common __vma_reservation_common routine to help keep the special
case return values in sync
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
While working on hugetlbfs fallocate support, I noticed the following race
in the existing code. It is unlikely that this race is hit very often in
the current code. However, if more functionality to add and remove pages
to hugetlbfs mappings (such as fallocate) is added the likelihood of
hitting this race will increase.
alloc_huge_page and hugetlb_reserve_pages use information from the reserve
map to determine if there are enough available huge pages to complete the
operation, as well as adjust global reserve and subpool usage counts. The
order of operations is as follows:
- call region_chg() to determine the expected change based on reserve map
- determine if enough resources are available for this operation
- adjust global counts based on the expected change
- call region_add() to update the reserve map
The issue is that reserve map could change between the call to region_chg
and region_add. In this case, the counters which were adjusted based on
the output of region_chg will not be correct.
In order to hit this race today, there must be an existing shared hugetlb
mmap created with the MAP_NORESERVE flag. A page fault to allocate a huge
page via this mapping must occur at the same another task is mapping the
same region without the MAP_NORESERVE flag.
The patch set does not prevent the race from happening. Rather, it adds
simple functionality to detect when the race has occurred. If a race is
detected, then the incorrect counts are adjusted.
Review comments pointed out the need for documentation of the existing
region/reserve map routines. This patch set also adds documentation in
this area.
This patch (of 3):
This is a documentation only patch and does not modify any code.
Descriptions of the routines used for reserve map/region tracking are
added.
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is a very subtle difference between mmap()+mlock() vs
mmap(MAP_LOCKED) semantic. The former one fails if the population of the
area fails while the later one doesn't. This basically means that
mmap(MAPLOCKED) areas might see major fault after mmap syscall returns
which is not the case for mlock. mmap man page has already been altered
but Documentation/vm/unevictable-lru.txt deserves a clarification as well.
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Reported-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kenter/kleave/kdebug are wrapper macros to print functions flow and debug
information. This set was written before pr_devel() was introduced, so it
was controlled by "#if 0" construction. It is questionable if anyone is
using them [1] now.
This patch removes these macros, converts numerous printk(KERN_WARNING,
...) to use general pr_warn(...) and removes debug print line from
validate_mmap_request() function.
Signed-off-by: Leon Romanovsky <leon@leon.nu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We have confusing functions to clear pmd, pmd_clear_* and pmd_clear. Add
_huge_ to pmdp_clear functions so that we are clear that they operate on
hugepage pte.
We don't bother about other functions like pmdp_set_wrprotect,
pmdp_clear_flush_young, because they operate on PTE bits and hence
indicate they are operating on hugepage ptes
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Also move the pmd_trans_huge check to generic code.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Architectures like ppc64 [1] need to do special things while clearing pmd
before a collapse. For them this operation is largely different from a
normal hugepage pte clear. Hence add a separate function to clear pmd
before collapse. After this patch pmdp_* functions operate only on
hugepage pte, and not on regular pmd_t values pointing to page table.
[1] ppc64 needs to invalidate all the normal page pte mappings we already
have inserted in the hardware hash page table. But before doing that we
need to make sure there are no parallel hash page table insert going on.
So we need to do a kick_all_cpus_sync() before flushing the older hash
table entries. By moving this to a separate function we capture these
details and mention how it is different from a hugepage pte clear.
This patch is a cleanup and only does code movement for clarity. There
should not be any change in functionality.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
RAS user space tools like rasdaemon which base on trace event, could
receive mce error event, but no memory recovery result event. So, I want
to add this event to make this scenario complete.
This patch add a event at ras group for memory-failure.
The output like below:
# tracer: nop
#
# entries-in-buffer/entries-written: 2/2 #P:24
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
mce-inject-13150 [001] .... 277.019359: memory_failure_event: pfn 0x19869: recovery action for free buddy page: Delayed
[xiexiuqi@huawei.com: fix build error]
Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Chen Gong <gong.chen@linux.intel.com>
Cc: Jim Davis <jim.epost@gmail.com>
Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change type of action_result's param 3 to enum for type consistency,
and rename mf_outcome to mf_result for clearly.
Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Chen Gong <gong.chen@linux.intel.com>
Cc: Jim Davis <jim.epost@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Export 'outcome' and 'action_page_type' to mm.h, so we could use
this emnus outside.
This patch is preparation for adding trace events for memory-failure
recovery action.
Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Chen Gong <gong.chen@linux.intel.com>
Cc: Jim Davis <jim.epost@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Historically memcg overhead was high even if memcg was unused. This has
improved a lot but it still showed up in a profile summary as being a
problem.
/usr/src/linux-4.0-vanilla/mm/memcontrol.c 6.6441 395842
mem_cgroup_try_charge 2.950% 175781
__mem_cgroup_count_vm_event 1.431% 85239
mem_cgroup_page_lruvec 0.456% 27156
mem_cgroup_commit_charge 0.392% 23342
uncharge_list 0.323% 19256
mem_cgroup_update_lru_size 0.278% 16538
memcg_check_events 0.216% 12858
mem_cgroup_charge_statistics.isra.22 0.188% 11172
try_charge 0.150% 8928
commit_charge 0.141% 8388
get_mem_cgroup_from_mm 0.121% 7184
That is showing that 6.64% of system CPU cycles were in memcontrol.c and
dominated by mem_cgroup_try_charge. The annotation shows that the bulk
of the cost was checking PageSwapCache which is expected to be cache hot
but is very expensive. The problem appears to be that __SetPageUptodate
is called just before the check which is a write barrier. It is
required to make sure struct page and page data is written before the
PTE is updated and the data visible to userspace. memcg charging does
not require or need the barrier but gets unfairly hit with the cost so
this patch attempts the charging before the barrier. Aside from the
accidental cost to memcg there is the added benefit that the barrier is
avoided if the page cannot be charged. When applied the relevant
profile summary is as follows.
/usr/src/linux-4.0-chargefirst-v2r1/mm/memcontrol.c 3.7907 223277
__mem_cgroup_count_vm_event 1.143% 67312
mem_cgroup_page_lruvec 0.465% 27403
mem_cgroup_commit_charge 0.381% 22452
uncharge_list 0.332% 19543
mem_cgroup_update_lru_size 0.284% 16704
get_mem_cgroup_from_mm 0.271% 15952
mem_cgroup_try_charge 0.237% 13982
memcg_check_events 0.222% 13058
mem_cgroup_charge_statistics.isra.22 0.185% 10920
commit_charge 0.140% 8235
try_charge 0.131% 7716
That brings the overhead down to 3.79% and leaves the memcg fault
accounting to the root cgroup but it's an improvement. The difference
in headline performance of the page fault microbench is marginal as
memcg is such a small component of it.
pft faults
4.0.0 4.0.0
vanilla chargefirst
Hmean faults/cpu-1 1443258.1051 ( 0.00%) 1509075.7561 ( 4.56%)
Hmean faults/cpu-3 1340385.9270 ( 0.00%) 1339160.7113 ( -0.09%)
Hmean faults/cpu-5 875599.0222 ( 0.00%) 874174.1255 ( -0.16%)
Hmean faults/cpu-7 601146.6726 ( 0.00%) 601370.9977 ( 0.04%)
Hmean faults/cpu-8 510728.2754 ( 0.00%) 510598.8214 ( -0.03%)
Hmean faults/sec-1 1432084.7845 ( 0.00%) 1497935.5274 ( 4.60%)
Hmean faults/sec-3 3943818.1437 ( 0.00%) 3941920.1520 ( -0.05%)
Hmean faults/sec-5 3877573.5867 ( 0.00%) 3869385.7553 ( -0.21%)
Hmean faults/sec-7 3991832.0418 ( 0.00%) 3992181.4189 ( 0.01%)
Hmean faults/sec-8 3987189.8167 ( 0.00%) 3986452.2204 ( -0.02%)
It's only visible at single threaded. The overhead is there for higher
threads but other factors dominate.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
hugetlb pages uses add_to_page_cache to track shared mappings. This is
OK from the data structure point of view but it is less so from the
NR_FILE_PAGES accounting:
- huge pages are accounted as 4k which is clearly wrong
- this counter is used as the amount of the reclaimable page
cache which is incorrect as well because hugetlb pages are
special and not reclaimable
- the counter is then exported to userspace via /proc/meminfo
(in Cached:), /proc/vmstat and /proc/zoneinfo as
nr_file_pages which is confusing at least:
Cached: 8883504 kB
HugePages_Free: 8348
...
Cached: 8916048 kB
HugePages_Free: 156
...
thats 8192 huge pages allocated which is ~16G accounted as 32M
There are usually not that many huge pages in the system for this to
make any visible difference e.g. by fooling __vm_enough_memory or
zone_pagecache_reclaimable.
Fix this by special casing huge pages in both __delete_from_page_cache
and __add_to_page_cache_locked. replace_page_cache_page is currently
only used by fuse and that shouldn't touch hugetlb pages AFAICS but it
is more robust to check for special casing there as well.
Hugetlb pages shouldn't get to any other paths where we do accounting:
- migration - we have a special handling via
hugetlbfs_migrate_page
- shmem - doesn't handle hugetlb pages directly even for
SHM_HUGETLB resp. MAP_HUGETLB
- swapcache - hugetlb is not swapable
This has a user visible effect but I believe it is reasonable because the
previously exported number is simply bogus.
An alternative would be to account hugetlb pages with their real size and
treat them similar to shmem. But this has some drawbacks.
First we would have to special case in kernel users of NR_FILE_PAGES and
considering how hugetlb is special we would have to do it everywhere. We
do not want Cached exported by /proc/meminfo to include it because the
value would be even more misleading.
__vm_enough_memory and zone_pagecache_reclaimable would have to do the
same thing because those pages are simply not reclaimable. The correction
is even not trivial because we would have to consider all active hugetlb
page sizes properly. Users of the counter outside of the kernel would
have to do the same.
So the question is why to account something that needs to be basically
excluded for each reasonable usage. This doesn't make much sense to me.
It seems that this has been broken since hugetlb was introduced but I
haven't checked the whole history.
[akpm@linux-foundation.org: tweak comments]
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Mel Gorman <mgorman@suse.de>
Tested-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The should_alloc_retry() function was meant to encapsulate retry
conditions of the allocator slowpath, but there are still checks
remaining in the main function, and much of how the retrying is
performed also depends on the OOM killer progress. The physical
separation of those conditions make the code hard to follow.
Inline the should_alloc_retry() checks. Notes:
- The __GFP_NOFAIL check is already done in __alloc_pages_may_oom(),
replace it with looping on OOM killer progress
- The pm_suspended_storage() check is meant to skip the OOM killer
when reclaim has no IO available, move to __alloc_pages_may_oom()
- The order <= PAGE_ALLOC_COSTLY order is re-united with its original
counterpart of checking whether reclaim actually made any progress
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The zonelist locking and the oom_sem are two overlapping locks that are
used to serialize global OOM killing against different things.
The historical zonelist locking serializes OOM kills from allocations with
overlapping zonelists against each other to prevent killing more tasks
than necessary in the same memory domain. Only when neither tasklists nor
zonelists from two concurrent OOM kills overlap (tasks in separate memcgs
bound to separate nodes) are OOM kills allowed to execute in parallel.
The younger oom_sem is a read-write lock to serialize OOM killing against
the PM code trying to disable the OOM killer altogether.
However, the OOM killer is a fairly cold error path, there is really no
reason to optimize for highly performant and concurrent OOM kills. And
the oom_sem is just flat-out redundant.
Replace both locking schemes with a single global mutex serializing OOM
kills regardless of context.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Disabling the OOM killer needs to exclude allocators from entering, not
existing victims from exiting.
Right now the only waiter is suspend code, which achieves quiescence by
disabling the OOM killer. But later on we want to add waits that hold
the lock instead to stop new victims from showing up.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It turns out that the mechanism to wait for exiting OOM victims is less
generic than it looks: it won't issue wakeups unless the OOM killer is
disabled.
The reason this check was added was the thought that, since only the OOM
disabling code would wait on this queue, wakeup operations could be
saved when that specific consumer is known to be absent.
However, this is quite the handgrenade. Later attempts to reuse the
waitqueue for other purposes will lead to completely unexpected bugs and
the failure mode will appear seemingly illogical. Generally, providers
shouldn't make unnecessary assumptions about consumers.
This could have been replaced with waitqueue_active(), but it only saves
a few instructions in one of the coldest paths in the kernel. Simply
remove it.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
exit_oom_victim() already knows that TIF_MEMDIE is set, and nobody else
can clear it concurrently. Use clear_thread_flag() directly.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rename unmark_oom_victim() to exit_oom_victim(). Marking and unmarking
are related in functionality, but the interface is not symmetrical at
all: one is an internal OOM killer function used during the killing, the
other is for an OOM victim to signal its own death on exit later on.
This has locking implications, see follow-up changes.
While at it, rename mark_tsk_oom_victim() to mark_oom_victim(), which
is easier on the eye.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Setting oom_killer_disabled to false is atomic, there is no need for
further synchronization with ongoing allocations trying to OOM-kill.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>