Commit 46c043ede4 ("mm: take i_mmap_lock in unmap_mapping_range() for
DAX") moved some code in __dax_pmd_fault() that was responsible for
zeroing newly allocated PMD pages. The new location didn't properly set
up 'kaddr', so when run this code resulted in a NULL pointer BUG.
Fix this by getting the correct 'kaddr' via bdev_direct_access().
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reported-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
SunDong reported the following on
https://bugzilla.kernel.org/show_bug.cgi?id=103841
I think I find a linux bug, I have the test cases is constructed. I
can stable recurring problems in fedora22(4.0.4) kernel version,
arch for x86_64. I construct transparent huge page, when the parent
and child process with MAP_SHARE, MAP_PRIVATE way to access the same
huge page area, it has the opportunity to lead to huge page copy on
write failure, and then it will munmap the child corresponding mmap
area, but then the child mmap area with VM_MAYSHARE attributes, child
process munmap this area can trigger VM_BUG_ON in set_vma_resv_flags
functions (vma - > vm_flags & VM_MAYSHARE).
There were a number of problems with the report (e.g. it's hugetlbfs that
triggers this, not transparent huge pages) but it was fundamentally
correct in that a VM_BUG_ON in set_vma_resv_flags() can be triggered that
looks like this
vma ffff8804651fd0d0 start 00007fc474e00000 end 00007fc475e00000
next ffff8804651fd018 prev ffff8804651fd188 mm ffff88046b1b1800
prot 8000000000000027 anon_vma (null) vm_ops ffffffff8182a7a0
pgoff 0 file ffff88106bdb9800 private_data (null)
flags: 0x84400fb(read|write|shared|mayread|maywrite|mayexec|mayshare|dontexpand|hugetlb)
------------
kernel BUG at mm/hugetlb.c:462!
SMP
Modules linked in: xt_pkttype xt_LOG xt_limit [..]
CPU: 38 PID: 26839 Comm: map Not tainted 4.0.4-default #1
Hardware name: Dell Inc. PowerEdge R810/0TT6JF, BIOS 2.7.4 04/26/2012
set_vma_resv_flags+0x2d/0x30
The VM_BUG_ON is correct because private and shared mappings have
different reservation accounting but the warning clearly shows that the
VMA is shared.
When a private COW fails to allocate a new page then only the process
that created the VMA gets the page -- all the children unmap the page.
If the children access that data in the future then they get killed.
The problem is that the same file is mapped shared and private. During
the COW, the allocation fails, the VMAs are traversed to unmap the other
private pages but a shared VMA is found and the bug is triggered. This
patch identifies such VMAs and skips them.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reported-by: SunDong <sund_sky@126.com>
Reviewed-by: Michal Hocko <mhocko@suse.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: David Rientjes <rientjes@google.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit description is copied from the original post of this bug:
http://comments.gmane.org/gmane.linux.kernel.mm/135349
Kernels after v3.9 use kmalloc_size(INDEX_NODE + 1) to get the next
larger cache size than the size index INDEX_NODE mapping. In kernels
3.9 and earlier we used malloc_sizes[INDEX_L3 + 1].cs_size.
However, sometimes we can't get the right output we expected via
kmalloc_size(INDEX_NODE + 1), causing a BUG().
The mapping table in the latest kernel is like:
index = {0, 1, 2 , 3, 4, 5, 6, n}
size = {0, 96, 192, 8, 16, 32, 64, 2^n}
The mapping table before 3.10 is like this:
index = {0 , 1 , 2, 3, 4 , 5 , 6, n}
size = {32, 64, 96, 128, 192, 256, 512, 2^(n+3)}
The problem on my mips64 machine is as follows:
(1) When configured DEBUG_SLAB && DEBUG_PAGEALLOC && DEBUG_LOCK_ALLOC
&& DEBUG_SPINLOCK, the sizeof(struct kmem_cache_node) will be "150",
and the macro INDEX_NODE turns out to be "2": #define INDEX_NODE
kmalloc_index(sizeof(struct kmem_cache_node))
(2) Then the result of kmalloc_size(INDEX_NODE + 1) is 8.
(3) Then "if(size >= kmalloc_size(INDEX_NODE + 1)" will lead to "size
= PAGE_SIZE".
(4) Then "if ((size >= (PAGE_SIZE >> 3))" test will be satisfied and
"flags |= CFLGS_OFF_SLAB" will be covered.
(5) if (flags & CFLGS_OFF_SLAB)" test will be satisfied and will go to
"cachep->slabp_cache = kmalloc_slab(slab_size, 0u)", and the result
here may be NULL while kernel bootup.
(6) Finally,"BUG_ON(ZERO_OR_NULL_PTR(cachep->slabp_cache));" causes the
BUG info as the following shows (may be only mips64 has this problem):
This patch fixes the problem of kmalloc_size(INDEX_NODE + 1) and removes
the BUG by adding 'size >= 256' check to guarantee that all necessary
small sized slabs are initialized regardless sequence of slab size in
mapping table.
Fixes: e33660165c ("slab: Use common kmalloc_index/kmalloc_size...")
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Reported-by: Liuhailong <liu.hailong6@zte.com.cn>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As include/uapi/linux/userfaultfd.h is a user visible header file, it
should not include kernel-exclusive header files.
So trying to build the userfaultfd test program from the selftests
directory fails, since it contains a reference to linux/compiler.h. As
it turns out, that header is not really needed there, so we can simply
remove it to fix that issue.
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Shuah Khan <shuahkh@osg.samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nikolay Aleksandrov says:
====================
bridge: vlan: cleanups & fixes
This is the first follow-up set, patch 01 reduces the default rhashtable
size and the number of locks that can be allocated. Patch 02 and 04 fix
possible null pointer dereferences due to the new ordering and
initialization on port add/del, and patch 03 moves the "pvid" member in
the net_bridge_vlan_group struct in order to simplify code (similar to how
it was with the older struct). Patch 05 fixes adding a vlan on a port which
is pvid and doesn't have a global context yet.
Please review carefully, I think this is the first use of rhashtable's
"locks_mul" member in the tree and I'd like to make sure it's correct.
Another thing that needs special attention is the nbp_vlan_flush() move
after the rx_handler unregister.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
We should not pass the original flags when creating a context vlan only
because they may contain some flags that change behaviour in the bridge.
The new global context should be with minimal set of flags, so pass 0
and let br_vlan_add() set the master flag only.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a new port is being added we need to make vlgrp available after
rhashtable has been initialized and when removing a port we need to
flush the vlans and free the resources after we're sure noone can use
the port, i.e. after it's removed from the port list and synchronize_rcu
is executed.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
One obvious way to converge more code (which was also used by the
previous vlan code) is to move pvid inside net_bridge_vlan_group. This
allows us to simplify some and remove other port-specific functions.
Also gives us the ability to simply pass the vlan group and use all of the
contained information.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While a new port is being initialized the rx_handler gets set, but the
vlans get initialized later in br_add_if() and in that window if we
receive a frame with a link-local address we can try to dereference
p->vlgrp in:
br_handle_frame() -> br_handle_local_finish() -> br_should_learn()
Fix this by checking vlgrp before using it.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As Stephen pointed out the default initial size is more than we need, so
let's start small (4 elements, thus nelem_hint = 3). Also limit the hash
locks to the number of CPUs as we don't need any write-side scaling and
this looks like the minimum.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Bugs have trickled in for a new feature in 4.2 (MTRR support in guests)
so I'm reverting it all; let's not make this -rc period busier for KVM
than it's been so far. This covers the four reverts from me.
The fifth patch is being reverted because Radim found a bug in the
implementation of stable scheduler clock, *but* also managed to implement
the feature entirely without hypervisor support. So instead of fixing
the hypervisor side we can remove it completely; 4.4 will get the new
implementation.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJWDXc/AAoJEL/70l94x66D8GoH/0WXeSYHn8+Ql5oZ5vI0QcCG
6MiKVixhHTOpkug2QE4DGClYoFSUPuDEB/w6D7YciNn0quDHFZbI3XEMXYtLobHN
0J9cMv9Vpy5pBVMG/LJOw9pFAJRdhSx/cHU2DW9vUiRG9dO9zuxFzBtUciWLOPAX
tSQfDumeUV30BsTP5ldi9kaIUJBM9oBD4JhES0JHx6ePBvy+9vCRmHotugzrrGx6
N+AbCmwUwxnK29PF9i7KMfex6T8l1uQG3fwWVazHoswsqbFEQyF6NpaSTYoZkjM9
6gaXEE1FQ7tRhuio4bBDos0lLu6iGesveP71p/HpULleq2sbH2ER8TpzR5iSnQA=
=zAJS
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM fixes from Paolo Bonzini:
"(Relatively) a lot of reverts, mostly.
Bugs have trickled in for a new feature in 4.2 (MTRR support in
guests) so I'm reverting it all; let's not make this -rc period busier
for KVM than it's been so far. This covers the four reverts from me.
The fifth patch is being reverted because Radim found a bug in the
implementation of stable scheduler clock, *but* also managed to
implement the feature entirely without hypervisor support. So instead
of fixing the hypervisor side we can remove it completely; 4.4 will
get the new implementation"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
Use WARN_ON_ONCE for missing X86_FEATURE_NRIPS
Update KVM homepage Url
Revert "KVM: SVM: use NPT page attributes"
Revert "KVM: svm: handle KVM_X86_QUIRK_CD_NW_CLEARED in svm_get_mt_mask"
Revert "KVM: SVM: Sync g_pat with guest-written PAT value"
Revert "KVM: x86: apply guest MTRR virtualization on host reserved pages"
Revert "KVM: x86: zero kvmclock_offset when vcpu0 initializes kvmclock system MSR"
- Fixes for mlx5 related issues
- Fixes for ipoib multicast handling
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJWCfALAAoJELgmozMOVy/dc+MQAKoD6echYpTkWE0otMuHQcYf
zMaVVots+JdRKpA6OqHYQHgKGA80z21BpnjGYwcwB5zB1zPrJwz4vxwGlOBHt01T
xLBReFgSKyJlgOWLXKfPx4bXUdivOBKm203wY0dh+/dC/VROGYoiXYTmSDsfsuKa
8OXT1kWgzRVLtqwqj5GSkgWvtFZ28CjKh6d9egjqcj9tpbh2UupQDZzMyOtZ52X6
Nz/Vo3u4T7qjzlhHOlCwHCDw+97x0yvmvLY1mWweGPfKOnxtXjkzQmTQEpyzU5Mo
EwcqJucrBnmjbLAIBMrbR1mzTUQeD4dHz1jx+EzWE0lVnRL3twe1UaY40176sNlm
aCBA4bIOQ242r3IJ++ss15ol1k5hu7PYKRn9Q8d2sSbQGcSnCHe/YOutQQ+FTEFG
yE9xiLL+pgT8koauROnxg66E3HDM78NGTpjP3EuG4r2Qwa1iFANPfDB6kikuv8bO
rG3qUJcloEPvfatZY+h5QC4UCoB0/W1DAhlfzE3tPBYPmhSEgQDfEOzXTKDakeF0
VB903bYrOL3CVOun4I7fLrDc1leVeiAUKqO2orZs3qIpRWvAKyV/VjolAusMv2+F
/4xPyh95AEMTFfmZogOCofQFk3eOnkWpLdrVTYCKy3i6NVBoy2wHldrl+LuCAN/m
r/DNRBmazShashbeU6wg
=8+cX
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma
Pull rdma fixes from Doug Ledford:
- Fixes for mlx5 related issues
- Fixes for ipoib multicast handling
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma:
IB/ipoib: increase the max mcast backlog queue
IB/ipoib: Make sendonly multicast joins create the mcast group
IB/ipoib: Expire sendonly multicast joins
IB/mlx5: Remove pa_lkey usages
IB/mlx5: Remove support for IB_DEVICE_LOCAL_DMA_LKEY
IB/iser: Add module parameter for always register memory
xprtrdma: Replace global lkey with lkey local to PD
The cpu feature flags are not ever going to change, so warning
everytime can cause a lot of kernel log spam
(in our case more than 10GB/hour).
The warning seems to only occur when nested virtualization is
enabled, so it's probably triggered by a KVM bug. This is a
sensible and safe change anyway, and the KVM bug fix might not
be suitable for stable releases anyway.
Cc: stable@vger.kernel.org
Signed-off-by: Dirk Mueller <dmueller@suse.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The old one appears to be a generic catch all page, which
is unhelpful.
Signed-off-by: Dirk Mueller <dmueller@suse.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
and UBIFS.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABAgAGBQJWCmxzAAoJEEtJtSqsAOnWvfUP/R4NXpQmTJvmKfPaHJxuKMO3
uzEZET8qoc54OVN/GvvPFPRsZhZ5C6a1apWiCg77/WuDm9HHHEYrJVMYcOwqkPU1
5eqXSYdsvS7MjuSJS1fW4zIG+/HYaTXGJ/3bdP0vogtjzaKIBksKBmMTRNOAL8b8
2R6htwkVTMJdOUq6/xQuxG7FzT5m6wPEqUENfqGB3livbiqvU7OTud8I6yvcfD1M
tN02BuUduFgBR/4TwMQSbLzWH0T+XG74t79J5s7sBJwe5/dEeTUXV0HfcPEuG/9+
8TBDeoaxz+m9bvQYROPSRlkAIkh9TPsxTeKTdBDN67/CB2y5P06rz+Kta7ygNSTD
Dn/fZ0I2JhQOtz2EiXvK9N36aHbZAltUFpFp0KNf8GUUM9vNMDY3sjeGQidAwxMc
/qVtu+Syk5+HMz8hQCWpdIbqk3ahZsOvTADwedMn+vxxri6IaQqcnBWmIRy7rffq
prYxJx0VTVbLua5WXCOJILQCGEELqsnUKlnCm6LtznBUpff0Wmj6KsXmmXLs/X7X
NoztNx9FfhHQkWIIx92vu2cbC76LvsCXSuAfwC7k3KyW1hA9uWkc39Hs7yO5UcBp
lQZwsIZTe7qSuVt8lVC5omTeIgQiSc/Gte3WFEtNXNo2uq1VJa717NH6qwNOPayy
/L6on4YEUleHKrvJFjcd
=j/qn
-----END PGP SIGNATURE-----
Merge tag 'upstream-4.3-rc4' of git://git.infradead.org/linux-ubifs
Pull UBI/UBIFS fixes from Richard Weinberger:
"This contains three bug fixes for both UBI and UBIFS"
* tag 'upstream-4.3-rc4' of git://git.infradead.org/linux-ubifs:
UBI: return ENOSPC if no enough space available
UBI: Validate data_size
UBIFS: Kill unneeded locking in ubifs_init_security
Pull key signing fixes from James Morris:
"Keyrings and modsign fixes from David Howells"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
MODSIGN: Change from CMS to PKCS#7 signing if the openssl is too old
X.509: Don't strip leading 00's from key ID when constructing key description
KEYS: Remove unnecessary header #inclusions from extract-cert.c
KEYS: Fix race between key destruction and finding a keyring by name
This reverts commit 9abc378c66e3d6f437eed77c1c534cbc183523f7
("ieee802154: 6lowpan: change datagram var types").
The reason is that I forgot the IPv6 fragmentation here. Our MTU of
lowpan interface is 1280 and skb->len should not above of that. If we
reach a payload above 1280 in IPv6 header then we have a IPv6
fragmentation above 802.15.4 6LoWPAN fragmentation. The type "u16" was
fine, instead I added now a WARN_ON_ONCE if skb->len is above MTU which
should never happen otherwise IPv6 on minimum MTU size is broken.
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This reverts commit 3c2e7f7de3.
Initializing the mapping from MTRR to PAT values was reported to
fail nondeterministically, and it also caused extremely slow boot
(due to caching getting disabled---bug 103321) with assigned devices.
Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Reported-by: Sebastian Schuette <dracon@ewetel.net>
Cc: stable@vger.kernel.org # 4.2+
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This reverts commit 5492830370.
It builds on the commit that is being reverted next.
Cc: stable@vger.kernel.org # 4.2+
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This reverts commit e098223b78,
which has a dependency on other commits being reverted.
Cc: stable@vger.kernel.org # 4.2+
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This reverts commit fd717f1101.
It was reported to cause Machine Check Exceptions (bug 104091).
Reported-by: harn-solo@gmx.de
Cc: stable@vger.kernel.org # 4.2+
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This device has always ACPI companion because driver supports only ACPI
enumeration. Therefore there is no need to test it in bcm_acpi_probe() and
we can pass it directly to acpi_dev_get_resources() (which will return
-EINVAL in case of NULL argument is passed).
Signed-off-by: Jarkko Nikula <jarkko.nikula@linux.intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Tree wide grep for "hci_bcm" doesn't reveal there is any code registering
this platform device and "struct acpi_device_id" use for passing the
platform data looks a debug/test code leftover to me.
I'm assuming this driver effectively supports only ACPI enumeration and
thus test for ACPI_HANDLE() and platform data can be removed.
Signed-off-by: Jarkko Nikula <jarkko.nikula@linux.intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
There is no need to call acpi_match_device() in driver's probe path and
verify does it find a match to given ACPI _HIDs in .acpi_match_table as
driver/platform/acpi core code has found the match prior calling the probe.
Signed-off-by: Jarkko Nikula <jarkko.nikula@linux.intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Driver doesn't handle possible error from acpi_dev_get_resources(). Test it
and return the error code in case of error.
Signed-off-by: Jarkko Nikula <jarkko.nikula@linux.intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Caller of acpi_dev_get_resources() should free the constructed resource
list by calling the acpi_dev_free_resource_list() in order to avoid memory
leak.
Signed-off-by: Jarkko Nikula <jarkko.nikula@linux.intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
There is some unneeded code in "hci_intel" probing. First
acpi_match_device() call is needless as driver/platform/acpi core code has
already done the matching before calling the probe and the driver does not
use the returned pointer to matching _HID other than checking is it NULL.
Then tree wide grep for "hci_intel" doesn't reveal that there is any code
registering this platform device so it looks this device is always backed
with ACPI companion so also ACPI_HANDLE() test can be removed.
Signed-off-by: Jarkko Nikula <jarkko.nikula@linux.intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
I arranged the code so that the compiler can remove the unecessary bits
in ip_vs_leave when CONFIG_SYSCTL is unset, and removed an explicit
CONFIG_SYSCTL.
Unfortunately when rebasing my work on top of that of Alex Gartrell I
missed the fact that the newly added function ip_vs_addr_is_unicast was
surrounded by CONFIG_SYSCTL.
So remove the now unnecessary CONFIG_SYSCTL guards around
ip_vs_addr_is_unicast. It is causing build failures today when
CONFIG_SYSCTL is not selected and any self respecting compiler will
notice that sysctl_cache_bypass is always false without CONFIG_SYSCTL
and not include the logic from the function ip_vs_addr_is_unicast in
the compiled code.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
Pull RCU fixes from Ingo Molnar:
"Two RCU fixes:
- work around bug with recent GCC versions.
- fix false positive lockdep splat"
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
rcu: Suppress lockdep false positive for rcp->exp_funnel_mutex
rcu: Change _wait_rcu_gp() to work around GCC bug 67055
As reported by Dmitry Vyukov, we really shouldn't do ipc_addid() before
having initialized the IPC object state. Yes, we initialize the IPC
object in a locked state, but with all the lockless RCU lookup work,
that IPC object lock no longer means that the state cannot be seen.
We already did this for the IPC semaphore code (see commit e8577d1f03:
"ipc/sem.c: fully initialize sem_array before making it visible") but we
clearly forgot about msg and shm.
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
One global lock protecting hash-tables with 1024 buckets isn't
efficient and it shows up in a massive systems with truck
loads of RDS sockets serving multiple databases. The
perf data clearly highlights the contention on the rw
lock in these massive workloads.
When the contention gets worse, the code gets into a state where
it decides to back off on the lock. So while it has disabled interrupts,
it sits and backs off on this lock get. This causes the system to
become sluggish and eventually all sorts of bad things happen.
The simple fix is to move the lock into the hash bucket and
use per-bucket lock to improve the scalability.
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
One need to take rds socket reference while using it and release it
once done with it. rds_add_bind() code path does not do that so
lets fix it.
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
RDS bind and release locking scheme is very inefficient. It
uses RCU for maintaining the bind hash-table which is great but
it also needs to hold spinlock for [add/remove]_bound(). So
overall usecase, the hash-table concurrent speedup doesn't pay off.
In fact blocking nature of synchronize_rcu() makes the RDS
socket shutdown too slow which hurts RDS performance since
connection shutdown and re-connect happens quite often to
maintain the RC part of the protocol.
So we make the locking scheme simpler and more efficient by
replacing spin_locks with reader/writer locks and getting rid
off rcu for bind hash-table.
In subsequent patch, we also covert the global lock with per-bucket
lock to reduce the global lock contention.
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
synchronize_rcu() slowing down un-necessarily the socket shutdown
path. It is used just kfree() the ip addresses in rds_ib_remove_ipaddr()
which is perfect usecase for kfree_rcu();
So lets use that to gain some speedup.
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
This patch fixes checkpatch warnings:
- Comparison to NULL could be re-written
- no space required after a cast
Signed-off-by: Prasanna Karthik <mkarthi3@visteon.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
When get a CRC error, start the mmc_retune, it will issue CMD19/CMD21
to do tune, assume there were 10 clock phase need to try, phase 0 to
phase 6 is ok, phase 7 to phase 9 is NG, we try it from 0 to 9, so
the last CMD19/CMD21 will get CRC error, host->need_retune was set and
cause mmc_retune was called, then dead loop of mmc_retune
Signed-off-by: Chaotian Jing <chaotian.jing@mediatek.com>
Acked-by: Adrian Hunter <adrian.hunter@intel.com>
Fixes: bd11e8bd03 ("mmc: core: Flag re-tuning is needed on CRC errors")
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Sparse found some issues with 32 bit compilation, which probably should
at least work without warning. Not only that, but the code was wrong.
Thanks sparse!!
And thanks to the kbuild robot zero day testing for finding this issue.
$ make ARCH=i386 M=drivers/net/ethernet/intel/i40e C=2 CF="-D__CHECK_ENDIAN__"
CHECK drivers/net/ethernet/intel/i40e/i40e_main.c
include/linux/etherdevice.h:79:32: warning: restricted __be16 degrades to integer
drivers/net/ethernet/intel/i40e/i40e_main.c:7565:17: warning: shift too big (32) for type unsigned long
drivers/net/ethernet/intel/i40e/i40e_main.c:7565:17: warning: shift too big (42) for type unsigned long
drivers/net/ethernet/intel/i40e/i40e_main.c:7565:17: warning: shift too big (39) for type unsigned long
drivers/net/ethernet/intel/i40e/i40e_main.c:7565:17: warning: shift too big (40) for type unsigned long
CC: kbuild-all@01.org
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
The 0day build infrastructure found some issues in i40e, this
removes the warnings by adding a harmless cast to a dev_info.
CC: kbuild-all@01.org
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
This patch tweaks the init timing of the driver just a little bit to
increase stability on load/unload and SR-IOV enable/disable cycles.
First, run the init_task loop a little quicker in order to reduce
overall init time.
Second, stagger the start of the init task based on the device's
PCIe function ID. This lessens the impact on the firmware when a
whole bunch of VFs are initialized simultaneously, e.g. enabling
SR-IOV without the VF driver blacklisted. For single VFs assigned
to VMs this will have no effect as the function ID will always be 0.
Signed-off-by: Mitch Williams <mitch.a.williams@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Down was requesting queue disables, but then exited immediately without
waiting for the queues to actually disable. This could allow any
function called after i40evf_down to run immediately, including
i40evf_up, and causes a memory leak.
This issue has been fixed in a recent refactor of the reset code, but
add a couple WARN_ONs in the slow path to help us recognize if we
reintroduce this issue or if we missed any cases.
Change-ID: I27b6b5c9a79c1892f0ba453129f116bc32647dd0
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Mitch Williams <mitch.a.williams@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
The interrupt enable function was always making the caller add
the base_vector from the VSI struct which is already passed to
the function. Just collapse the math into the helper function.
Change-ID: I54ef33aa7ceebc3231c3cc48f7b39fd0c3ff5806
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Due to performance reasons, VEB stats have been disabled in the hw. This
patch adds code to check for that condition before accumulating these
stats.
Change-ID: I7d805669476fedabb073790403703798ae5d878e
Signed-off-by: Anjali Singhai Jain <anjali.singhai@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>