Commit graph

601186 commits

Author SHA1 Message Date
Suravee Suthikulpanit
3bbf3565f4 svm: Do not intercept CR8 when enable AVIC
When enable AVIC:
    * Do not intercept CR8 since this should be handled by AVIC HW.
    * Also, we don't need to sync cr8/V_TPR and APIC backing page.

Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
[Rename svm_in_nested_interrupt_shadow to svm_nested_virtualize_tpr. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-05-18 18:04:31 +02:00
Suravee Suthikulpanit
46781eae2e svm: Do not expose x2APIC when enable AVIC
Since AVIC only virtualizes xAPIC hardware for the guest, this patch
disable x2APIC support in guest CPUID.

Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-05-18 18:04:30 +02:00
Suravee Suthikulpanit
be8ca170ed KVM: x86: Introducing kvm_x86_ops.apicv_post_state_restore
Adding kvm_x86_ops hooks to allow APICv to do post state restore.
This is required to support VM save and restore feature.

Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-05-18 18:04:30 +02:00
Suravee Suthikulpanit
18f40c53e1 svm: Add VMEXIT handlers for AVIC
This patch introduces VMEXIT handlers, avic_incomplete_ipi_interception()
and avic_unaccelerated_access_interception() along with two trace points
(trace_kvm_avic_incomplete_ipi and trace_kvm_avic_unaccelerated_access).

Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-05-18 18:04:29 +02:00
Suravee Suthikulpanit
340d3bc366 svm: Add interrupt injection via AVIC
This patch introduces a new mechanism to inject interrupt using AVIC.
Since VINTR is not supported when enable AVIC, we need to inject
interrupt via APIC backing page instead.

This patch also adds support for AVIC doorbell, which is used by
KVM to signal a running vcpu to check IRR for injected interrupts.

Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-05-18 18:04:29 +02:00
Suravee Suthikulpanit
44a95dae1d KVM: x86: Detect and Initialize AVIC support
This patch introduces AVIC-related data structure, and AVIC
initialization code.

There are three main data structures for AVIC:
    * Virtual APIC (vAPIC) backing page (per-VCPU)
    * Physical APIC ID table (per-VM)
    * Logical APIC ID table (per-VM)

Currently, AVIC is disabled by default. Users can manually
enable AVIC via kernel boot option kvm-amd.avic=1 or during
kvm-amd module loading with parameter avic=1.

Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
[Avoid extra indentation (Boris). - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-05-18 18:04:28 +02:00
Suravee Suthikulpanit
3d5615e534 svm: Introduce new AVIC VMCB registers
Introduce new AVIC VMCB registers.

Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-05-18 18:04:28 +02:00
Radim Krčmář
dd1a4cc1fb KVM: split kvm_vcpu_wake_up from kvm_vcpu_kick
AVIC has a use for kvm_vcpu_wake_up.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Tested-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-05-18 18:04:27 +02:00
Suravee Suthikulpanit
d1ed092f77 KVM: x86: Introducing kvm_x86_ops VCPU blocking/unblocking hooks
Adding new function pointer in struct kvm_x86_ops, and calling them
from the kvm_arch_vcpu[blocking/unblocking].

Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-05-18 18:04:27 +02:00
Suravee Suthikulpanit
03543133ce KVM: x86: Introducing kvm_x86_ops VM init/destroy hooks
Adding function pointers in struct kvm_x86_ops for processor-specific
layer to provide hooks for when KVM initialize and destroy VM.

Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-05-18 18:04:26 +02:00
Suravee Suthikulpanit
dfb9595429 KVM: x86: Rename kvm_apic_get_reg to kvm_lapic_get_reg
Rename kvm_apic_get_reg to kvm_lapic_get_reg to be consistent with
the existing kvm_lapic_set_reg counterpart.

Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-05-18 18:04:25 +02:00
Suravee Suthikulpanit
1e6e2755b6 KVM: x86: Misc LAPIC changes to expose helper functions
Exporting LAPIC utility functions and macros for re-use in SVM code.

Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-05-18 18:04:25 +02:00
Christian Borntraeger
2086d3200d KVM: shrink halt polling even more for invalid wakeups
commit 3491caf275 ("KVM: halt_polling: provide a way to qualify
 wakeups during poll") added more aggressive shrinking of the
polling interval if the wakeup did not match some criteria. This
still allows to keep polling enabled if the polling time was
smaller that the current max poll time (block_ns <= vcpu->halt_poll_ns).
Performance measurement shows that even more aggressive shrinking
(shrink polling on any invalid wakeup) reduces absolute and relative
(to the workload) CPU usage even further.

Cc: David Matlack <dmatlack@google.com>
Cc: Wanpeng Li <kernellwp@gmail.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-05-18 18:04:22 +02:00
Jon Hunter
280dc0e145 drm/tegra: Fix crash caused by reference count imbalance
Commit d2307dea14 ("drm/atomic: use connector references (v3)") added
reference counting for DRM connectors and this caused a crash when
exercising system suspend on Tegra114 Dalmore.

The Tegra DSI driver implements a Tegra specific function,
tegra_dsi_connector_duplicate_state(), to duplicate the connector state
and destroys the state using the generic helper function,
drm_atomic_helper_connector_destroy_state(). Following commit
d2307dea14 ("drm/atomic: use connector references (v3)") there is
now an imbalance in the connector reference count because the Tegra
function to duplicate state does not take a reference when duplicating
the state information. However, the generic helper function to destroy
the state information assumes a reference has been taken and during
system suspend, when the connector state is destroyed, this leads to a
crash because we attempt to put the reference for an object that has
already been freed.

Fix this by calling __drm_atomic_helper_connector_duplicate_state() from
tegra_dsi_connector_duplicate_state() to ensure that we take a reference
on a connector if crtc is set. Note that this will also copy the
connector state a 2nd time, but this should be harmless.

By fixing tegra_dsi_connector_duplicate_state() to take a reference,
although a crash was no longer seen, it was then observed that after
each system suspend-resume cycle, the reference would be one greater
than before the suspend-resume cycle. Following commit d2307dea14
("drm/atomic: use connector references (v3)"), it was found that we
also need to put the reference when calling the function
tegra_dsi_connector_reset() before freeing the state. Fix this by
updating tegra_dsi_connector_reset() to call the function
__drm_atomic_helper_connector_destroy_state() in order to put the
reference for the connector.

Fixes: d2307dea14 ("drm/atomic: use connector references (v3)")

Signed-off-by: Jon Hunter <jonathanh@nvidia.com>
Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Acked-by: Thierry Reding <treding@nvidia.com>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Link: http://patchwork.freedesktop.org/patch/msgid/1463585856-16606-1-git-send-email-jonathanh@nvidia.com
2016-05-18 18:03:33 +02:00
Matan Barak
c16d2750a0 IB/mlx5: Fire the CQ completion handler from tasklet
Previously, mlx5_ib_cq_comp was executed from interrupt context.
Under heavy load, this could cause the CPU core to be in an interrupt
context too long.
Instead of executing the handler from the interrupt context we
execute it from a much friendly tasklet context.

Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-05-18 10:45:49 -04:00
Matan Barak
94c6825e0f net/mlx5_core: Use tasklet for user-space CQ completion events
Previously, we've fired all our completion callbacks straight from
our ISR.

Some of those callbacks were lightweight (for example, mlx5 Ethernet
napi callbacks), but some of them did more work (for example,
the user-space RDMA stack uverbs' completion handler). Besides that,
doing more than the minimal work in ISR is generally considered wrong,
it could even lead to a hard lockup of the system. Since when a lot
of completion events are generated by the hardware, the loop over
those events could be so long, that we'll get into a hard lockup by
the system watchdog.

In order to avoid that, add a new way of invoking completion events
callbacks. In the interrupt itself, we add the CQs which receive
completion event to a per-EQ list and schedule a tasklet. In the
tasklet context we loop over all the CQs in the list and invoke the
user callback.

Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-05-18 10:45:49 -04:00
Takashi Sakamoto
17e1717c11 ALSA: firewire-lib: change a member of event structure to suppress sparse wanings to bool type
Commit a9c4284bf5 ("ALSA: firewire-lib: add context information to
tracepoints") adds new members to tracepoint events of this module, to
represent context information. One of the members is bool type and
this causes sparse warnings.

16:1: warning: expression using sizeof bool
60:1: warning: expression using sizeof bool
16:1: warning: odd constant _Bool cast (ffffffffffffffff becomes 1)
60:1: warning: odd constant _Bool cast (ffffffffffffffff becomes 1)

This commit suppresses the warnings, by changing type of the member
to 'unsigned int'. Additionally, this commit applies '!!' idiom to
get 0/1 from 'in_interrupt()'.

Fixes: a9c4284bf5 ("ALSA: firewire-lib: add context information to tracepoints")
Signed-off-by: Takashi Sakamoto <o-takashi@sakamocchi.jp>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
2016-05-18 16:32:09 +02:00
Christoph Lameter
e3b6d8cf8d IB/core: Do not require CAP_NET_ADMIN for packet sniffing
In the Ethernet/TCP world, CAP_NET_RAW is sufficient to allow a program
to listen to all incoming packets on a specific interface, and the
higher CAP_NET_ADMIN is required to set the interface into promiscuous
mode.  We want to emulate that same basic division of privilege in the
RDMA stack, so when dealing with Raw Ethernet QPs, allow apps with
CAP_NET_RAW to listen to all incoming flows (and direct them as they see
fit in their own listen stream).  Do not require CAP_NET_ADMIN just to
listen to traffic already incoming.  Reserve CAP_NET_ADMIN if we attempt
to set promiscuous mode.

Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-05-18 10:31:58 -04:00
shamir rabinovitch
04ef0f1a01 IB/mlx4: Fix unaligned access in send_reply_to_slave
The problem is that the function 'send_reply_to_slave' gets the
'req_sa_mad' as a pointer whose address is only aliged to 4 bytes
but is 8 bytes in size.  This can result in unaligned access faults
on certain architectures.

Sowmini Varadhan pointed to this reply from Dave Miller that say
that memcpy should not be used to solve alignment issues:
https://lkml.org/lkml/2015/10/21/352

Optimization of memcpy to 'ldx' instruction can only happen if the
compiler knows that the size of the data we are copying is 8 bytes
and it assumes it is aligned to 8 bytes. If the compiler know the
type is not aligned to 8 it must not optimize the 8 byte copy.
Defining the data type as aligned to 4 forces the compiler to treat
all accesses as though they aren't aligned and avoids the 'ldx'
optimization.

Full credit for the idea goes to Jason Gunthorpe
<jgunthorpe@obsidianresearch.com>.

Signed-off-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-05-18 10:25:56 -04:00
Jiri Olsa
ef3f00a4d3 perf/x86/intel/uncore: Remove WARN_ON_ONCE in uncore_pci_probe
When booting with nr_cpus=1, uncore_pci_probe tries to init the PCI/uncore
also for the other packages and fails with warning when they are not found.

The warning is bogus because it's correct to fail here for packages which are
not initialized. Remove it and return silently.

Fixes: cf6d445f68 "perf/x86/uncore: Track packages, not per CPU data"
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Cc: stable@vger.kernel.org
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-05-18 16:17:25 +02:00
Daniel Vetter
fdf2c85f26 drm: Fix error handling in drm_connector_register
When debugfs or sysfs registration failed, we failed to clean up the
idr registration. Reorder to fix this.

Cc: Dave Airlie <airlied@gmail.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Link: http://patchwork.freedesktop.org/patch/msgid/1462539302-27764-1-git-send-email-daniel.vetter@ffwll.ch
2016-05-18 12:24:51 +02:00
Chris Wilson
e2d800a3ce drm: Avoid connector reference imbalance on error path
Whilst looking at the fallout from using connector references for
atomic, I noticed that there is an early return buried in
drm_atomic_set_crtc_for_connector() that if hit could cause us to leak a
reference on the connector.

Fixes: d2307dea14 (drm/atomic: use connector references (v3))
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Daniel Stone <daniels@collabora.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dave Airlie <airlied@redhat.com>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Link: http://patchwork.freedesktop.org/patch/msgid/1462535265-13058-1-git-send-email-chris@chris-wilson.co.uk
2016-05-18 10:42:01 +02:00
Zhang Rui
88ac99063e Merge branches 'thermal-core', 'thermal-intel' and 'thermal-soc' into next 2016-05-18 15:37:08 +08:00
Chen Feng
b52207ef4e mfd: hi655x: Add MFD driver for hi655x
Add PMIC MFD driver to support hisilicon hi665x.

Signed-off-by: Chen Feng <puck.chen@hisilicon.com>
Signed-off-by: Fei Wang <w.f@huawei.com>
Signed-off-by: Xinwei Kong <kong.kongxinwei@hisilicon.com>
Reviewed-by: Haojian Zhuang <haojian.zhuang@linaro.org>
Signed-off-by: Lee Jones <lee.jones@linaro.org>
2016-05-18 08:25:26 +01:00
Alexey Brodkin
776d7f1694 arc: axs103_smp: Fix CPU frequency to 100MHz for dual-core
The most recent release of AXS103 [v1.1] is proven to work
at 100 MHz in dual-core mode so this change uses mentioned feature.
For that we:
 * Update axc003_idu.dtsi with mention of really-used CPU clock freq
 * Remove clock override in AXS platform code for dual-core HW

Note we're still leaving a hack for clock "downgrade" on early boot
for quad-core hardware.

Also note this change will break functionality of AXS103 v1.0 hardware.
That means all users of AXS103 __must__ upgrade their boards with the
most recent firmware.

Signed-off-by: Alexey Brodkin <abrodkin@synopsys.com>
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
2016-05-18 10:50:18 +05:30
Dave Chinner
ad438c4038 xfs: move reclaim tagging functions
Rearrange the inode tagging functions so that they are higher up in
xfs_cache.c and so there is no need for forward prototypes to be
defined. This is purely code movement, no other change.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
2016-05-18 14:20:08 +10:00
Dave Chinner
545c0889d2 xfs: simplify inode reclaim tagging interfaces
Inode radix tree tagging for reclaim passes a lot of unnecessary
variables around. Over time the xfs-perag has grown a xfs_mount
backpointer, and an internal agno so we don't need to pass other
variables into the tagging functions to supply this information.

Rework the functions to pass the minimal variable set required
and simplify the internal logic and flow.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-05-18 14:11:41 +10:00
Dave Chinner
194293631d xfs: rename variables in xfs_iflush_cluster for clarity
The cluster inode variable uses unconventional naming - iq - which
makes it hard to distinguish it between the inode passed into the
function - ip - and that is a vector for mistakes to be made.
Rename all the cluster inode variables to use a more conventional
prefixes to reduce potential future confusion (cilist, cilist_size,
cip).

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-05-18 14:09:46 +10:00
Dave Chinner
5a90e53e81 xfs: xfs_iflush_cluster has range issues
xfs_iflush_cluster() does a gang lookup on the radix tree, meaning
it can find inodes beyond the current cluster if there is sparse
cache population. gang lookups return results in ascending index
order, so stop trying to cluster inodes once the first inode outside
the cluster mask is detected.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-05-18 14:09:13 +10:00
Dave Chinner
8a17d7dded xfs: mark reclaimed inodes invalid earlier
The last thing we do before using call_rcu() on an xfs_inode to be
freed is mark it as invalid. This means there is a window between
when we know for certain that the inode is going to be freed and
when we do actually mark it as "freed".

This is important in the context of RCU lookups - we can look up the
inode, find that it is valid, and then use it as such not realising
that it is in the final stages of being freed.

As such, mark the inode as being invalid the moment we know it is
going to be reclaimed. This can be done while we still hold the
XFS_ILOCK_EXCL and the flush lock in xfs_inode_reclaim, meaning that
it occurs well before we remove it from the radix tree, and that
the i_flags_lock, the XFS_ILOCK and the inode flush lock all act as
synchronisation points for detecting that an inode is about to go
away.

For defensive purposes, this allows us to add a further check to
xfs_iflush_cluster to ensure we skip inodes that are being freed
after we grab the XFS_ILOCK_SHARED and the flush lock - we know that
if the inode number if valid while we have these locks held we know
that it has not progressed through reclaim to the point where it is
clean and is about to be freed.

[bfoster: fixed __xfs_inode_clear_reclaim() using ip->i_ino after it
	  had already been zeroed.]

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-05-18 14:09:12 +10:00
Dave Chinner
1f2dcfe89e xfs: xfs_inode_free() isn't RCU safe
The xfs_inode freed in xfs_inode_free() has multiple allocated
structures attached to it. We free these in xfs_inode_free() before
we mark the inode as invalid, and before we run call_rcu() to queue
the structure for freeing.

Unfortunately, this freeing can race with other accesses that are in
the RCU current grace period that have found the inode in the radix
tree with a valid state.  This includes xfs_iflush_cluster(), which
calls xfs_inode_clean(), and that accesses the inode log item on the
xfs_inode.

The log item structure is freed in xfs_inode_free(), so there is the
possibility we can be accessing freed memory in xfs_iflush_cluster()
after validating the xfs_inode structure as being valid for this RCU
context. Hence we can get spuriously incorrect clean state returned
from such checks. This can lead to use thinking the inode is dirty
when it is, in fact, clean, and so incorrectly attaching it to the
buffer for IO and completion processing.

This then leads to use-after-free situations on the xfs_inode itself
if the IO completes after the current RCU grace period expires. The
buffer callbacks will access the xfs_inode and try to do all sorts
of things it shouldn't with freed memory.

IOWs, xfs_iflush_cluster() only works correctly when racing with
inode reclaim if the inode log item is present and correctly stating
the inode is clean. If the inode is being freed, then reclaim has
already made sure the inode is clean, and hence xfs_iflush_cluster
can skip it. However, we are accessing the inode inode under RCU
read lock protection and so also must ensure that all dynamically
allocated memory we reference in this context is not freed until the
RCU grace period expires.

To fix this, move all the potential memory freeing into
xfs_inode_free_callback() so that we are guarantee RCU protected
lookup code will always have the memory structures it needs
available during the RCU grace period that lookup races can occur
in.

Discovered-by: Brain Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-05-18 14:01:53 +10:00
Alex Lyakas
32b43ab6fb xfs: optimise xfs_iext_destroy
When unmounting XFS, we call:

xfs_inode_free => xfs_idestroy_fork => xfs_iext_destroy

This goes over the whole indirection array and calls
xfs_iext_irec_remove for each one of the erps (from the last one to
the first one). As a result, we keep shrinking (reallocating
actually) the indirection array until we shrink out all of its
elements. When we have files with huge numbers of extents, umount
takes 30-80 sec, depending on the amount of files that XFS loaded
and the amount of indirection entries of each file. The unmount
stack looks like:

[<ffffffffc0b6d200>] xfs_iext_realloc_indirect+0x40/0x60 [xfs]
[<ffffffffc0b6cd8e>] xfs_iext_irec_remove+0xee/0xf0 [xfs]
[<ffffffffc0b6cdcd>] xfs_iext_destroy+0x3d/0xb0 [xfs]
[<ffffffffc0b6cef6>] xfs_idestroy_fork+0xb6/0xf0 [xfs]
[<ffffffffc0b87002>] xfs_inode_free+0xb2/0xc0 [xfs]
[<ffffffffc0b87260>] xfs_reclaim_inode+0x250/0x340 [xfs]
[<ffffffffc0b87583>] xfs_reclaim_inodes_ag+0x233/0x370 [xfs]
[<ffffffffc0b8823d>] xfs_reclaim_inodes+0x1d/0x20 [xfs]
[<ffffffffc0b96feb>] xfs_unmountfs+0x7b/0x1a0 [xfs]
[<ffffffffc0b98e4d>] xfs_fs_put_super+0x2d/0x70 [xfs]
[<ffffffff811e9e36>] generic_shutdown_super+0x76/0x100
[<ffffffff811ea207>] kill_block_super+0x27/0x70
[<ffffffff811ea519>] deactivate_locked_super+0x49/0x60
[<ffffffff811eaaee>] deactivate_super+0x4e/0x70
[<ffffffff81207593>] cleanup_mnt+0x43/0x90
[<ffffffff81207632>] __cleanup_mnt+0x12/0x20
[<ffffffff8108f8e7>] task_work_run+0xa7/0xe0
[<ffffffff81014ff7>] do_notify_resume+0x97/0xb0
[<ffffffff81717c6f>] int_signal+0x12/0x17

Further, this reallocation prevents us from freeing the extent list
from a RCU callback as allocation can block. Hence if the extent
list is in indirect format, optimise the freeing of the extent list
to only use kmem_free calls by freeing entire extent buffer pages at
a time, rather than extent by extent.

[dchinner: simplified freeing loop based on Christoph's suggestion]

Signed-off-by: Alex Lyakas <alex@zadarastorage.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-05-18 14:01:52 +10:00
Dave Chinner
7d3aa7fe97 xfs: skip stale inodes in xfs_iflush_cluster
We don't write back stale inodes so we should skip them in
xfs_iflush_cluster, too.

cc: <stable@vger.kernel.org> # 3.10.x-
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-05-18 13:54:23 +10:00
Dave Chinner
51b07f30a7 xfs: fix inode validity check in xfs_iflush_cluster
Some careless idiot(*) wrote crap code in commit 1a3e8f3 ("xfs:
convert inode cache lookups to use RCU locking") back in late 2010,
and so xfs_iflush_cluster checks the wrong inode for whether it is
still valid under RCU protection. Fix it to lock and check the
correct inode.

(*) Careless-idiot: Dave Chinner <dchinner@redhat.com>

cc: <stable@vger.kernel.org> # 3.10.x-
Discovered-by: Brain Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-05-18 13:54:22 +10:00
Dave Chinner
b1438f4779 xfs: xfs_iflush_cluster fails to abort on error
When a failure due to an inode buffer occurs, the error handling
fails to abort the inode writeback correctly. This can result in the
inode being reclaimed whilst still in the AIL, leading to
use-after-free situations as well as filesystems that cannot be
unmounted as the inode log items left in the AIL never get removed.

Fix this by ensuring fatal errors from xfs_imap_to_bp() result in
the inode flush being aborted correctly.

cc: <stable@vger.kernel.org> # 3.10.x-
Reported-by: Shyam Kaushik <shyam@zadarastorage.com>
Diagnosed-by: Shyam Kaushik <shyam@zadarastorage.com>
Tested-by: Shyam Kaushik <shyam@zadarastorage.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-05-18 13:53:42 +10:00
Dave Chinner
8179c03629 xfs: remove xfs_fs_evict_inode()
Joe Lawrence reported a list_add corruption with 4.6-rc1 when
testing some custom md administration code that made it's own
block device nodes for the md array. The simple test loop of:

for i in {0..100}; do
	mknod --mode=0600 $tmp/tmp_node b $MAJOR $MINOR
	mdadm --detail --export $tmp/tmp_node > /dev/null
	rm -f $tmp/tmp_node
done


Would produce this warning in bd_acquire() when mdadm opened the
device node:

list_add double add: new=ffff88043831c7b8, prev=ffff8804380287d8, next=ffff88043831c7b8.

And then produce this from bd_forget from kdevtmpfs evicting a block
dev inode:

list_del corruption. prev->next should be ffff8800bb83eb10, but was ffff88043831c7b8

This is a regression caused by commit c19b3b05 ("xfs: mode di_mode
to vfs inode"). The issue is that xfs_inactive() frees the
unlinked inode, and the above commit meant that this freeing zeroed
the mode in the struct inode. The problem is that after evict() has
called ->evict_inode, it expects the i_mode to be intact so that it
can call bd_forget() or cd_forget() to drop the reference to the
block device inode attached to the XFS inode.

In reality, the only thing we do in xfs_fs_evict_inode() that is not
generic is call xfs_inactive(). We can move the xfs_inactive() call
to xfs_fs_destroy_inode() without any problems at all, and this
will leave the VFS inode intact until it is completely done with it.

So, remove xfs_fs_evict_inode(), and do the work it used to do in
->destroy_inode instead.

cc: <stable@vger.kernel.org> # 4.6
Reported-by: Joe Lawrence <joe.lawrence@stratus.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-05-18 13:52:42 +10:00
Marek Lindner
ebe24cea95 batman-adv: initialize ELP orig address on secondary interfaces
This fix prevents nodes to wrongly create a 00:00:00:00:00:00 originator
which can potentially interfere with the rest of the neighbor statistics.

Fixes: d6f94d91f7 ("batman-adv: ELP - adding basic infrastructure")
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
2016-05-18 11:49:44 +08:00
Linus Lüssing
e123705e58 batman-adv: Avoid duplicate neigh_node additions
Two parallel calls to batadv_neigh_node_new() might race for creating
and adding the same neig_node. Fix this by including the check for any
already existing, identical neigh_node within the spin-lock.

This fixes splats like the following:

[  739.535069] ------------[ cut here ]------------
[  739.535079] WARNING: CPU: 0 PID: 0 at /usr/src/batman-adv/git/batman-adv/net/batman-adv/bat_iv_ogm.c:1004 batadv_iv_ogm_process_per_outif+0xe3f/0xe60 [batman_adv]()
[  739.535092] too many matching neigh_nodes
[  739.535094] Modules linked in: dm_mod tun ip6table_filter ip6table_mangle ip6table_nat nf_nat_ipv6 ip6_tables xt_nat iptable_nat nf_nat_ipv4 nf_nat xt_TCPMSS xt_mark iptable_mangle xt_tcpudp xt_conntrack iptable_filter ip_tables x_tables ip_gre ip_tunnel gre bridge stp llc thermal_sys kvm_intel kvm crct10dif_pclmul crc32_pclmul sha256_ssse3 sha256_generic hmac drbg ansi_cprng aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd evdev pcspkr ip6_gre ip6_tunnel tunnel6 batman_adv(O) libcrc32c nf_conntrack_ipv6 nf_defrag_ipv6 nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack autofs4 ext4 crc16 mbcache jbd2 xen_netfront xen_blkfront crc32c_intel
[  739.535177] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G        W  O    4.2.0-0.bpo.1-amd64 #1 Debian 4.2.6-3~bpo8+2
[  739.535186]  0000000000000000 ffffffffa013b050 ffffffff81554521 ffff88007d003c18
[  739.535201]  ffffffff8106fa01 0000000000000000 ffff8800047a087a ffff880079c3a000
[  739.735602]  ffff88007b82bf40 ffff88007bc2d1c0 ffffffff8106fa7a ffffffffa013aa8e
[  739.735624] Call Trace:
[  739.735639]  <IRQ>  [<ffffffff81554521>] ? dump_stack+0x40/0x50
[  739.735677]  [<ffffffff8106fa01>] ? warn_slowpath_common+0x81/0xb0
[  739.735692]  [<ffffffff8106fa7a>] ? warn_slowpath_fmt+0x4a/0x50
[  739.735715]  [<ffffffffa012448f>] ? batadv_iv_ogm_process_per_outif+0xe3f/0xe60 [batman_adv]
[  739.735740]  [<ffffffffa0124813>] ? batadv_iv_ogm_receive+0x363/0x380 [batman_adv]
[  739.735762]  [<ffffffffa0124813>] ? batadv_iv_ogm_receive+0x363/0x380 [batman_adv]
[  739.735783]  [<ffffffff810b0841>] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20
[  739.735804]  [<ffffffffa012cb39>] ? batadv_batman_skb_recv+0xc9/0x110 [batman_adv]
[  739.735825]  [<ffffffff81464891>] ? __netif_receive_skb_core+0x841/0x9a0
[  739.735838]  [<ffffffff810b0841>] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20
[  739.735853]  [<ffffffff81465681>] ? process_backlog+0xa1/0x140
[  739.735864]  [<ffffffff81464f1a>] ? net_rx_action+0x20a/0x320
[  739.735878]  [<ffffffff81073aa7>] ? __do_softirq+0x107/0x270
[  739.735891]  [<ffffffff81073d82>] ? irq_exit+0x92/0xa0
[  739.735905]  [<ffffffff8137e0d1>] ? xen_evtchn_do_upcall+0x31/0x40
[  739.735924]  [<ffffffff8155b8fe>] ? xen_do_hypervisor_callback+0x1e/0x40
[  739.735939]  <EOI>  [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
[  739.735965]  [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
[  739.735979]  [<ffffffff8100a39c>] ? xen_safe_halt+0xc/0x20
[  739.735991]  [<ffffffff8101da6c>] ? default_idle+0x1c/0xa0
[  739.736004]  [<ffffffff810abf6b>] ? cpu_startup_entry+0x2eb/0x350
[  739.736019]  [<ffffffff81b2af5e>] ? start_kernel+0x480/0x48b
[  739.736032]  [<ffffffff81b2d116>] ? xen_start_kernel+0x507/0x511
[  739.736048] ---[ end trace c106bb901244bc8c ]---

Fixes: f987ed6ebd ("batman-adv: protect neighbor list with rcu locks")
Reported-by: Martin Weinelt <martin@darmstadt.freifunk.net>
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
2016-05-18 11:49:43 +08:00
Sven Eckelmann
d285f52cc0 batman-adv: Fix integer overflow in batadv_iv_ogm_calc_tq
The undefined behavior sanatizer detected an signed integer overflow in a
setup with near perfect link quality

    UBSAN: Undefined behaviour in net/batman-adv/bat_iv_ogm.c:1246:25
    signed integer overflow:
    8713350 * 255 cannot be represented in type 'int'

The problems happens because the calculation of mixed unsigned and signed
integers resulted in an integer multiplication.

      batadv_ogm_packet::tq (u8 255)
    * tq_own (u8 255)
    * tq_asym_penalty (int 134; max 255)
    * tq_iface_penalty (int 255; max 255)

The tq_iface_penalty, tq_asym_penalty and inv_asym_penalty can just be
changed to unsigned int because they are not expected to become negative.

Fixes: c039876892 ("batman-adv: add WiFi penalty")
Signed-off-by: Sven Eckelmann <sven.eckelmann@open-mesh.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
2016-05-18 11:49:42 +08:00
Antonio Quartulli
1653f61d65 batman-adv: make sure ELP/OGM orig MAC is updated on address change
When the MAC address of the primary interface is changed,
update the originator address in the ELP and OGM skb buffers as
well in order to reflect the change.

Fixes: d6f94d91f7 ("batman-adv: ELP - adding basic infrastructure")
Reported-by: Marek Lindner <marek@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
2016-05-18 11:49:41 +08:00
Sven Eckelmann
f7dcdf5fdb batman-adv: Fix unexpected free of bcast_own on add_if error
The function batadv_iv_ogm_orig_add_if allocates new buffers for bcast_own
and bcast_own_sum. It is expected that these buffers are unchanged in case
either bcast_own or bcast_own_sum couldn't be resized.

But the error handling of this function frees the already resized buffer
for bcast_own when the allocation of the new bcast_own_sum buffer failed.
This will lead to an invalid memory access when some code will try to
access bcast_own.

Instead the resized new bcast_own buffer has to be kept. This will not lead
to problems because the size of the buffer was only increased and therefore
no user of the buffer will try to access bytes outside of the new buffer.

Fixes: d0015fdd3d ("batman-adv: provide orig_node routing API")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
2016-05-18 11:49:40 +08:00
Sven Eckelmann
71f9d27daa batman-adv: Fix refcnt leak in batadv_v_neigh_*
The functions batadv_neigh_ifinfo_get increase the reference counter of the
batadv_neigh_ifinfo. These have to be reduced again when the reference is
not used anymore to correctly free the objects.

Fixes: 9786906022 ("batman-adv: B.A.T.M.A.N. V - implement neighbor comparison API calls")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
2016-05-18 11:49:40 +08:00
Sven Eckelmann
a45e932a3c batman-adv: Avoid nullptr derefence in batadv_v_neigh_is_sob
batadv_neigh_ifinfo_get can return NULL when it cannot find (even when only
temporarily) anymore the neigh_ifinfo in the list neigh->ifinfo_list. This
has to be checked to avoid kernel Oopses when the ifinfo is dereferenced.

This a situation which isn't expected but is already handled by functions
like batadv_v_neigh_cmp. The same kind of warning is therefore used before
the function returns without dereferencing the pointers.

Fixes: 9786906022 ("batman-adv: B.A.T.M.A.N. V - implement neighbor comparison API calls")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
2016-05-18 11:49:39 +08:00
Florian Westphal
63d443efe8 batman-adv: fix skb deref after free
batadv_send_skb_to_orig() calls dev_queue_xmit() so we can't use skb->len.

Fixes: 953324776d ("batman-adv: network coding - buffer unicast packets before forward")
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
2016-05-18 11:49:38 +08:00
James Bottomley
e7ca7f9fa2 Merge branch 'fixes' into misc 2016-05-17 21:12:50 -04:00
Carlos Maiolino
e6b3bb7896 xfs: add "fail at unmount" error handling configuration
If we take "retry forever" literally on metadata IO errors, we can
hang at unmount, once it retries those writes forever. This is the
default behavior, unfortunately.

Add an error configuration option for this behavior and default it
to "fail" so that an unmount will trigger actuall errors, a shutdown
and allow the unmount to succeed. It will be noisy, though, as it
will log the errors and shutdown that occurs.

To fix this, we need to mark the filesystem as being in the process
of unmounting. Do this with a mount flag that is added at the
appropriate time (i.e. before the blocking AIL sync). We also need
to add this flag if mount fails after the initial phase of log
recovery has been run.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-05-18 11:11:27 +10:00
Carlos Maiolino
e0a431b3a3 xfs: add configuration handlers for specific errors
now most of the infrastructure is in place, we can start adding
support for configuring specific errors such as ENODEV, ENOSPC, EIO,
etc. Add these error configurations and configure them all to have
appropriate behaviours. That is, all will be configured to retry
forever by default, except for ENODEV, which is an unrecoverable
error, so it will be configured to not retry on error

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-05-18 11:09:28 +10:00
Carlos Maiolino
a5ea70d25d xfs: add configuration of error failure speed
On reception of an error, we can fail immediately, perform some
bound amount of retries or retry indefinitely. The current behaviour
we have is to retry forever.

However, we'd like the ability to choose how long the filesystem
should try after an error, it can either fail immediately, retry a
few times, or retry forever. This is implemented by using
max_retries sysfs attribute, to hold the amount of times we allow
the filesystem to retry after an error. Being -1 a special case
where the filesystem will retry indefinitely.

Add both a maximum retry count and a retry timeout so that we can
bound by time and/or physical IO attempts.

Finally, plumb these into xfs_buf_iodone error processing so that
the error behaviour follows the selected configuration.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-05-18 11:08:15 +10:00
Carlos Maiolino
ef6a50fbb1 xfs: introduce table-based init for error behaviors
Before we start expanding the number of error classes and errors we
can configure behaviour for, we need a simple and clear way to
define the default behaviour that we initialized each mount with.
Introduce a table based method for keeping the initial configuration
in, and apply that to the existing initialization code.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-05-18 11:06:44 +10:00
Carlos Maiolino
df3093907c xfs: add configurable error support to metadata buffers
With the error configuration handle for async metadata write errors
in place, we can now add initial support to the IO error processing
in xfs_buf_iodone_error().

Add an infrastructure function to look up the configuration handle,
and rearrange the error handling to prepare the way for different
error handling conigurations to be used.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-05-18 11:05:33 +10:00